
Performance of Atomics

Atomics, being lock-free, are generally very fast. Incrementing an Atomic counter has nearly the same performance as incrementing a typical integer when accessed from a single core. However, the performance becomes more complex when multiple CPU cores are involved, especially if they are accessing the same Atomic or if the Atomic resides in a shared cache line.
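The sketch below illustrates the single-core case: a plain long counter and an AtomicLong incremented in a tight loop on one thread. It is only an illustrative sketch, not a rigorous benchmark (a proper measurement would use a harness such as JMH to account for JIT warmup and dead-code elimination), and the iteration count is an arbitrary value chosen for the demo.

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal single-threaded comparison of a plain long counter and an AtomicLong.
// Illustrative only; timings here are rough and subject to JIT effects.
public class SingleThreadCounter
{
    public static void main(String[] args)
    {
        final long iterations = 100_000_000L;

        long plain = 0;
        long start = System.nanoTime();
        for (long i = 0; i < iterations; i++)
        {
            plain++;
        }
        long plainNanos = System.nanoTime() - start;

        AtomicLong atomic = new AtomicLong();
        start = System.nanoTime();
        for (long i = 0; i < iterations; i++)
        {
            atomic.incrementAndGet();
        }
        long atomicNanos = System.nanoTime() - start;

        System.out.println("plain:  " + plainNanos / 1_000_000 + " ms (count=" + plain + ")");
        System.out.println("atomic: " + atomicNanos / 1_000_000 + " ms (count=" + atomic.get() + ")");
    }
}
```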

When multiple CPU cores modify the same Atomic, each core must first gain access to the cache line storing the Atomic. This often results in L1 or L2 cache misses, with each core waiting for the cache line to be transferred from one of its sibling cores. A few cache misses here and there may not significantly affect performance, but when many threads are involved and all need the same cache line, a scenario can arise where CPU cores become stalled, repeatedly passing the same cache line back and forth.
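To see the contended case, the sketch below has several threads all incrementing the same AtomicLong and reports aggregate throughput as the thread count grows. The thread counts and iteration count are arbitrary demo values; on most multi-core machines, throughput per thread drops noticeably as more threads fight over the same cache line.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of cache-line contention: every thread increments the
// same AtomicLong, so the cache line holding it bounces between cores.
public class ContendedCounter
{
    public static void main(String[] args) throws InterruptedException
    {
        for (int threads : new int[] { 1, 2, 4, 8 })
        {
            AtomicLong shared = new AtomicLong();
            long perThread = 10_000_000L;

            Thread[] workers = new Thread[threads];
            long start = System.nanoTime();
            for (int t = 0; t < threads; t++)
            {
                workers[t] = new Thread(() -> {
                    for (long i = 0; i < perThread; i++)
                    {
                        shared.incrementAndGet();
                    }
                });
                workers[t].start();
            }
            for (Thread worker : workers)
            {
                worker.join();
            }
            long elapsedNanos = System.nanoTime() - start;

            double incrementsPerMs = (threads * perThread) / (elapsedNanos / 1_000_000.0);
            System.out.println(threads + " thread(s): ~" + (long) incrementsPerMs + " increments/ms total");
        }
    }
}
```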

We encountered this exact scenario recently while using the FileUtility in HPCC4j to debug an issue. Initially, read performance was normal, but once the number of reading threads passed a certain threshold, total throughput hit a hard limit and per-thread read performance plummeted.

This was initially interpreted as a network bandwidth limitation, but the OpenTelemetry tracing we've recently incorporated into HPCC4j gave us better insight into the issue. While we did observe a drop in the streaming rate across the network, consistent with a network bottleneck, that slowdown occurred after a drop in the record construction rate on the record construction threads! Given that we had plenty of CPU cores available, and that each record processing thread operates independently without sharing resources, a bottleneck that worsens with more threads was unexpected.

Upon further investigation, we found the issue in the FileUtility debug tool: a shared AtomicLong used to track record counts. By modifying the design to keep track of record counts per thread and only updating the shared AtomicLong after each data partition, we achieved a 6x throughput improvement in this scenario.
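The pattern behind the fix is sketched below: each worker accumulates a local count while processing its partition and publishes to the shared AtomicLong only once, when the partition is finished. The names used here (PartitionWorker, processRecord, and so on) are hypothetical and only illustrate the pattern, not the actual HPCC4j code; see the PR for the real change.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the batching pattern: the shared AtomicLong is
// touched once per partition instead of once per record, so the contended
// cache line is rarely transferred between cores.
public class PartitionWorker implements Runnable
{
    private final AtomicLong sharedRecordCount;
    private final Iterable<Object> partition;

    public PartitionWorker(AtomicLong sharedRecordCount, Iterable<Object> partition)
    {
        this.sharedRecordCount = sharedRecordCount;
        this.partition = partition;
    }

    @Override
    public void run()
    {
        long localCount = 0;
        for (Object record : partition)
        {
            processRecord(record);
            localCount++;                             // purely thread-local, no cross-core traffic
        }
        sharedRecordCount.addAndGet(localCount);      // single contended update per partition
    }

    private void processRecord(Object record)
    {
        // Placeholder for real record construction / processing work.
    }
}
```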

For more details, see the PR here: https://github.com/hpcc-systems/hpcc4j/pull/756