top | item 45354051

(no title)

trws | 5 months ago

I liked the article. I saw your PS that we added it to the working draft for c++26, we also made it part of OpenMP as of 5.0 I think. It’s sometimes a hardware atomic like on arm, but what made the case was that it’s common to implement it sub-optimally even on x86 or LL-SC architectures. Often the generic cas loop gets used, like in your lambda example, but it lacks an early cutout since you can ignore any input value that’s on the wrong side of the op by doing a cheap atomic read or just cutting out of the loop after the first failed CAS if the read back shows it can’t matter. Also can benefit from using slightly different memory orders than the default on architectures like ppc64. It’s a surprisingly useful op to support that way.

If this kind of thing floats your boat, you might be interested in the non-reading variants of these as well. Mostly for things like add, max, etc but some recent architectures actually offer alternate operations to skip the read-back. The paper calls them “atomic reduction operations” https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p31...

discuss

anematode|5 months ago

Curious: even with hardware atomics, wouldn't it be a good idea to first perform a non-atomic load to check for whether the store might be necessary (which would require the cache line to be locked), then only run the atomic max if it might change the value?

adwn|5 months ago

Yes, this can make sense if

- the value is often doesn't require an update, and

- there's contention on the cache line, i.e., at least two cores frequently read or write that cache line.

But there are important details to consider:

1) The probing load must be atomic. Both the compiler and the processor in general are allowed to split non-atomic loads into two or more partial loads. Only atomic loads – even with relaxed ordering – are guaranteed to not return intermediate or mixed values from other atomic stores.

2) If the ordering on the read part of the atomic read-modify-write operation is not relaxed, the probing load must reflect this. For example, an acq-rel RMW op would require an acquire ordering on the probing read.

adgjlsfhk1|5 months ago

This depends heavily on what concurrency optimizations your processor implements (and unfortunately this is the sort of thing that doesn't get doccumented and is somewhat hard to test).

SkiFire13|5 months ago

> but it lacks an early cutout since you can ignore any input value that’s on the wrong side of the op by doing a cheap atomic read or just cutting out of the loop after the first failed CAS if the read back shows it can’t matter.

I believe this is a bit trickier than that, you would also need at least some kind of atomic barrier to preserve the ordering semantics of the successful update case.