top | item 25263331


sprachspiel | 5 years ago

Even after watching these videos and reading lots of articles on the topic, I still find the full C++ memory model extremely hard to understand. However, on x86 there are actually only a couple of things that one needs to understand to write correct lock free code. This is laid out in a blog post: https://databasearchitects.blogspot.com/2020/10/c-concurrenc...


dragontamer | 5 years ago

C++'s "seq_cst" model is simple. If you're having any issues understanding anything at all, just stick with seq_cst.

If you want slightly better performance on some processors, you can dip down into acquire/release. This second memory model is faster because of the concept of half-barriers.

Let's say you have:

    a();
    b();
    acquire_barrier(); // Half barrier
    c();
    d();
    e();
    release_barrier(); // Half barrier
    f(); 
    g();
The compiler, CPU, and cache are ALLOWED to rearrange the code into the following:

    acquire_barrier(); // Optimizer moved a() and b() from outside the barrier to inside the barrier
    a();
    b();
    d();
    c();
    e();
    g();
    f();
    release_barrier(); // Optimizer moved g() and f() from outside the barrier to inside the barrier
You're allowed to move code "inside", towards the barrier, but you are not allowed to move code "outside" of the half-barrier region. Because more optimizations are available (to the compiler, the CPU, or the caches), half-barriers execute slightly faster than full sequential consistency.

----------

Now that we've talked about things in the abstract, let's think about "actual" code. Let's say we have:

    int i = 0; // a();
    i++; // b();

    full_barrier(); // seq_cst barrier

    i+=2; // c();
    i+=3; // d();
    i+=4; // e();

    full_barrier(); // seq_cst barrier

    i+=5; // f();
    i+=6; // g();
As the optimizer, you're only allowed to optimize to...

    int i = 1; // a() and b() rearranged to the same line
    full_barrier(); // Not allowed to optimize past this line
    i += 9; // c, d, and e rearranged
    full_barrier();
    i += 11; // f, g rearranged
Now let's do the same with half barriers:

    int i = 0; // a();
    i++; // b();

    acquire_barrier(); // acquire

    i+=2; // c();
    i+=3; // d();
    i+=4; // e();

    release_barrier(); // release

    i+=5; // f();
    i+=6; // g();
Because all code can be rearranged to the "inside" of the barrier, you can simply write:

    i = 21;
Therefore, half-barriers are faster.

----------

Now instead of the compiler rearranging code: imagine the L1 cache is rearranging writes to memory. With full barriers, the L1 cache has to write:

    i = 1;
    full_barrier(); // Ensure all other cores see that i is now = 1;

    i = 10; // L1 cache allows CPU to do +2, +3, and +4 operations, but L1 "merges them together" and other cores do NOT see the +2, +3, or +4 operations

    full_barrier(); // L1 cache communicates to other cores that i = 10 now;

    i = 21; // L1 cache allows CPU to do +5 and +6 operations

   // Without a barrier, L1 cache doesn't need to tell anyone that i is 21 now. No communication is guaranteed.
----------

Similarly, with half-barriers instead, the L1 cache's communication to other cores only has to be:

    i = 21; // L1 cache can "lazily" inform other cores, merging the i++, i+=2, ... i+=6 operations into one visible write.
So for CPUs that implement half-barriers (like ARM), the L1 cache can communicate ever so slightly more efficiently, if the programmer specifies these barriers.

----------

Finally, you have "relaxed" (weakly ordered) atomics, which involve no barriers at all. The operations are guaranteed to execute atomically, but their order relative to other memory operations is completely unspecified.

There's also consume barriers (memory_order_consume), which no one understands and no compiler implements as specified -- compilers just promote consume to acquire. So ignore those. :-) The committee is trying to make consume easier to understand in a future standard... and I don't think they've gotten all the "bugs" out of the consume specification yet.

-------

EDIT: Now that I think of it, acquire/release barriers are often baked into a load/store operation and are "relative" to a variable. So the above discussion is still somewhat inaccurate. Nonetheless, I think it's a simplified discussion that kinda explains why these barriers exist and why programmers were driven to make a "more efficient barrier" mechanic.

ndesaulniers | 5 years ago

To the edit: right. I like the description using half barriers, but I have trouble reconciling that with the Linux Kernel's READ_ONCE/WRITE_ONCE macros, which guarantee no tearing/alignment issues, but boil down to reads/writes through casts to volatile qualified pointer dereferences. I guess those don't have the same notion of memory ordering that the C++11 API has... Maybe rmb()/wmb()...