While sequentially consistent semantics are efficient to implement on CPUs, that seems to be much less true on GPUs. Thus, Vulkan eliminates sequential consistency entirely and provides only acquire/release semantics[1].
It is extremely difficult to reason about programs using these advanced memory semantics. For example, there is a discussion about whether a spinlock implemented in terms of acquire and release can be reordered in a way that introduces deadlock (see the reddit discussion linked from [2]). I was curious enough about this that I tried to model it in CDSChecker, but did not get definitive results (the deadlock checker in that tool is enabled for mutexes provided by the API, but not for mutexes built out of primitives). I'll also note that AcqRel semantics are not provided by the Rust version of compare_exchange_weak (perhaps a nit on TFA's assertion that Rust adopts the C++ memory model wholesale), so if acquire to lock the spinlock is not adequate, it would likely need to go to SeqCst.
Thus, I find myself quite unsure whether this kind of spinlock would work on Vulkan or would be prone to deadlock. It's also possible it could be fixed by putting a release barrier before the lock loop.
We have some serious experts on HN, so hopefully someone who knows the answer can enlighten us - mixed in of course with all the confidently wrong assertions that inevitably pop up in discussions about memory model semantics.
Also: it remains difficult to fully nail down the semantics of sequential consistency, especially when it is mixed with other memory semantics. Very likely, the next time Russ updates his article, he should add a reference to Repairing Sequential Consistency in C/C++11[1].
Thanks for the GPU insights and links (and the paper link below)!
I based my claim about Rust from https://doc.rust-lang.org/nomicon/atomics.html. ("Rust pretty blatantly just inherits the memory model for atomics from C++20.") Perhaps that is out of date?
Taking a lock only needs to be an acquire operation plus a compiler barrier for other lock operations. Using seq_cst or acq_rel semantics is stronger than needed. From my reading and discussions with people from WG21, the current argument for why taking a lock only requires acquire semantics is that a compiler optimization that transforms a non-deadlocking program into a potentially deadlocking program is not allowed. There's an interesting Twitter thread where we discuss this, but I can't find it anymore :(.
> I'll also note that AcqRel semantics are not provided by the Rust version of compare_exchange_weak (perhaps a nit on TFA's assertion that Rust adopts the C++ memory model wholesale), so if acquire to lock the spinlock is not adequate, it would likely need to go to SeqCst.
Is this true? AcqRel seems to be accepted by the compiler for the success ordering of compare_exchange_weak.
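For what it's worth, a quick sketch (mine, not from the thread) compiles on current stable Rust with AcqRel as the success ordering, which suggests the answer is yes; only the failure ordering is restricted (it cannot be Release or AcqRel). A minimal spinlock to demonstrate:

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

// Minimal spinlock sketch. The compiler accepts AcqRel as the *success*
// ordering of compare_exchange_weak; the *failure* ordering may not be
// Release or AcqRel, so Relaxed is used there.
pub struct SpinLock {
    locked: AtomicBool,
}

impl SpinLock {
    pub fn new() -> Self {
        SpinLock { locked: AtomicBool::new(false) }
    }

    pub fn lock(&self) {
        while self
            .locked
            .compare_exchange_weak(false, true, Ordering::AcqRel, Ordering::Relaxed)
            .is_err()
        {
            std::hint::spin_loop();
        }
    }

    pub fn unlock(&self) {
        self.locked.store(false, Ordering::Release);
    }
}

fn main() {
    // Two threads bump a shared counter under the lock.
    let lock = Arc::new(SpinLock::new());
    let counter = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..2)
        .map(|_| {
            let (l, c) = (Arc::clone(&lock), Arc::clone(&counter));
            thread::spawn(move || {
                for _ in 0..1000 {
                    l.lock();
                    c.fetch_add(1, Ordering::Relaxed);
                    l.unlock();
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(counter.load(Ordering::Relaxed), 2000);
}
```

Whether AcqRel on the CAS is actually needed here, rather than plain Acquire, is exactly the question debated above.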
GPU spinlocks are a bad idea unless the spinlock is applied over the entire thread-group.
Even then, I'm pretty sure the spinlock is a bad idea, because you should probably be using the GPU as a coprocessor and enforcing "orderings" over CUDA streams or OpenCL task graphs. The kernel-spawn and kernel-end mechanism provides the synchronization functionality ("happens-before") when you need it.
---------
From there on out: the GPU-low level synchronization of choice is the thread-barrier (which can extend out beyond a wavefront, but only up to a block).
--------
So that'd be my advice: use a thread-barrier at the lowest level for thread blocks (synchronization between 1024 threads and below). And use kernel-start / kernel-end graphs (aka: CUDA Stream and/or OpenCL Task Graphs) for synchronizing groups of more than 1024 threads together.
Otherwise, I've done some experiments with acquire/release and basic lock/unlock mechanisms. They seem to work as expected. You get deadlocks immediately on older hardware because of the implicit SIMD execution (so you want only thread #0, or the first active thread, to take the lock on behalf of the whole wavefront / thread block). You'll still want to use thread-barriers for higher-performance synchronization.
Frankly, I'm not exactly sure why you'd want to use a spinlock, since thread-barriers are simply higher performance in the GPU world.
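For readers without GPU hardware handy, the shape of the barrier pattern can be sketched on the CPU with Rust's std::sync::Barrier; this is only an analogy for the GPU thread-barrier described above, not GPU code:

```rust
use std::sync::{Arc, Barrier};
use std::thread;

// n threads each compute a "phase 1" value, rendezvous at the barrier,
// then run "phase 2". The barrier gives every phase-2 step a
// happens-before edge over every phase-1 step, which is the same
// guarantee a GPU thread-barrier gives across a thread block.
pub fn two_phase_sum(n: usize) -> usize {
    let barrier = Arc::new(Barrier::new(n));
    let handles: Vec<_> = (0..n)
        .map(|i| {
            let b = Arc::clone(&barrier);
            thread::spawn(move || {
                let phase1 = i * 2; // per-thread phase-1 result
                b.wait(); // nobody starts phase 2 until all finish phase 1
                phase1 + 1 // phase 2 can rely on all phase-1 work being done
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    // threads 0..4 produce 1, 3, 5, 7
    assert_eq!(two_phase_sum(4), 16);
}
```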
Another somewhat recently posted (but years-old) page with different but related content is 'Memory Models that Underlie Programming Languages': http://canonical.org/~kragen/memory-models/
A lot of the complexity comes from the lack of expressivity in languages for relating variables (or data structure fields) semantically to each other. If there were a way to tell the compiler "these variables are always accessed in tandem", the compiler could be smart about ordering and memory fences.
The idea to extend programming languages and type systems in that direction is not new: folk who've been using distributed computing for computations have to think about this already, and could teach a few things to folk who use shared memory multi-processors.
Even with that expressivity, someone who incorrectly relates or forgets to relate two variables could experience the same issues. It's still important to address what happens when the program has data races or when it is data-race-free but the memory model permits overreaching optimizations. The language and implementation should strive to make a program approximately correct.
All variables inside an object (aka any class) are assumed to be related to each other. In Java, synchronized(foobar_object){ baz(); } ensures that all uses of foobar_object inside the synchronized block are sequential (and therefore correct).
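A rough Rust analogue of this grouping (illustrative only; Account and deposit are made-up names): putting the related fields under one Mutex makes "accessed in tandem" a property the type system enforces:

```rust
use std::sync::Mutex;

// Hypothetical example: fields that must stay consistent with each other
// live under one Mutex, so they can only be accessed together, inside a
// single critical section (like synchronized(foobar_object) in Java).
pub struct Account {
    pub balance: i64,
    pub tx_count: u64,
}

pub fn deposit(acct: &Mutex<Account>, amount: i64) {
    let mut a = acct.lock().unwrap();
    a.balance += amount;
    a.tx_count += 1; // updated in tandem with balance, never observed apart
}

fn main() {
    let acct = Mutex::new(Account { balance: 0, tx_count: 0 });
    deposit(&acct, 10);
    deposit(&acct, 5);
    let a = acct.lock().unwrap();
    assert_eq!((a.balance, a.tx_count), (15, 2));
}
```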
--------
The issue is that some people (a minority) are interested in "beating locks" and making something even more efficient.
Fascinating article. I've been doing research in this area, and I wonder if there was any exploration of JinjaThreads, which operates on Jinja (a Java-like language) and gives a formal proof of the DRF guarantee (coincidentally using Isabelle/HOL).
I'm wondering: is the fact that a CS PhD finds resources like this as much amusing as educational/pedagogical gold telling us something about the Academia, the Culture, or the Self?
AKA why can't I stumble upon such stuff more often. Thanks OP!
If thread 2 copies done into a register before thread 1 executes, it may keep using that register for the entire loop, never noticing that thread 1 later modifies done.
Alternative solution: forget all the "atomic" semantics and simply avoid "optimization" of global variables. Access to any global variable should always go directly to memory. Sure, this will be less than optimal in some cases, but such is the price of using globals. Their use should be discouraged anyway.
In other words, make "atomic" the sensible and logical default for globals. Assignment is an "atomic" operation; just don't circumvent it by using a local copy as an "optimization".
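For reference, this is the behavior the article's done-flag example needs, sketched in Rust with today's atomics (my rendering, not the article's code): the atomic type is what forces a fresh, ordered load on every iteration.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

// With a plain bool, the compiler may legally keep `done` in a register
// and spin forever. Marking the flag atomic forces a real, ordered load
// on every iteration of the waiting loop.
pub fn run_done_flag() -> bool {
    let done = Arc::new(AtomicBool::new(false));
    let done2 = Arc::clone(&done);

    let waiter = thread::spawn(move || {
        while !done2.load(Ordering::Acquire) {
            std::hint::spin_loop(); // fresh load each time around
        }
    });

    done.store(true, Ordering::Release); // pairs with the Acquire loads
    waiter.join().unwrap();
    done.load(Ordering::Relaxed)
}

fn main() {
    assert!(run_done_flag());
}
```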
This problem isn't specific to global variables; it happens with all shared mutable state. I would assume that the author only used global variables because that lets them keep the working examples as short as possible, and minimize irrelevant details.
And yes, you can put a full memory fence around every access to a variable that is shared across threads. But doing so would just destroy the performance of your program. Compared to using a register, accessing main memory typically takes something on the order of 100 times as long. Given that we're talking about concerns that are specific to a relatively low-level approach to parallelism, I think it's safe to assume that performance is the whole point, so that would be an unacceptable tradeoff.
Removing optimizations on "global" variables will leave the bug in Singleton objects (which are very similar to global variables, but the compiler doesn't know that they're global)
---------
"Volatile" is close but not good enough semantically to describe what we want. That's why these new Atomic-variables are being declared with seqcst semantics (be it in Java, C++, C, or whatever you program in).
That's the thing: we need a new class of variables that wasn't known 20 years ago. Variables that follow the sequential-consistency requirements, for this use case.
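A sketch of why this class of variable matters: the classic store-buffering litmus test, written in Rust. With SeqCst orderings, the outcome where both threads read false is forbidden; with only acquire/release (or plain variables) it would be allowed:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

// Store-buffering litmus test: each thread stores to its own flag and
// then loads the other's. SeqCst puts all four operations into one
// total order, so at least one load must see the other thread's store:
// the result (false, false) is impossible. With acquire/release alone
// that outcome is permitted.
pub fn store_buffering() -> (bool, bool) {
    let x = Arc::new(AtomicBool::new(false));
    let y = Arc::new(AtomicBool::new(false));
    let (x1, y1) = (Arc::clone(&x), Arc::clone(&y));
    let (x2, y2) = (Arc::clone(&x), Arc::clone(&y));

    let t1 = thread::spawn(move || {
        x1.store(true, Ordering::SeqCst);
        y1.load(Ordering::SeqCst)
    });
    let t2 = thread::spawn(move || {
        y2.store(true, Ordering::SeqCst);
        x2.load(Ordering::SeqCst)
    });
    (t1.join().unwrap(), t2.join().unwrap())
}

fn main() {
    for _ in 0..1000 {
        let (a, b) = store_buffering();
        assert!(a || b, "SeqCst forbids both loads observing false");
    }
}
```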
---------
Note: on ARM systems, even if the compiler doesn't mess up, the L1 cache could mess you up. ARM has multiple load/store semantics available. If a load has relaxed (default) semantics, it may return a "stale" value.
That is to say: your loop may load a value into L1 cache, and then your core will read the variable over and over from L1 cache (not realizing that L3 cache has been updated with a new value). Not only does your compiler need to know not to keep the value in a register; the memory subsystem also needs to know to read the data from L3 cache over and over again (never trusting the L1 copy).
Rearranging loads/stores in this manner is not allowed on x86, but ARM is more "relaxed" than x86. If reading the freshest value is important, you need the appropriate memory barriers on ARM (or PowerPC).
> Access to any global variable should always occur direct from memory.
What if your function takes a pointer that might be pointing to a global variable? Does that mean that all accesses through a pointer are now exempt from optimization unless the compiler can prove that the pointer will never point to a global variable?
These "memory models" are too complex for languages intended for dilettante developers. It was a disaster in Java/C#. Not even more than a handful of programmers in existence know in depth how it works, as in, can they understand any given trivial program in their language. At best they only know some vague stuff like that locking prevents any non visibility issues. It goes far deeper than that though (which is also the fault of complex language designs like Java and C#).
The common programmer does not understand that you've just transformed their program - for which they were taught merely that multiple threads need synchronization - into a new game with an entire separate specification, where every shared variable obeys a set of abstruse rules revolving around the happens-before relationship. Locks, mutexes, and atomic variables are all one thing; fences are a completely different thing. At least in the way most people intuit programs to work.
Go tries to appeal to programmers as consumers (that is, when given a choice between cleaner design and pleasing the user who just wants to "get stuff done", they choose the latter), yet it also adds in traditional complexities like this. Yes, there is a performance trade-off to having shared memory behave intuitively, but that's much better than bugs that 99% of your CHOSEN userbase do not know how to avoid.
Also remember Go has lots of weird edge cases, like how sharing a slice across threads can lead to memory corruption (in the C / assembly sense, not merely within that array) despite the rest of the language being memory-safe. Multiply that by the "memory model".
> These "memory models" are too complex for languages intended for dilettante developers.
It would be nice if sometime we stopped pretending that beginners are too slow to know/understand things and instead faced the fact that their instructors and mentors are bad at teaching.
If "not even more than a handful of programmers" understand something, then it's objectively wrong to refer to all programmers - approximately all programmers - as "dilletantes".
Also, maybe you are different, but I can only keep so much in my head at a time. If I can keep something simple or abstract it away so I can focus on other details, that doesn't make me a dilettante. It makes me more effective at what I'm actually trying to do.
All expert developers were dilettantes at some point. The only way to become an expert in some specific area is to study and practice it. It might make you a better developer even if you don't end up using it in anger often (or at all).
> Go has lots of weird edge cases, like sharing a slice across threads can lead to memory corruption (in the C / assembly sense, not merely within that array)
In 100 years, the main languages used will still be C on the client (with a C++ compiler) and Java on the server.
Go has no VM but it has a GC.
WASM has a VM but no GC.
Everything has been tried, and Java still kicks everything's ass to the moon on the server.
Fragmentation is bad; let's stop using bad languages and focus on the products we build instead.
"While I'm on the topic of concurrency I should mention my far too brief chat with Doug Lea. He commented that multi-threaded Java these days far outperforms C, due to the memory management and a garbage collector. If I recall correctly he said "only 12 times faster than C means you haven't started optimizing"." - Martin Fowler https://martinfowler.com/bliki/OOPSLA2005.html
"Many lock-free structures offer atomic-free read paths, notably concurrent containers in garbage collected languages, such as ConcurrentHashMap in Java. Languages without garbage collection have fewer straightforward options, mostly because safe memory reclamation is a hard problem..." - Travis Downs https://travisdowns.github.io/blog/2020/07/06/concurrency-co...
I'm sorry, is this comment from 1998? I've been working in software for over a decade, and I haven't seen server work being done in Java in ages.
From my perspective, Go in the context of serverless programming seems to currently be the best choice for server-side programming.
In the next 20 years I expect Go will be supplanted by a language which is a lot like Go (automatic memory management; simple, easy to learn and write, and performant enough) but with the addition of algebraic data types, named parameters, and a slightly higher level of abstraction.
Pretty sure Java is/was popular (and has enormous momentum) because "it just works" at a time when the Internet is taking off, not because it is some linguistic or technological marvel. It will definitely stick around, just like COBOL and C and Go.
[1]: https://www.khronos.org/blog/comparing-the-vulkan-spir-v-mem...
[2]: https://rigtorp.se/spinlock/
[1]: https://plv.mpi-sws.org/scfix/full.pdf
A few previous HN discussions of the 'Memory Models that Underlie Programming Languages' page:
https://news.ycombinator.com/item?id=17099608
https://news.ycombinator.com/item?id=27455509
https://news.ycombinator.com/item?id=13293290
[+] [-] electricshampo1|4 years ago|reply
This is not true for Java; see
http://gee.cs.oswego.edu/dl/html/j9mm.html
https://docs.oracle.com/en/java/javase/16/docs/api/java.base...
[+] [-] dragontamer|4 years ago|reply
If you want to test out weaker acquire/release semantics, you need to buy an ARM or POWER9 processor.
[+] [-] knz42|4 years ago|reply
Here's an idea for ISA primitives that could help a language group variables together: bind/propagate operators on (combinations of) address ranges. https://pure.uva.nl/ws/files/1813114/109501_19.pdf
[+] [-] mahmoudimus|4 years ago|reply
You can read more about this here if you're interested: https://www.isa-afp.org/entries/JinjaThreads.html