top | item 36317797

bronxbomber92 | 2 years ago

I believe this post is referring to device-scoped memory barriers - also sometimes called fences - as opposed to execution barriers.

The former is a mechanism to ensure memory accesses follow a well-defined order (e.g. it would be bad if the memory accesses executed inside a critical section could be reordered before the lock call or after the unlock call).

The latter is a mechanism that ensures all threads (within some scope, perhaps all threads running on the "device") reach the same point in the program before any are allowed to proceed.

discuss

order

raphlinus | 2 years ago

That's correct, it's the memory scope that I expect to be device-scoped. GPUs tend not to have execution barriers in the shader language beyond workgroup scope; generally the next coarser granularity for synchronization is a separate dispatch. However, single-pass prefix sum algorithms, including decoupled look-back, can function just fine with device-scoped memory barriers, and do not require execution barriers with coarser scope than workgroup.

samus | 2 years ago

The post also mentions unspecified behavior (mixing atomic and non-atomic memory accesses), where everybody has to cross their fingers and hope that the hardware designers had the same idea about how it should work. That is almost fine with enough test coverage, but a shader translation layer adds uncomfortable complexity on top of it.