
bronxbomber92 | 5 years ago

On 1, you are right that most HW most of the time will make the requisite forward progress after the atomic bump. But Vulkan does not guarantee it and most HW doesn't either (you're just getting lucky with a high probability).

The property you're describing is discussed in https://johnwickerson.github.io/papers/forwardprogress_concu..., page 1:10:

"After probing our zoo of GPUs with litmus tests, we determined that all GPUs had the following property: once a work-group begins executing a kernel (i.e. the work-group becomes occupant on a hardware resource), it will continue executing until it reaches the end of the kernel. We call this execution model the occupancy bound execution model, because the number of work-groups for which relative forward progress is guaranteed is bound by the hardware resources available for executing work-groups; i.e. the hardware resources determine how many work-groups can be occupant."

However, I know even that property is not true of all GPUs and thus cannot be assumed by Vulkan spec compliant code. (I actually doubt it's true of any GPUs pre-Volta - it's just that the progress needed is often a natural side-effect of switching between threads for latency hiding).
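The occupancy bound execution model is easy to get a feel for with a toy scheduler. Here's a sketch in Python (a simulation, not GPU code - all names are made up for illustration): workgroups are generators, only "occupant" groups ever advance, and an occupant group is never preempted. A waiter launched ahead of its producer completes when both fit on the hardware, and livelocks when only one slot is available:

```python
# Toy model of the "occupancy bound execution model" quoted above.
# Not GPU code: workgroups are Python generators, and the scheduler only
# runs workgroups that are "occupant" (admitted), never preempting them.

def waiter(mem):
    # Spins until the producer workgroup publishes its flag.
    while not mem["flag"]:
        yield "spin"

def producer(mem):
    mem["flag"] = True
    yield "set"

def run(occupancy, max_steps=1000):
    mem = {"flag": False}
    pending = [waiter(mem), producer(mem)]   # launch order: waiter first
    occupant = []
    for _ in range(max_steps):
        while pending and len(occupant) < occupancy:
            occupant.append(pending.pop(0))  # admit in launch order
        for wg in list(occupant):
            try:
                next(wg)                     # occupant groups always progress
            except StopIteration:
                occupant.remove(wg)          # completion frees a slot
        if not occupant and not pending:
            return "completed"
    return "livelock"

print(run(occupancy=2))  # both groups occupant -> completed
print(run(occupancy=1))  # waiter holds the only slot forever -> livelock
```

Under this model, relative forward progress is only guaranteed among the groups that fit simultaneously, which is exactly why the decoupled look-back pattern can hang when the partition being waited on was never scheduled.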

Volta does guarantee this property + more. All starvation-free algorithms are supported by the architecture. See https://devblogs.nvidia.com/inside-volta/. In particular, this line:

"Independent thread scheduling in Volta ensures that even if a thread T0 currently holds the lock for node A, another thread T1 in the same warp can successfully wait for the lock to become available without impeding the progress of thread T0."

Independent thread scheduling is often thought of as providing a deadlock-freedom guarantee between threads in the same warp, but the same guarantees must also extend to coarser execution scopes (e.g. a workgroup waiting on another workgroup).
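The difference can be sketched with the same kind of toy model (again Python, not CUDA; the scheduling policies are caricatures for illustration). "Lockstep" runs one divergent path to completion before the other - so a spin loop scheduled first starves the lock holder - while "independent" interleaves per-thread program counters:

```python
# Toy contrast of pre-Volta lockstep divergence vs. Volta's independent
# thread scheduling. Threads are generators; this is a simulation, not
# real SIMT hardware behavior.

def t0(mem):
    yield "work"          # t0 holds the lock, releases after some work
    mem["lock"] = False

def t1(mem):
    while mem["lock"]:    # t1 spins until t0 releases
        yield "spin"

def run(policy, max_steps=100):
    mem = {"lock": True}
    threads = [t1(mem), t0(mem)]  # the spinning path happens to go first
    if policy == "lockstep":
        # One divergent path runs with the other masked off until it
        # finishes; t1 spins forever, so t0 never executes.
        for th in threads:
            for _ in range(max_steps):
                try:
                    next(th)
                except StopIteration:
                    break
            else:
                return "deadlock"
        return "completed"
    # "independent": each thread has its own PC; the scheduler interleaves.
    live = list(threads)
    for _ in range(max_steps):
        for th in list(live):
            try:
                next(th)
            except StopIteration:
                live.remove(th)
        if not live:
            return "completed"
    return "deadlock"

print(run("lockstep"))     # deadlock
print(run("independent"))  # completed
```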

So, as far as Vulkan is concerned, I do not consider this a defect in the spec. I have seen this exact algorithm deadlock in production despite never having deadlocked at the developer's desk.

On 3 - sounds like another good blog post topic. It'd be interesting to hear about your experience trying (and being unable) to write shaders that are subgroup agnostic but use subgroup functionality. I should revisit Brian's blog post on the matrix transpose as well; maybe some of that experience is documented there.


fluffything | 5 years ago

> However, I know even that property is not true of all GPUs and thus cannot be assumed by Vulkan spec compliant code. (I actually doubt it's true of any GPU's pre-Volta - it's just that the progress needed is often a natural side-effect of switching between threads for latency hiding).

To be more precise: pre-Volta GPUs (like Pascal, Kepler, etc.) guarantee that once a thread block starts running on an SM it will run there to completion, i.e., its resources won't be freed until the kernel completes.

Warps of threads within a thread block are not guaranteed to run in any particular order, and there are no guarantees that, e.g., if warp A spins on a lock held by warp B, warp B will ever run and make progress; therefore the thread block might never run to completion.

Volta and later architectures guarantee forward progress in this case. That is, if warp A spins on a lock held by warp B, Volta and later guarantee that warp B will make progress at some point, allowing it to release the lock so that warp A can make progress as well.

fluffything | 5 years ago

> Warps of threads within a thread block are not guaranteed to run in any particular order, and there are no guarantees that, e.g., if warp A spins on a lock held by warp B, warp B will ever run and make progress; therefore the thread block might never run to completion.

FYI, this is wrong: pre-Volta guarantees that all warps will run to completion; what it doesn't guarantee is the same for threads within a warp. Volta and later do guarantee that.

raphlinus | 5 years ago

Super, thanks for the info. I should update the blog post with it.

That paper calls for standardizing forward progress guarantees. Given that you mention there are GPUs that don't even meet the occupancy bound as defined in the paper, that work might not go smoothly.

It occurs to me that for this particular algorithm there might be a chance to rescue it. Instead of simply spinning, a waiting workgroup might make a small amount of progress recomputing the aggregate for the partition it's waiting on. After a finite number of spins (the partition size divided by this grain of progress), it would have the aggregate for the partition so would be able to move on to the next partition. Thus the correctness concern becomes a performance concern, where a very high probability of the spin yielding early is likely "good enough."
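As a sketch of that rescue (a sequential Python simulation, not shader code - PARTITION_SIZE, GRAIN, and wait_for_aggregate are made-up names): each spin iteration re-sums one grain of the predecessor's partition, so after PARTITION_SIZE // GRAIN spins the waiter has the full aggregate whether or not the predecessor ever publishes it:

```python
# Bounded-wait fallback for a look-back-style scan: instead of spinning
# idly on the predecessor's published aggregate, recompute one grain of
# its partition per spin. The wait is bounded by PARTITION_SIZE // GRAIN.
# Illustrative sequential sketch only.

PARTITION_SIZE = 8
GRAIN = 2

def wait_for_aggregate(published, data, part):
    """Return the aggregate of partition `part` without unbounded spinning."""
    start = part * PARTITION_SIZE
    recomputed = 0
    for spin in range(PARTITION_SIZE // GRAIN):
        if published[part] is not None:          # fast path: it was published
            return published[part]
        lo = start + spin * GRAIN
        recomputed += sum(data[lo:lo + GRAIN])   # one grain of fallback work
    return recomputed                            # full partition re-summed

data = list(range(16))        # two partitions of 8 elements
published = [None, None]      # worst case: predecessor never publishes
agg0 = wait_for_aggregate(published, data, 0)
print(agg0, sum(data[:8]))    # 28 28 - matches the direct sum
```

This is exactly the correctness-to-performance trade described: the common case still takes the cheap published path, and the fallback only costs redundant work, at the price of the asymptotic bound.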

I'll not make any promises about another blog - I worry that subgroup size tuning is too much in the weeds other than for a very specialized audience. But I certainly do hope to blog more about piet-gpu and will see if I can touch on the topic then.

bronxbomber92 | 5 years ago

> It occurs to me that for this particular algorithm there might be a chance to rescue it. Instead of simply spinning, a waiting workgroup might make a small amount of progress recomputing the aggregate for the partition it's waiting on. After a finite number of spins (the partition size divided by this grain of progress), it would have the aggregate for the partition so would be able to move on to the next partition. Thus the correctness concern becomes a performance concern, where a very high probability of the spin yielding early is likely "good enough."

Yes, I've seen this workaround perform well in practice, sorry for not mentioning it >.<. It does blow up any asymptotic efficiency guarantees though, which may or may not be acceptable for the application at hand.

> I'll not make any promises about another blog - I worry that subgroup size tuning is too much in the weeds other than for a very specialized audience. But I certainly do hope to blog more about piet-gpu and will see if I can touch on the topic then.

Looking forward to any future piet-gpu posts you may write :-).