Interesting timing on posting this to HN; I've recently been optimizing my WebGPU LSD radix sort. Today I measured it against the Thrust CUDA version, and it's about 10x slower (15 ms vs. 1.5 ms). My goal was to get 10 million elements in 1 ms, but now that I know 3 million in 1.5 ms is impossible even for Thrust, I know I won't be able to beat that.
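For anyone unfamiliar with the structure being discussed, here is a minimal CPU sketch of an LSD radix sort in Python. It is not the WebGPU or Thrust implementation, just the textbook three steps per digit (histogram, exclusive prefix sum, stable scatter) that a GPU version parallelizes. The function name and the `bits_per_pass`/`key_bits` parameters are made up for illustration, and it assumes non-negative integer keys below `2**key_bits`.

```python
def lsd_radix_sort(keys, bits_per_pass=8, key_bits=32):
    """Sort non-negative integers below 2**key_bits, least significant digit first."""
    radix = 1 << bits_per_pass
    mask = radix - 1
    src = list(keys)
    for shift in range(0, key_bits, bits_per_pass):
        # 1. Histogram: count how many keys fall into each digit bucket.
        counts = [0] * radix
        for k in src:
            counts[(k >> shift) & mask] += 1

        # 2. Exclusive prefix sum: first output slot of each bucket.
        offsets = [0] * radix
        running = 0
        for d in range(radix):
            offsets[d] = running
            running += counts[d]

        # 3. Stable scatter: keys keep their relative order within a bucket,
        #    which is what makes the least-significant-first passes correct.
        dst = [0] * len(src)
        for k in src:
            d = (k >> shift) & mask
            dst[offsets[d]] = k
            offsets[d] += 1
        src = dst
    return src
```

With the defaults this makes four passes over 32-bit keys. On a GPU, step 2 is the prefix sum, which is why scan performance matters so much for the sort as a whole.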
gpuhacker|2 years ago
AFAIK Thrust is intended to simplify GPU programming. It could well be that for specific use cases, in particular when it is possible to fuse multiple operations into single kernels, you could outperform Thrust.
AndrewPGameDev|2 years ago
Additionally, wgpu (the library) inserts fences between all passes that have a read-write dependency on a binding, even when no fence is technically needed because the two passes might not access the same indices.
Finally, I know there is an algorithm called decoupled look-back that can speed up prefix sums, but it requires a forward-progress guarantee. All recent NVIDIA cards can run it, but I don't think AMD can, so WebGPU can't in general. Raph Levien has a blog post on the subject: https://raphlinus.github.io/gpu/2021/11/17/prefix-sum-portab...
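To make the look-back idea concrete, here is a rough sequential simulation in Python. It is only a sketch of the bookkeeping, not GPU code: each tile publishes a status flag (`X` = nothing yet, `A` = local aggregate available, `P` = full inclusive prefix available), and a tile computes its exclusive prefix by walking backwards over its predecessors, summing `A` aggregates until it reaches a `P`. The function name, `tile_size`, and `resolve_order` are hypothetical. Here every tile publishes its aggregate before any look-back starts, so the walk never has to wait; on a real GPU the walk would spin on an `X` flag, and that spin is where the forward-progress guarantee comes in.

```python
INVALID, AGGREGATE, PREFIX = "X", "A", "P"

def lookback_exclusive_scan(data, tile_size=4, resolve_order=None):
    """Exclusive prefix sum via a simulated per-tile look-back (sequential sketch)."""
    n_tiles = (len(data) + tile_size - 1) // tile_size
    tiles = [data[t * tile_size:(t + 1) * tile_size] for t in range(n_tiles)]
    flag = [INVALID] * n_tiles
    aggregate = [0] * n_tiles   # sum of the tile's own elements
    inclusive = [0] * n_tiles   # sum of everything up to and including the tile

    # Phase 1 (simulation shortcut): every tile publishes its local aggregate.
    for t in range(n_tiles):
        aggregate[t] = sum(tiles[t])
        flag[t] = AGGREGATE

    # Phase 2: tiles resolve their prefix in any order by looking back.
    order = range(n_tiles) if resolve_order is None else resolve_order
    out = [0] * len(data)
    for t in order:
        exclusive = 0
        j = t - 1
        while j >= 0:
            # A GPU thread would spin here while flag[j] is still INVALID.
            assert flag[j] != INVALID
            if flag[j] == PREFIX:
                exclusive += inclusive[j]   # predecessor already knows its prefix
                break
            exclusive += aggregate[j]       # only the local sum is known; keep walking
            j -= 1
        inclusive[t] = exclusive + aggregate[t]
        flag[t] = PREFIX

        # Local exclusive scan of the tile, offset by the looked-back prefix.
        running = exclusive
        for i, v in enumerate(tiles[t]):
            out[t * tile_size + i] = running
            running += v
    return out
```

Passing a reversed `resolve_order` forces later tiles to walk all the way back over `A` flags, while the default order lets each tile stop at its immediate predecessor's `P`. Both give the same result.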