top | item 37069602

(no title)

Interesting timing on posting this to HN, I've recently been optimizing my WebGPU LSD radix sort. Today I measured it against the Thrust CUDA version, and it's about 10x slower (15ms to 1.5ms). My goal was to try to get 10 million elements in 1 ms, but now that I know 3 million in 1.5ms is impossible even for Thrust I know I won't be able to beat that.

discuss

gpuhacker|2 years ago

I haven't tried WebGPU yet, is there an overall performance hit compared to direct CUDA programming?

AFAIK Thrust is intended to simplify GPU programming. It could well be that for specific use cases, in particular when it is possible to fuse multiple operations into single kernels, you could outperform Thrust.

AndrewPGameDev|2 years ago

There is definitely at least a performance hit in that wgpu (and I think WebGPU too) only supports a single queue. That means you can't asynchronously run compute tasks while running render tasks.

Additionally Wgpu (the library) will insert fences between all passes that have a read-write dependency on a binding, even if there is technically no fence needed as 2 passes might not access the same indices.

Finally I know that there is an algorithm called decoupled look back that can speed up prefix sums, but it requires a forward-progress guarantee. All recent NVIDIA cards can run it but I don't think AMD can, so WebGPU can't in general. Raph Levien has a blog post on the subject https://raphlinus.github.io/gpu/2021/11/17/prefix-sum-portab...