top | item 43338830

(no title)

suresk | 11 months ago

Kind of a fun toy problem to play around with. I noticed you had thread coarsening as an option to play around with - there is often some gain to be had here. I think this is also a fun thing to play around with Nsight on - things that are impacting your performance aren't always obvious and it is a pretty good profiler - might be worth playing around with. (I wrote about a fun thing I found with thread coarsening and automatic loop unrolling with Nsight here: https://www.spenceruresk.com/loop-unrolling-gone-bad-e81f66f...)

You may also want to look at other sorting algorithms - common CPU sorting algorithms are hard to maximize GPU hardware with - a network sort like bitonic sorting involves more work (and you have to pad to a power of 2) but often runs much faster on parallel hardware.

I had a fairly naive implementation that would sort 10M in around 10ms on an H100. I'm sure with more work they can get quite a bit faster, but they need to be fairly big to make up for the kernel launch overhead.

discuss

No comments yet.