(no title)
suresk | 11 months ago
You may also want to look at other sorting algorithms - common CPU sorting algorithms are hard to maximize GPU hardware with - a network sort like bitonic sorting involves more work (and you have to pad to a power of 2) but often runs much faster on parallel hardware.
I had a fairly naive implementation that would sort 10M in around 10ms on an H100. I'm sure with more work they can get quite a bit faster, but they need to be fairly big to make up for the kernel launch overhead.
No comments yet.