Cool, this definitely seems like a good enumeration of techniques; nice to see that they discuss things like kernel fission as well. Having a good understanding of loop-nest optimization transformations (tiling, fission, fusion, strip mining/sinking, iteration-order changes, etc.) provides a good vocabulary for talking about this stuff too.
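To make one of those transformations concrete, here's loop tiling sketched in plain Python (a CPU-side sketch with made-up function names and sizes; on a GPU the same restructuring shows up in how you map tiles to thread blocks):

```python
def sum_transposed_naive(a, n):
    # Row-major traversal of a column-major access pattern:
    # consecutive iterations stride n elements apart in memory.
    total = 0
    for i in range(n):
        for j in range(n):
            total += a[j * n + i]
    return total

def sum_transposed_tiled(a, n, tile=4):
    # Same iteration space, restructured into tile-by-tile blocks
    # so each block's accesses stay within a small working set.
    total = 0
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, n)):
                    total += a[j * n + i]
    return total
```

Both visit exactly the same iterations; only the order (and therefore the locality) changes, which is the whole point of these transformations.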
As someone who has spent 80%+ of their time CUDA programming for the past 9 years (I wrote the original GPU PyTorch tensor library, the Faiss GPU library, and several things that Nvidia took and put into cuDNN), I found the most instructive, short yet "advanced" education on the subject to be Paulius Micikevicius' various slide decks on "Performance Optimization"; e.g.:
Do you have any career advice for someone deeply interested in breaking into high-performance GPU programming? I find resources like these, and projects like OpenAI's Triton compiler or MIMD-on-GPU, so incredibly interesting.
But I have no idea who employs those skills! Beyond scientific HPC groups or ML research teams, anyway, and I doubt they'd accept someone without a PhD.
My current game plan is getting through “Professional CUDA C Programming” and various computer architecture textbooks, and seeing if that's enough.
To those who are interested: "Programming Massively Parallel Processors: A Hands-on Approach" is a great book for learning CUDA programming, and it talks mostly about performance because, after all, GPUs are about speed.
Unlike typical programming books, it spends a lot of time on how GPUs work and how the techniques it introduces fit into that picture. It's interesting even if you're just curious how an (NVIDIA) GPU works at the code level. Strongly recommended.
I bought the first edition when it came out, and it was definitely a gold mine of information on the subject. I wonder, though: is the fourth edition worth buying another copy? Nvidia has kept advancing CUDA, in particular moving the kernel language further toward C++, but none of that was present when this book came out in 2007. Now more and more happens at the thread-block level with the cooperative-groups C++ API, and at the warp level for tensor cores. It would be great if the authors revisited all the early chapters to modernize that content, but that's a lot of work, so I don't usually count on authors making such an effort for later editions.
It's true: out of all of the "learn CUDA in 24 hours"-style books, this is the best one. It isn't really one of those books at all, it's a textbook, but at first glance it resembles them (at least the color scheme and the title led me astray when I first found it).
Does anybody have an idea of how to get into Metal programming (as in Apple Metal)? I'd love to mess around a little with this on iOS and macOS while learning about tile-based rendering, but I have trouble locating educational written material.
There's a book (https://metalbyexample.com/the-book/), but the author has put up a note that it's quite out of date. It seems the most up-to-date information is in the WWDC videos (regarding e.g. Metal 3), but I'd really prefer something written. And Apple's documentation reads more like reference material and is quite confusing when starting out.
(+1) I'm a newbie to Metal myself, and I wanted to use Swift as the driving language (which was a main selling point). Unfortunately, almost all the material is in Objective-C.
If people like GPU programming, I wrote a blog post this week about GPU-accelerated hashmaps, semi-provocatively titled "Can we 10x Rust hashmap throughput?".
I've been looking into getting into GPU programming, starting with CS344 (https://developer.nvidia.com/udacity-cs344-intro-parallel-pr...) on Udacity. I'm curious to hear from some of the more seasoned GPU veterans out there: what other resources would be good to take a look at after finishing the videos and assignments?
If you want to go really in-depth, I can recommend GTC On-Demand, Nvidia's streaming platform with videos from past GTC conferences. Tony Scudiero had a couple of videos on there called "GPU Memory Bootcamp" that are among the best advanced GPU programming learning material out there.
Only partly related, I believe, so perhaps someone can help: whole theses have been written on prefix-sum algorithms, and I never got it. Perhaps someone kind can give some convincing examples of their advantages.
Not speaking to their implementation, but prefix sums/scans are simply a very useful primitive for parallelizing many otherwise sequential operations. For instance, appending a variable number of items per worker to a shared coalesced buffer uses an exclusive prefix sum; this is probably the most common use case for them in practical programming. They can also be used to partition work across parallel workers (segmented prefix scans).
In lieu of pointer chasing, hashing and the like, parallel operations on flat arrays are the way to maximize GPU utilization.
It's used in one of the fastest sorting approaches, counting sort / binning, to compute where to store the sorted/binned items: first you count the number of items per bin, then you use a prefix sum to compute the memory location of each bin, then you insert the items into their respective bins. Some radix-sort implementations also use counting sort under the hood, and therefore prefix sums. (Not sure if all radix-sort implementations need it.)
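A scalar sketch of that counting-sort pipeline (each numbered step would be a separate parallel pass on the GPU; the function and its arguments are made up for illustration):

```python
def counting_sort(items, num_bins, key):
    # 1. Count items per bin (parallel versions use atomic increments).
    counts = [0] * num_bins
    for x in items:
        counts[key(x)] += 1

    # 2. Exclusive prefix sum over the counts: offsets[b] is the
    #    memory location where bin b starts in the output.
    offsets = [0] * num_bins
    running = 0
    for b in range(num_bins):
        offsets[b] = running
        running += counts[b]

    # 3. Scatter each item into its bin's next free slot.
    out = [None] * len(items)
    for x in items:
        b = key(x)
        out[offsets[b]] = x
        offsets[b] += 1  # parallel versions use an atomic fetch-and-add
    return out
```

An LSD radix sort just repeats this with `key` extracting successive digit groups of each element.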
It's incredibly useful if you have many threads that each produce a variable number of outputs. Imagine you're implementing some filtering operation on the GPU: many threads each take on a fixed workload and then produce some number of results. Unless we take precautions, we have a huge synchronization problem when all threads try to append their results to the output. Note that GPUs didn't have atomics for the first couple of generations that supported CUDA, so you couldn't just getAndIncrement an index and append to an array. We could allocate a fixed number of output slots per thread, but that would leave many blanks in between the results. Once we know the number of outputs per thread, though, a prefix sum lets every thread know where to write its results in the output array.
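This pattern is usually called stream compaction. A sequential sketch of the idea (on a GPU, each numbered step is one data-parallel pass; the names here are made up):

```python
def compact(values, keep):
    # 1. Each "thread" writes a 0/1 flag for its element.
    flags = [1 if keep(v) else 0 for v in values]

    # 2. Exclusive prefix sum of the flags: for every kept element,
    #    this is exactly its index in the dense output.
    offsets = []
    running = 0
    for f in flags:
        offsets.append(running)
        running += f

    # 3. Scatter: "threads" whose flag is 1 write to their offset.
    out = [None] * running
    for v, f, o in zip(values, flags, offsets):
        if f:
            out[o] = v
    return out
```

No atomics needed, and the output is dense and keeps the input order, which is exactly what the fixed-slots-per-thread approach can't give you.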
The output of a prefix sum corresponds exactly to the "row starts" part of the CSR sparse-matrix format. So prefix sums are also essential when creating sparse matrices.
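Concretely, exclusive-scanning the per-row nonzero counts (with the grand total appended) yields the CSR row-pointer array; a small sketch with made-up data:

```python
def row_ptr_from_counts(row_counts):
    # CSR "row starts": row_ptr[i] is the number of nonzeros before
    # row i, i.e. the exclusive prefix sum of the per-row counts,
    # with the total appended so row i spans row_ptr[i]..row_ptr[i+1].
    row_ptr = [0]
    for c in row_counts:
        row_ptr.append(row_ptr[-1] + c)
    return row_ptr
```

For a matrix whose rows hold 2, 0, 3, and 1 nonzeros, this gives `[0, 2, 2, 5, 6]`: row 2's values live at positions 2..5 of the CSR value array.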
Interesting timing on posting this to HN; I've recently been optimizing my WebGPU LSD radix sort. Today I measured it against the Thrust CUDA version, and it's about 10x slower (15 ms vs. 1.5 ms). My goal was to sort 10 million elements in 1 ms, but now that I know even Thrust needs 1.5 ms for 3 million, I know I won't be able to get there.
I haven't tried WebGPU yet, is there an overall performance hit compared to direct CUDA programming?
AFAIK Thrust is intended to simplify GPU programming. It could well be that for specific use cases, in particular when it is possible to fuse multiple operations into single kernels, you could outperform Thrust.
Humble self-promo here: may I also recommend the team at CentML, who have dedicated their academic lives (PhD and beyond) to GPU optimizations that lower the costs of high-performance ML/AI.
jhj | 2 years ago:
https://on-demand.gputechconf.com/gtc/2013/presentations/S34...
(There are some other outstanding ones; I think there was one for the Volta architecture as well.) They're old but still very relevant to today's GPUs.

AlexDenisov | 2 years ago:
Programming Massively Parallel Processors: https://www.youtube.com/watch?v=4pkbXmE4POc&list=PLRRuQYjFhp...

pjmlp | 2 years ago:
https://www.amazon.com/Metal-Programming-Guide-Tutorial-Refe...
For the rest, yes: WWDC videos, samples, and then the documentation, in that order.

winwang | 2 years ago:
HN post here: https://news.ycombinator.com/item?id=37036058

shwestrick | 2 years ago:
A few examples of things you can do with prefix sums:
- compact a hash table (i.e., remove the empty slots)
- flatten a jagged 2D array
- rewrite a dense matrix in compressed-sparse-row (CSR) format