otherjason | 1 year ago
The claim at the end of your post, suggesting that >1 block per SM is always better than 1 block per SM, isn’t strictly true either. In the example you gave, you’re limited to 60 blocks because the thread count of each block is too high. You could, for example, halve the block size to yield 120 blocks, but since each block then has half as many threads, you don’t automatically gain any occupancy by doing so.
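The arithmetic behind that point can be sketched as follows. This assumes hypothetical per-SM limits of 2048 resident threads and 16 resident blocks (typical of several recent NVIDIA architectures, but check your GPU’s actual limits):

```python
# Hypothetical per-SM residency limits; real values depend on the architecture.
MAX_THREADS_PER_SM = 2048
MAX_BLOCKS_PER_SM = 16

def resident_threads(threads_per_block):
    # The number of blocks resident on an SM is capped by both the
    # thread limit and the block limit, whichever binds first.
    blocks = min(MAX_THREADS_PER_SM // threads_per_block, MAX_BLOCKS_PER_SM)
    return blocks * threads_per_block

# 1024 threads/block -> 2 blocks/SM; 512 threads/block -> 4 blocks/SM.
# Either way, 2048 threads are resident, so occupancy is unchanged.
print(resident_threads(1024))  # 2048
print(resident_threads(512))   # 2048
```

Halving the block size doubles the resident block count but leaves the resident thread count, and therefore occupancy, the same.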
When planning the geometry of a CUDA thread grid, there are inherent tradeoffs among the SM’s thread and warp-scheduler limits, shared memory usage, register usage, and overall SM count. Those tradeoffs can be counterintuitive if you follow the (admittedly official NVIDIA) guidance that maximizing thread count leads to optimal performance.
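One way to see how those tradeoffs interact is to compute which resource actually limits the number of resident blocks. This is a rough sketch with hypothetical per-SM budgets (2048 threads, 96 KiB shared memory, 64K registers); the real limits and allocation granularities vary by architecture, so treat this as illustrative arithmetic, not an occupancy calculator:

```python
# Hypothetical per-SM resource budgets; real values are architecture-specific.
MAX_THREADS = 2048
MAX_SHARED_BYTES = 96 * 1024
MAX_REGISTERS = 64 * 1024

def blocks_per_sm(threads_per_block, smem_per_block, regs_per_thread):
    # Each resource independently caps how many blocks fit on one SM;
    # the binding constraint is the minimum of the three.
    by_threads = MAX_THREADS // threads_per_block
    by_smem = MAX_SHARED_BYTES // smem_per_block if smem_per_block else by_threads
    by_regs = MAX_REGISTERS // (regs_per_thread * threads_per_block)
    return min(by_threads, by_smem, by_regs)

# 256 threads/block, 48 KiB shared memory/block, 32 registers/thread:
# threads would allow 8 blocks, registers 8, but shared memory only 2,
# so only 512 of 2048 thread slots are occupied.
print(blocks_per_sm(256, 48 * 1024, 32))  # 2
```

Here shared memory, not thread count, is the binding constraint, which is exactly the kind of interaction that makes “just maximize threads” misleading.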