refibrillator | 8 months ago
https://news.ycombinator.com/item?id=44111673
I find it curious that fundamentals of the CUDA programming model (e.g., kernel launches) are being subverted in favor of fine-grained task-based parallelism that ends up using the hardware more effectively. Makes me wonder if CUDA has been holding us back in some ways.
What are the chances we see your work land in PyTorch as an experimental backend?
Awesome stuff, thanks for sharing.
P.S. minor typo, your first two paragraphs under part 1 are nearly identical.
zhihaojia | 8 months ago
I completely agree that CUDA can be a limiting factor, especially for latency-sensitive workloads. As GPUs are becoming larger and faster, it's increasingly difficult to write standalone kernels that fully utilize hardware resources—particularly when optimizing for low latency with small batch sizes.
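For readers unfamiliar with the pattern being discussed: a "persistent" or megakernel approach replaces many small host-side kernel launches with one long-running kernel whose threads pull fine-grained tasks from an on-device queue, keeping launch overhead off the critical path. A minimal sketch in CUDA (illustrative only; the names and the trivial task body are assumptions, not MPK's actual design):

```cuda
#include <cuda_runtime.h>

// Global task queue head. Real systems store richer task descriptors
// and dependency state; a bare counter is enough to show the pattern.
__device__ unsigned int task_counter = 0;

// One long-lived "persistent" kernel: each thread loops, claiming
// fine-grained tasks from the atomic counter, instead of the host
// launching a separate kernel per piece of work.
__global__ void persistent_worker(float *data, unsigned int num_tasks) {
    while (true) {
        unsigned int task = atomicAdd(&task_counter, 1u);
        if (task >= num_tasks) return;  // queue drained: worker exits
        data[task] *= 2.0f;             // hypothetical per-task work
    }
}

int main() {
    const unsigned int N = 1u << 20;
    float *d_data;
    cudaMalloc(&d_data, N * sizeof(float));
    cudaMemset(d_data, 0, N * sizeof(float));

    // Size the grid to the machine, not the problem:
    // a fixed number of resident blocks services the whole queue.
    persistent_worker<<<64, 256>>>(d_data, N);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```

In a real megakernel each task would be a tile of some operator, and inter-task dependencies would be tracked on-device, which is what lets the scheduler overlap work that separate kernel launches would serialize.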
> What are the chances we see your work land in PyTorch as an experimental backend?
We're definitely excited about that direction. We believe MPK can help PyTorch support megakernel generation, and we’re actively exploring how to make that happen. Stay tuned!
> P.S. minor typo, your first two paragraphs under part 1 are nearly identical.
Thanks for pointing it out--I meant to remove the duplicate paragraph when finalizing the post.
pavelstoev | 8 months ago
Thank you!