Good point. But the broader point about Mojo offering a different level of abstraction compared to Python still stands: I imagine that no amount of magic/operator-fusion/etc. in `torch.compile()` would let one get reasonable performance for an implementation of, say, flash-attn. One would have to use CUDA/Triton/Mojo/etc.
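To make the claim concrete: the core trick in flash-attention is an online (streaming) softmax over tiles of K/V, so the full N×N score matrix is never materialized. Here's a rough NumPy sketch of that algorithm next to naive attention (this is a pedagogical sketch of the tiling idea, not the real kernel, which also has to manage SRAM tiles, warps, etc. — exactly the level of control `torch.compile()` doesn't expose):

```python
import numpy as np

def naive_attention(Q, K, V):
    # Materializes the full (N, N) score matrix -- the memory
    # traffic that flash-attention is designed to avoid.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, block=4):
    # Flash-attention-style online softmax: walk over K/V in blocks,
    # keeping a running row-max (m) and running normalizer (l), and
    # rescaling the partial output O whenever the max changes.
    N, d = Q.shape
    O = np.zeros((N, d))
    m = np.full(N, -np.inf)   # running max per query row
    l = np.zeros(N)           # running softmax denominator
    for j in range(0, N, block):
        S = Q @ K[j:j + block].T / np.sqrt(d)  # only an (N, block) tile
        m_new = np.maximum(m, S.max(axis=-1))
        scale = np.exp(m - m_new)              # rescale old partial sums
        P = np.exp(S - m_new[:, None])
        l = l * scale + P.sum(axis=-1)
        O = O * scale[:, None] + P @ V[j:j + block]
        m = m_new
    return O / l[:, None]
```

Both functions compute the same result; the difference is purely in memory access pattern, which is what a hand-written CUDA/Triton/Mojo kernel controls and a generic graph compiler generally can't rediscover on its own.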
boroboro4|5 months ago
Somehow Python managed to be both a high-level and a low-level language for GPUs…
davidatbu|5 months ago
Also, flash attention is at v3-beta right now? [0] And it requires one of CUDA/Triton/ROCm?
[0] https://github.com/Dao-AILab/flash-attention
But maybe I'm out of the loop? Where do you see that flash attention 4 is written in Python?