neilmovva | 6 months ago
What's less proven is a recipe using MXFP4 x MXFP4 -> FP32 compute, e.g. [1], which needs more involved techniques to work. But if you can get it to train stably, that pathway runs at full throughput on the 5090.
[0]: https://arxiv.org/abs/2506.08027
[1]: https://arxiv.org/abs/2502.20586
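For anyone unsure what "MXFP4 x MXFP4 -> FP32" means numerically, here is a rough NumPy emulation. It is only a sketch and not the recipe from [1]: it assumes the OCP MX layout (blocks of 32 elements, one power-of-two scale per block, FP4 E2M1 elements), uses plain nearest rounding, and leaves out whatever extra techniques are needed for stable training. The names `quantize_mxfp4` and `mxfp4_dot` are made up for illustration.

```python
import numpy as np

# Magnitudes representable by FP4 E2M1 (sign is handled separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
BLOCK = 32  # assumed MX block size: 32 elements share one scale


def quantize_mxfp4(x):
    """Emulate MXFP4 quantization; returns (fp4_values, per_block_scale)."""
    x = np.asarray(x, dtype=np.float32).reshape(-1, BLOCK)
    amax = np.abs(x).max(axis=1, keepdims=True)
    safe = np.where(amax > 0, amax, 1.0)
    # Shared power-of-two exponent: floor(log2(amax)) minus 2, the largest
    # E2M1 exponent (6 = 1.5 * 2**2). All-zero blocks get scale 1.
    exp = np.where(amax > 0, np.floor(np.log2(safe)) - 2, 0.0)
    scale = np.exp2(exp).astype(np.float32)
    scaled = x / scale
    # Saturate at the largest FP4 magnitude, then snap to the nearest grid
    # point (ties go to the smaller magnitude here, not round-to-even).
    mag = np.minimum(np.abs(scaled), 6.0)
    idx = np.abs(mag[..., None] - FP4_GRID).argmin(axis=-1)
    q = (np.sign(scaled) * FP4_GRID[idx]).astype(np.float32)
    return q, scale


def mxfp4_dot(a, b):
    """Dot product of two MXFP4-quantized vectors, accumulated in FP32."""
    qa, sa = quantize_mxfp4(a)
    qb, sb = quantize_mxfp4(b)
    # FP4 x FP4 products summed per block in FP32, then both block scales applied.
    partial = (qa * qb).sum(axis=1, dtype=np.float32)
    return np.float32((partial * sa[:, 0] * sb[:, 0]).sum(dtype=np.float32))
```

On real hardware the scaling and accumulation happen inside the tensor core instruction; this only shows where precision is lost (4-bit elements, power-of-two scales) and where it is kept (the FP32 accumulator).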
laidoffamazon|6 months ago