chillee | 6 months ago
If you compute out the MFU the author gets, it's 1.44 million input tokens per second * 37 billion active params * 2 (FMA) / 8 [GPUs per instance] = ~13 petaFLOPs per second per GPU. That's approximately 7x the absolute peak FLOPS of the hardware. Obviously, that's impossible.
There are many other issues with this article, such as assuming only 32 concurrent requests (?), assuming only 8 GPUs per instance as opposed to the more efficient/standard prefill-decode disaggregated setups, assuming that attention computation is the main thing that makes models compute-bound, etc. It's a bit of an indictment of HN's understanding of LLMs that most people are raising issues with the article other than these fundamental misunderstandings.
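A quick sketch of the arithmetic above. The peak figure is an assumption on my part (roughly 2 PFLOP/s dense FP8 for an H100-class GPU; the article's exact hardware isn't stated here):

```python
# Back-of-envelope check of the throughput arithmetic in the parent comment.
tokens_per_s = 1.44e6     # input tokens per second (from the comment)
active_params = 37e9      # active parameters per token (MoE model)
flops_per_param = 2       # one multiply-accumulate (FMA) = 2 FLOPs
gpus_per_instance = 8     # GPUs per instance assumed by the article

flops_per_gpu = tokens_per_s * active_params * flops_per_param / gpus_per_instance
print(f"{flops_per_gpu / 1e15:.1f} PFLOP/s per GPU")  # ~13.3

# Assumed peak: ~2 PFLOP/s dense FP8 on an H100-class GPU (my assumption)
assumed_peak = 2e15
print(f"~{flops_per_gpu / assumed_peak:.1f}x assumed peak")  # ~6.7x
```

At ~6.7x an already-generous dense FP8 peak, the implied MFU is well over 100%, which is why the headline number can't be right.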
pama | 6 months ago
https://lmsys.org/blog/2025-05-05-large-scale-ep/
This has gotten significantly cheaper still since then, with additional code optimizations and with the use of B200s.