
chillee | 6 months ago

This article's math is wrong on many fundamental levels. One of the most obvious ones is that prefill is nowhere near bandwidth bound.

If you work out the throughput implied by the author's numbers, it's 1.44 million input tokens per second * 37 billion active params * 2 (FMA) / 8 [GPUs per instance] ≈ 13 petaFLOP/s per GPU. That's approximately 7x the absolute peak FLOPS of the hardware. Obviously, that's impossible.
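To make the arithmetic easy to check, here's the same back-of-envelope calculation in a few lines of Python. The token rate and parameter count come from the thread; the peak FLOPS figure is an assumption for an H100-class GPU (~2 PFLOP/s dense FP8), since the hardware isn't specified:

```python
# Implied per-GPU compute from the article's claimed throughput.
tokens_per_s = 1.44e6      # claimed input tokens/s per 8-GPU instance
active_params = 37e9       # active parameters per token (MoE)
flops_per_param = 2        # one multiply + one add (FMA) per parameter
gpus = 8                   # GPUs per instance

implied_flops_per_gpu = tokens_per_s * active_params * flops_per_param / gpus
peak_flops = 2e15          # ~2 PFLOP/s, assumed H100-class peak

print(implied_flops_per_gpu / 1e15)          # ~13.3 petaFLOP/s per GPU
print(implied_flops_per_gpu / peak_flops)    # ~6.7x peak, i.e. impossible
```

Any reasonable choice of peak FLOPS gives the same conclusion: the implied utilization is several hundred percent of what the silicon can do.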

There are many other issues with this article, such as assuming only 32 concurrent requests (?), assuming only 8 GPUs per instance as opposed to the more efficient/standard prefill-decode disaggregated setups, assuming that attention computation is the main thing that makes models compute-bound, etc. It's a bit of an indictment of HN's understanding of LLMs that most people are bringing up issues with the article that aren't any of the fundamental misunderstandings here.
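A quick roofline check shows why prefill is compute-bound rather than bandwidth-bound: each weight fetched from HBM is reused for every token in flight, so arithmetic intensity scales with batch/sequence size. The hardware numbers below are assumptions for an H100-class GPU (the point is the orders of magnitude, not the exact figures):

```python
# Roofline sketch: is the workload compute-bound or bandwidth-bound?
peak_flops = 2e15       # ~2 PFLOP/s, assumed H100-class peak
mem_bw = 3.35e12        # ~3.35 TB/s HBM bandwidth, assumed
ridge = peak_flops / mem_bw   # ~600 FLOPs/byte to saturate compute

bytes_per_param = 2     # bf16 weights

def arithmetic_intensity(tokens_in_flight):
    # Each parameter read from HBM does 2 FLOPs (FMA) per token in flight.
    return 2 * tokens_in_flight / bytes_per_param

print(arithmetic_intensity(4096))  # prefill, 4096-token prompt: >> ridge
print(arithmetic_intensity(32))    # decode, batch of 32: << ridge
```

Prefill over a few thousand prompt tokens sits far above the ridge point (compute-bound); decode with small batches sits far below it (bandwidth-bound). That asymmetry, not attention, is the basic reason input and output tokens cost such different amounts.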


pama|6 months ago

Agree that the writeup is very wrong, especially for the output tokens. Here is how anyone with enough money to allocate a small cluster of powerful GPUs has been able to decode huge models at scale for nearly 4 months now, at a cost of about 0.2 USD per million output tokens:

https://lmsys.org/blog/2025-05-05-large-scale-ep/

This has gotten significantly cheaper still since then, thanks to additional code optimizations and the move to B200s.

ma2rten|6 months ago

You can also look at the price of open-source models on OpenRouter, which is a fraction of the cost of closed-source models. This is a market that is heavily commoditized, so I would expect it to reflect the true cost plus a small margin.

Aeolun|6 months ago

As much as I appreciate you saying the math is wrong, it doesn’t really help me adjust my expectations unless you provide correct numbers as well.

resonious|6 months ago

Right. Now I want to know if they're really losing money or not.

Den_VR|6 months ago

So, bottom line, do you think it’s probable that either OpenAI or Anthropic are “losing money on inference?”

chillee|6 months ago

No. In some sense, the article comes to the right conclusion haha. But it's probably >100x off on its central premise about output tokens costing more than input.

diamond559|6 months ago

Even if it is, ignoring the biggest costs going into the product and then claiming they are profitable would be actual fraud.

johnnypangs|6 months ago

As one of those people who doesn't really understand LLMs, does anyone have any recommendations for improving my understanding of them?