
Are OpenAI and Anthropic losing money on inference?

515 points | martinald | 6 months ago | martinalderson.com

478 comments

[+] chillee|6 months ago|reply
This article's math is wrong on many fundamental levels. One of the most obvious ones is that prefill is nowhere near bandwidth bound.

If you compute out the MFU the author's numbers imply, it's 1.44 million input tokens per second * 37 billion active params * 2 (FMA) / 8 [GPUs per instance] = ~13 PFLOP/s per GPU. That's approximately 7x the absolute peak FLOPS of the hardware. Obviously, that's impossible.
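As a quick sanity check, the arithmetic can be spelled out like this (the ~2 PFLOP/s per-GPU peak is a rough, generous assumption; the exact figure depends on precision and sparsity):

```python
# Sanity-checking the parent's arithmetic on the article's implied prefill rate.
tokens_per_sec = 1.44e6    # article's implied input tokens/sec per instance
active_params = 37e9       # DeepSeek-style active parameter count
flops_per_param = 2        # one fused multiply-add per parameter
gpus = 8                   # GPUs per instance in the article

flops_per_gpu = tokens_per_sec * active_params * flops_per_param / gpus
peak_flops = 2e15          # assumed generous per-GPU peak (precision-dependent)
print(flops_per_gpu / 1e15)        # ~13.3 PFLOP/s demanded per GPU
print(flops_per_gpu / peak_flops)  # ~6.7x over peak -> impossible
```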

There are many other issues with this article, such as assuming only 32 concurrent requests(?), only 8 GPUs per instance as opposed to the more efficient/standard prefill-decode disaggregated setups, assuming that attention computation is the main thing that makes models compute-bound, etc. It's a bit of an indictment of HN's understanding of LLMs that most people are bringing up issues with the article that aren't any of these fundamental misunderstandings.

[+] pama|6 months ago|reply
Agree that the writeup is very wrong, especially for the output tokens. Here is how anyone with enough money to allocate a small cluster of powerful GPUs has been able to decode huge models at scale for nearly 4 months now, at a cost of about 0.2 USD per million output tokens.

https://lmsys.org/blog/2025-05-05-large-scale-ep/

This has gotten significantly cheaper still since then, with additional code optimizations and with the move to B200s.

[+] Aeolun|6 months ago|reply
As much as I appreciate you saying the math is wrong, it doesn’t really help me adjust my expectations unless you provide correct numbers as well.
[+] Den_VR|6 months ago|reply
So, bottom line, do you think it’s probable that either OpenAI or Anthropic are “losing money on inference?”
[+] johnnypangs|6 months ago|reply
As one of those people who doesn’t really understand llms, does anyone have any recommendations to better my understanding of them?
[+] _sword|6 months ago|reply
I've done the modeling on this a few times and I always get to a place where inference can run at 50%+ gross margins, depending mostly on GPU depreciation and how good the host is at optimizing utilization. The challenge for the margins is whether or not you consider model training costs as part of the calculation. If model training isn't capitalized + amortized, margins are great. If they are amortized and need to be considered... yikes
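A toy version of that margin flip (all numbers invented for illustration, not anyone's actual financials):

```python
# Toy margin model: inference gross margin looks very different with and
# without amortized training costs, per the comment above.
monthly_revenue = 100.0     # arbitrary units
serving_cost = 45.0         # GPUs, power, hosting per month
training_cost = 120.0       # one training run
model_lifespan_months = 6   # how long the model stays competitive
amortized_training = training_cost / model_lifespan_months  # 20/month

gm_serving_only = (monthly_revenue - serving_cost) / monthly_revenue
gm_with_training = (monthly_revenue - serving_cost - amortized_training) / monthly_revenue
print(gm_serving_only)    # 0.55
print(gm_with_training)   # 0.35
```

Shortening the model's competitive lifespan is the lever that does the damage: halve it and the amortization doubles.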
[+] BlindEyeHalo|6 months ago|reply
Why wouldn't you factor in training? It is not like you can train once and then have the model run for years. You need to constantly improve to keep up with the competition. The lifespan of a model is just a few months at this point.
[+] ozgune|6 months ago|reply
I agree that you could get to high margins, but I think the modeling holds only if you're an AI lab operating at scale with a setup tuned for your model(s). I think the most open study on this one is from the DeepSeek team: https://github.com/deepseek-ai/open-infra-index/blob/main/20...

For others, I think the picture is different. When we ran benchmarks on DeepSeek-R1 on 8x H200 SXM using vLLM, we got up to 12K total tok/s (concurrency 200, input:output ratio of 6:1). If you're spiking up 100-200K tok/s, you need a lot of GPUs for that. Then, the GPUs sit idle most of the time.

I'll read the blog post in more detail, but I don't think the following assumptions hold outside of AI labs.

* 100% utilization (no spikes, balanced usage between day/night or weekdays)
* Input processing is free (~$0.001 per million tokens)
* DeepSeek fits into H100 cards in a way that network isn't the bottleneck
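A quick sizing sketch from the benchmark numbers above (the spike and average load figures are assumptions for illustration):

```python
import math

# How many 8x H200 nodes (~12K total tok/s each, per the benchmark above)
# to absorb a traffic spike, and how idle they sit at a lower average load.
node_throughput = 12_000   # tok/s per 8-GPU node (measured above)
peak_load = 150_000        # tok/s at spike (assumed)
avg_load = 40_000          # tok/s on average (assumed)

nodes = math.ceil(peak_load / node_throughput)
utilization = avg_load / (nodes * node_throughput)
print(nodes)                  # 13 nodes -> 104 GPUs provisioned for the peak
print(round(utilization, 2))  # ~0.26: idle roughly three-quarters of the time
```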

[+] lumost|6 months ago|reply
I wonder how much capex risk there is in this model; depreciating the GPUs over 5 years is fine if you can guarantee utilization. Losing market share might be a death sentence for some of these firms as utilization falls.
[+] next_xibalba|6 months ago|reply
> whether or not you consider model training costs as part of the calculation

Whether they flow through COGS/COR or elsewhere on the income statement, they've gotta be recognized. In which case, either you have low gross margins or low operating profit (low net income??). Right?

That said, I just can't conceive of a way that training costs are not hitting gross margins. Be it IFRS/GAAP etc., training is 1) directly attributable to the production of the service sold, 2) is not SG&A, financing, or abnormal cost, and thus 3) only makes sense to match to revenue.

[+] lawlessone|6 months ago|reply
Does that include legal fights and potential payouts to artists and writers whose work was used without permission?

Can anyone explain why it's not allowed to compensate the creators of the data?

[+] trilogic|6 months ago|reply
I have to disagree. The biggest cost is still energy consumption, water, and maintenance. Not to mention keeping up with rivals at an incredibly high tempo (hence offers of billions, like Meta's recently). Then there's the cost of hardware, reflected in Nvidia's skyrocketing shares :) No one should dare to talk about profit yet. Now is the time to grab the market, invest a lot, and work hard, hoping for a future profit. The equation is still a work in progress.
[+] noodletheworld|6 months ago|reply
Huh.

I feel oddly skeptical about this article; I can't specifically argue the numbers, since I have no idea, but... there are some decent open source models; they're not state of the art, but if inference is this cheap then why aren't there multiple API providers offering models at dirt cheap prices?

The only cheap-ass providers I've seen only run tiny models. Where's my cheap deepseek-R1?

Surely if it's this cheap, and we're talking massive margins according to this, I should be able to get a cheap hosted 600B param model / run my own.

Am I missing something?

It seems that reality (ie. the absence of people actually doing things this cheap) is the biggest critic of this set of calculations.

[+] dragonwriter|6 months ago|reply
> but if inference is this cheap then why aren't there multiple API providers offering models at dirt cheap prices

There are multiple API providers offering models at dirt cheap prices, enough so that there is at least one well-known API provider that is an aggregator of other API providers and offers lots of models at $0.

> The only cheap-ass providers I've seen only run tiny models. Where's my cheap deepseek-R1?

https://openrouter.ai/deepseek/deepseek-r1-0528:free

[+] colinsane|6 months ago|reply
> I should be able to get a cheap / run my own 600B param model.

if the margins on hosted inference are 80%, then you need > 20% utilization of whatever you build for yourself for this to be less costly to you (on margin).

i self-host open weight models (please: deepseek et al aren't open _source_) on whatever $300 GPU i bought a few years ago, but if it outputs 2 tokens/sec then i'm waiting 10 minutes for most results. if i want results in 10s instead of 10m, i'll be paying $30000 instead. if i'm prompting it 100 times during the day, then it's idle 99% of the time.

coordinating a group buy for that $30000 GPU and sharing that across 100 people probably makes more sense than either arrangement in the previous paragraph. for now, that's a big component of what model providers, uh, provide.
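A napkin version of that trade-off, with invented numbers (hosted price, margin, GPU cost, and query volume are all hypothetical):

```python
# Self-hosting vs. a hosted API at an assumed 80% margin, when your own
# GPU only serves ~100 prompts a day. All figures are illustrative.
api_price_per_query = 0.01                            # $/prompt, hypothetical
provider_cost_per_query = api_price_per_query * 0.20  # provider keeps 80%

gpu_cost_per_hour = 30_000 / (3 * 365 * 24)  # $30k GPU amortized over 3 years
queries_per_day = 100
self_host_per_query = gpu_cost_per_hour * 24 / queries_per_day

print(round(self_host_per_query, 3))  # ~0.274/query at ~1% utilization
print(provider_cost_per_query)        # ~0.002/query for the busy provider
```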

[+] brokencode|6 months ago|reply
I also have no idea on the numbers. But I do know that these same companies are pouring many billions of dollars into training models, paying very expensive staff, and building out infrastructure. These costs would need to be factored in to come up with the actual profit margins.
[+] hirako2000|6 months ago|reply
Imo the article is totally off the mark since it assumes users on average do not go over the 1M tokens per day.

Afaik openai doesn't enforce a daily quota even on the $20 plans unless the platform is under pressure.

Since I often consume 20M tokens per day, one can assume many would use far more than the 1M tokens assumed in the article's calculations.

[+] johnsmith1840|6 months ago|reply
Another giant problem with this article is we have no idea what optimizations they use on their end. There are some wildly complex optimizations these large AI companies use.

What I'm trying to say is that hosting your own model is in an entirely different league than the pros.

Even if we account for errors in the article that imply higher costs, I would argue it would come right back to profit because of how advanced inference optimization has become.

If actual model intelligence is not a moat (looking likely this is true) the real sauce of profitable AI companies is advanced optimizations across the entire stack.

OpenAI is NEVER going to release their specialized kernels, routing algos, quantizations, or model compilation methods. These are all really hard and really specific.

[+] paulddraper|6 months ago|reply
I would not be surprised if the operating costs are modest

But these companies also have very expensive R&D and large upfront costs.

[+] sc68cal|6 months ago|reply
This whole article is built off using DeepSeek R1, which is a huge premise that I don't think is correct. DeepSeek is much more efficient and I don't think it's a valid way to estimate what OpenAI and Anthropic's costs are.

https://www.wheresyoured.at/deep-impact/

Basically, DeepSeek is _very_ efficient at inference, and that was the whole reason why it shook the industry when it was released.

[+] qrios|6 months ago|reply
For sure an interesting calculation. Only one remark from someone with GPU metal experience:

> But compute becomes the bottleneck in certain scenarios. With long context sequences, attention computation scales quadratically with sequence length.

Even if the statement about quadratic scaling is right, the bottleneck we are talking about is somewhere north of it by a factor of 1000. If 10k cores each do only simple matrix operations, each needs to have new data (up to 64 KB) available every 500 cycles (let's say). Moving that amount of data (without _any_ collisions) means something like 100+ GB/s per core. Even with 2+ TB/s of HBM, the bottleneck is the memory transfer rate, by something like a factor of 500. With collisions, we're talking about an additional factor like 5000 (the last time I ran some tests on a 4090).
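Restating that arithmetic explicitly (core count, refill size, interval, clock, and HBM rate are all the parent's rough assumptions):

```python
# Per-core bandwidth demand vs. what HBM can actually deliver, using the
# assumed figures from the comment above. All numbers are rough.
cores = 10_000
bytes_per_refill = 64 * 1024   # up to 64 KB of fresh data per core
cycles_between = 500           # one refill every ~500 cycles
clock_hz = 1.5e9               # assumed core clock
hbm_bw = 3.3e12                # bytes/s, roughly H100-class HBM

per_core_bw = bytes_per_refill * clock_hz / cycles_between  # bytes/s per core
aggregate_bw = per_core_bw * cores
print(per_core_bw / 1e9)       # ~197 GB/s demanded per core
print(aggregate_bw / hbm_bw)   # HBM falls short by a factor of several hundred
```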

[+] gitremote|6 months ago|reply
These numbers are off.

> $20/month ChatGPT Pro user: Heavy daily usage but token-limited

ChatGPT Pro is $200/month and Sam Altman already admitted that OpenAI is losing money from Pro subscriptions in January 2025:

"insane thing: we are currently losing money on openai pro subscriptions!

people use it much more than we expected."

- Sam Altman, January 6, 2025

https://xcancel.com/sama/status/1876104315296968813

[+] ankit219|6 months ago|reply
This seems very very far off. From the latest reports, anthropic has a gross margin of 60%. It came out in their latest fundraising story. From that one The Information report, it estimated OpenAI's GM to be 50% including free users. These are gross margins so any amortization or model training cost would likely come after this.

Then, today almost every lab uses methods like speculative decoding and caching which reduce the cost and speed up things significantly.

The input numbers are far off. The assumption is 37B active params. Sonnet 4 is supposedly a 100B-200B param model. Opus is about 2T params. Neither of them (even if we assume MoE) will have exactly this number of active params. Then there is a cost to hosting and activating params at inference time. (The article kind of assumes it would be the same constant 37B params.)

[+] jonathan-adly|6 months ago|reply
Basically- the same math as modern automated manufacturing. Super expensive and complex build-out - then a money printer once running and optimized.

I know there is lots of bearish sentiments here. Lots of people correctly point out that this is not the same math as FAANG products - then they make the jump that it must be bad.

But - my guess is these companies end up with margins better than Tesla (modern manufacturer), but less than 80%-90% of "pure" software. Somewhere in the middle, which is still pretty good.

Also - once the Nvidia monopoly gets broken, the initial build out becomes a lot cheaper as well.

[+] a333999|6 months ago|reply
Why is everyone so mean? I believe the article delivered on its headline.

The author showed that no, large LLM providers do not lose money on inference. Model training is not accounted for because that was not the point.

I personally felt that the maths calculations were a bit redundant, since after the maths part the same numbers are taken from OpenRouter pricing. But I think it is a matter of presentation.

I would have shown OR pricing first and then done the math. That way it would have been just as insightful, since it still showed that model providers do make money, and the reader would not have felt that he did maths he could have avoided :)

So, thanks to the author! Good job.

The feedback from HN is overly harsh. Idk; it makes me sad how mean people have become. I guess the world is not in a great spot, but taking out anger on strangers will only make it worse.

[+] JCM9|6 months ago|reply
These articles (of which there are many) all make the same basic accounting mistakes. You have to include all the costs associated with the model, not just inference compute.

This article is like saying an apartment complex isn’t “losing money” because the monthly rents cover operating costs but ignoring the cost of the building. Most real estate developments go bust because the developers can’t pay the mortgage payment, not because they’re negative on operating costs.

If the cash flow was truly healthy these companies wouldn’t need to raise money. If you have healthy positive cash flow you have much better mechanisms available to fund capital investment other than selling shares at increasingly inflated valuations. Eg issue a bond against that healthy cash flow.

Fact remains when all costs are considered these companies are losing money and so long as the lifespan of a model is limited it’s going to stay ugly. Using that apartment building analogy it’s like having to knock down and rebuild the building every 6 months to stay relevant, but saying all is well because the rents cover the cost of garbage collection and the water bill. That’s simply not a viable business model.

Update Edit: A lot of commentary below re the R&D and training costs and whether it's fair to exclude them from inference costs or "unit economics." I'd simply say inference is just selling compute, and that should be high margin, which the article concludes it is. The issue behind the growing concerns about a giant AI bubble is whether that margin is sufficient to cover the costs of everything else. I'd also say that excluding the cost of the model from "unit economics" calculations doesn't make business/math/economic sense, since it's literally the thing being sold. It's not some bit of fungible equipment or long-term capital expense when models become obsolete after a few months. Take away the model and you're just selling compute, so it's really not a great metric to use to say these companies are OK.

[+] moduspol|6 months ago|reply
This kind of presumes you're just cranking out inference non-stop 24/7 to get the estimated price, right? Or am I misreading this?

In reality, presumably they have to support fast inference even during peak usage times, but then the hardware is still sitting around off of peak times. I guess they can power them off, but that's a significant difference from paying $2/hr for an all-in IaaS provider.

I'm also not sure we should expect their costs to just be "in-line with, or cheaper than" what various hourly H100 providers charge. Those providers presumably don't have to run entire datacenters filled to the gills with these specialized GPUs. It may be a lot more expensive to do that than to run a handful of them spread among the same datacenter with your other workloads.

[+] martinald|6 months ago|reply
Yes. But these are on demand prices, so you could just turn them off when loads are less.

But there is no way that OpenAI should be more expensive than this. The main cost is the capex of the H100s, and if you are buying 100k at a time you should be getting a significant discount off list price.

[+] lolc|6 months ago|reply
Of course it is impossible for us to know the true cost, but idle instances should not be accounted for at full price:

1. Idle instances don't turn electricity to heat so that reduces their operating cost.

2. Idle instances can be borrowed for training which means flexible training amortizes peak inference capacity.

[+] empath75|6 months ago|reply
> In reality, presumably they have to support fast inference even during peak usage times, but then the hardware is still sitting around off of peak times. I guess they can power them off, but that's a significant difference from paying $2/hr for an all-in IaaS provider.

They can repurpose those nodes for training when they aren't being used for inference. Or if they're using public cloud nodes, just turn them off.

[+] KallDrexx|6 months ago|reply
Since DeepSeek R1 is open weight, wouldn't it be better to validate the napkin math by measuring how many realistic full LLM inferences can actually be done on a single H100 in a given time period, and calculating the token cost from that?

Without having in-depth knowledge of the industry, the margin difference between input and output tokens is very odd to me between your napkin math and the R1 prices. That's very important, as any reasoning model explodes reasoning tokens, which means you'll encounter a lot more output tokens for fewer input tokens, and that's going to heavily cut into the high-margin ("essentially free") input token profit.

Unless I'm reading the article wrong.
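To illustrate the point (prices here are made up, not any provider's actual rates): a long reasoning trace moves nearly all the spend to the expensive output side.

```python
# How reasoning traces shift spend toward output tokens. Hypothetical prices.
price_in = 0.5    # $/M input tokens (assumed)
price_out = 2.5   # $/M output tokens (assumed)

def request_cost(input_toks: int, output_toks: int) -> float:
    return input_toks / 1e6 * price_in + output_toks / 1e6 * price_out

chat = request_cost(2_000, 500)         # short answer
reasoning = request_cost(2_000, 8_000)  # same prompt, long reasoning trace
print(chat, reasoning)                  # the reasoning request costs ~9x more
```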

[+] smcleod|6 months ago|reply
A few things:

1. Your token count per day seems quite low ("2M input tokens, ~30k output tokens/day") - that's FAR less than I'd expect. For comparison, I average 330M - 850M combined tokens per day; I'm on the higher side of my peers, who average 150M-600M combined tokens per day.

2. It doesn't seem you're taking prompt caching into account. This generally reduces the inference required for agentic coding by 85-95%.

3. It would be good if you added what quantisation you're running, for example 8.5-9bpw (Q8 equivalent, indistinguishable from fp32/bf16) for the model, and for the KV cache (Q8/(b)f16 etc.).
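On point 2, a back-of-envelope sketch of what a ~90% cache hit rate does to input-token cost (all prices and the 10x cached-read discount are hypothetical, not any provider's actual rates):

```python
# Effect of prompt caching on daily input-token spend. Illustrative only.
input_tokens = 100e6                  # input tokens/day, assumed
hit_rate = 0.90                       # mid-range of the 85-95% figure above
price_full, price_cached = 3.0, 0.3   # $/M tokens, hypothetical

cost_no_cache = input_tokens / 1e6 * price_full
cost_cached = input_tokens / 1e6 * (hit_rate * price_cached
                                    + (1 - hit_rate) * price_full)
print(cost_no_cache, round(cost_cached, 1))  # 300.0 vs ~57.0 per day
```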

[+] caminanteblanco|6 months ago|reply
Ok, one issue I have with this analysis is the breakdown between input and output tokens. I'm the kind of person who spends most of my chat asking questions, so I might only use 20ish input tokens per prompt, where Gemini is having to put out several hundred, which would seem to affect the economics quite a bit.
[+] yalogin|6 months ago|reply
Will these companies ever stop training new models? What does it mean if we get there? Feels like they will have to constantly train and improve the models; not sure what that means either. What incremental improvements can these models show?

Another question is - will it ever become less costly to train?

I'd love to see opinions from someone in the know.

[+] ekelsen|6 months ago|reply
The math on the input tokens is definitely wrong. It claims each instance (8 GPUs) can handle 1.44 million tokens/sec of input. Let's check that out.

1.44e6 tokens/sec * 37e9 bytes/token / 3.3e12 bytes/sec/GPU = ~16,000 GPUs

And that's assuming a more likely 1 byte per parameter.

So the article is only off by a factor of at least 1,000. I didn't check any of the rest of the math, but that probably has some impact on their conclusions...

[+] thatguysaguy|6 months ago|reply
37 billion bytes per token?

Edit: Oh, assuming this is an estimate based on the model weights moving from HBM to SRAM: that's not how transformers are applied to input tokens. You only have to move the weights for every token during generation, not during "prefill". (And actually, during generation you can use speculative decoding to do better than this roofline anyway.)
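A toy version of that correction, reusing the parent thread's assumed numbers (37 GB of weights at 1 byte/param; the 4096-token prompt is an example): in prefill, one pass over the weights serves the whole prompt, so charging a full weight reload per input token overcounts by the prompt length.

```python
# Weight-movement comparison: per-token reload vs. one prefill pass.
params_bytes = 37e9        # 1 byte/param, per the parent thread
prompt_tokens = 4096       # example prompt length (assumed)

naive_bytes = params_bytes * prompt_tokens   # reload weights per input token
prefill_bytes = params_bytes                 # stream weights once for the batch
print(naive_bytes / prefill_bytes)           # 4096.0x overcount
```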

[+] GaggiX|6 months ago|reply
Your calculations make no sense. Why are you loading the model for each token independently? You can process all the input tokens at the same time as long as they can fit in memory.

You are doing the calculation as if they were output tokens on a single batch; it would not make sense even in the decode phase.

[+] endtime|6 months ago|reply
> 37e9 bytes/token

This doesn't quite sound right...isn't a token just a few characters?

[+] mrcwinn|6 months ago|reply
As the author seems to admit, an outsider is going to lack so much information (costs, loss leaders, etc), one has to assume any modeling is so inaccurate that it's not worth anything.

So the question remains unanswered, at least for us. For those putting money in, you can be absolutely certain they have a model with sufficient data to answer the question. Since money did go in, even if it's venture, the answer is probably "yes in the immediate, but no over time."