top | item 46542639

(no title)

IMTDb | 1 month ago

A key difference is that the cost to execute a cab ride largely stayed the same. Gas to get you from point A to point B is ~$5, and there's a floor on what you can pay the driver. If your ride costs $8 today, you know that's unsustainable; it'll eventually climb to $10 or $12.

But inference costs are dropping dramatically over time, and that trend shows no signs of slowing. So even if a task costs $8 today thanks to VC subsidies, I can be reasonably confident that the same task will cost $8 or less without subsidies in the not-too-distant future.

Of course, by then we'll have much more capable models. So if you want SOTA, you might see the jump to $10-12. But that's a different value proposition entirely: you're getting significantly more for your money, not just paying more for the same thing.

discuss

lompad|1 month ago

>But inference costs are dropping dramatically over time,

Please prove this statement, so far there is no indication that this is actually true - the opposite seems to be the case. Here are some actual numbers [0] (and whether you like Ed or not, his sources have so far always been extremely reliable.)

There is a reason the AI companies don't ever talk about their inference costs. They boast with everything they can find, but inference... not.

[0]: https://www.wheresyoured.at/oai_docs/

patresh|1 month ago

I believe OP's point is that for a given model quality, inference cost decreases dramatically over time. The article you linked talks about effective total inference costs which seem to be increasing.

Those are not contradictory: a company's inference costs can increase due to deploying more models (Sora), deploying larger models, doing more reasoning, and an increase in demand.

However, if we look purely at how much it costs to run inference on a fixed amount of requests for a fixed model quality, I am quite convinced that the inference costs are decreasing dramatically. Here's a model from late 2025 (see Model performance section) [1] with benchmarks comparing a 72B parameter model (Qwen2.5) from early 2025 to the late 2025 8B Qwen3 model.

The 9x smaller model outperforms the larger one from earlier the same year on 27 of the 40 benchmarks they were evaluated on, which is just astounding.

[1] https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct

academia_hack|1 month ago

Anecdotally, I find you can tell if someone worked at a big AI provider or a small AI startup by proposing an AI project like this:

" First we'll train a custom trillion parameter LLM for HTML generation. Then we'll use it to render our homepage to our 10 million daily visitors. "

The startup people will be like "this is a bad idea because you don't have enough GPUs for training that LLM" and the AI lab folks will be like "How do you intend to scale inference if you're not Google?"

changbai|1 month ago

Inference cost for leading models and more complex tasks is high. However, inference cost for a stationary model and task has dropped drastically.

https://a16z.com/llmflation-llm-inference-cost/ for example shows this to be true.

The report from OpenRouter https://openrouter.ai/state-of-ai also makes the same observation.

SecretDreams|1 month ago

> But inference costs are dropping dramatically over time, and that trend shows no signs of slowing. So even if a task costs $8 today thanks to VC subsidies, I can be reasonably confident that the same task will cost $8 or less without subsidies in the not-too-distant future.

I'd like to see this statement plotted against current trends in hardware prices ISO performance. Ram, for example, is not meaningfully better than it was 2 years ago, and yet is 3x the price.

I fail to see how costs can drop while valuations for all major hardware vendors continue to go up. I don't think the markets would price companies in this way if the thought all major hardware vendors were going to see margins shrink a la commodity like you've implied.

santadays|1 month ago

I've seen the following quote.

"The energy consumed per text prompt for Gemini Apps has been reduced by 33x over the past 12 months."

My thinking is that if Google can give away LLM usage (which is obviously subsidized) it can't be astronomically expensive, in the realm of what we are paying for ChatGPT. Google has their own TPUs and company culture oriented towards optimizing the energy usage/hardware costs.

I tend to agree with the grandparent on this, LLMs will get cheaper for what we have now level intelligence, and will get more expensive for SOTA models.

xpe|1 month ago

> I fail to see how costs can drop while valuations for all major hardware vendors continue to go up. I don't think the markets would price companies in this way if the thought all major hardware vendors were going to see margins shrink a la commodity like you've implied.

This isn't hard to see. A company's overall profits are influenced – but not determined – by the per-unit economics. For example, increasing volume (quantity sold) at the same per-unit profit leads to more profits.

PaulHoule|1 month ago

It's not the hardware getting cheaper, it's that LLMs were developed when we really didn't understand how they worked, and there is still some room to improve the implementations, particularly do more with less RAM... And that's everything from doing more with fewer weights to things like FP16, not to mention if you can 2x the speed you can get twice as much done with the same RAM and all the other parts.

hug|1 month ago

> I'd like to see this statement plotted against current trends in hardware prices ISO performance.

Prices for who? The prices that are being paid by the big movers in the AI space, for hardware, aren't sticker price and never were.

The example you use in your comment, RAM, won't work: It's not 3x the price for OpenAI, since they already bought it all.

mcphage|1 month ago

> So even if a task costs $8 today thanks to VC subsidies, I can be reasonably confident that the same task will cost $8 or less without subsidies in the not-too-distant future.

The same task on the same LLM will cost $8 or less. But that's not what vendors will be selling, nor what users will be buying. They'll be buying the same task on a newer LLM. The results will be better, but the price will be higher than the same task on the original LLM.

doctorpangloss|1 month ago

> I fail to see how costs can drop while valuations for all major hardware vendors continue to go up.

yeah. valuations for hardware vendors have nothing to do with costs. valuations are a meaningless thing to integrate into your thinking about something objective like, will the retail costs of inference trend down (obviously yes)

glemion43|1 month ago

[deleted]

forty|1 month ago

What if we run out of GPU? Out of RAM? Out of electricity?

AWS is already raising GPU prices, that never happened before. What if there is war in Taiwan? What if we want to get serious about climate change and start saving energy for vital things ?

My guess is that, while they can do some cool stuff, we cannot afford LLMs in the long run.

jiggawatts|1 month ago

> What if we run out of GPU?

These are not finite resources being mined from an ancient alien temple.

We can make new ones, better ones, and the main ingredients are sand and plastic. We're not going to run out of either any time soon.

Electricity constraints are a big problem in the near-term, but may sort themselves out in the long-term.

iwontberude|1 month ago

Your point could have made sense but the amount of inference per request is also going up faster than the costs are going down.

supern0va|1 month ago

The parent said: "Of course, by then we'll have much more capable models. So if you want SOTA, you might see the jump to $10-12. But that's a different value proposition entirely: you're getting significantly more for your money, not just paying more for the same thing."

SOTA improvements have been coming from additional inference due to reasoning tokens and not just increasing model size. Their comment makes plenty of sense.

manmal|1 month ago

Is it? Recent new models tend to need fewer tokens to achieve the same outcome. The days of ultrathink are coming to an end, Opus is well usable without it.

unknown|1 month ago

[deleted]