> the small model category as a whole is seeing its share of usage decline.
It's important to remember that this data is from OpenRouter... an API service. Small models are exactly those that can be self-hosted.
It could be the case that total small model usage has actually grown, but people are self-hosting rather than using an API. OpenRouter would not be in a position to determine this.
Thank you & totally agree! The findings are purely observational through OpenRouter’s lens, so they naturally reflect usage on the platform, not the entire ecosystem.
Yeah, using an API aggregator to run a 7B model is economically strange if you have even a consumer GPU. OpenRouter captures the cream of complex requests (Claude 3.5, o1) that you can't run at home. But even for local hosting, medium models are starting to displace small ones because quantization lets you run them on accessible hardware, and the quality boost there is massive. So the "Medium is the new Small" trend likely holds true for the self-hosted segment as well.
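To put rough numbers on the quantization point: weight memory is approximately parameter count times bits per weight. This is a back-of-the-envelope sketch only. The model sizes (7B, 32B) and bit widths are illustrative assumptions, and it ignores KV cache, activations, and quantization overhead such as scales.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# Illustrative (assumed) configurations, not measurements of any specific model.
for name, params_b, bits in [
    ("7B at 16-bit", 7, 16),
    ("32B at 4-bit", 32, 4),
    ("32B at 16-bit", 32, 16),
]:
    print(f"{name}: ~{weight_memory_gb(params_b, bits):.0f} GB of weights")
```

Under these assumptions a 4-bit 32B model needs about as much weight memory as a 16-bit 7B model, which is the sense in which a medium model can land on the hardware a small one used to need.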
While it is possible to self-host small models, it is not easy to serve them at high speed. Many small-model use cases involve large batches of work (processing large numbers of documents, agentic workflows, ...), and for those it makes sense to use a provider with high tokens-per-second (tps) throughput.
Still, I agree that self-hosting is probably a part of the decrease.
The bigger issue is that they classify models as small based on a fixed total parameter count rather than the active parameters for MoE models, and they don't account for hardware improvements, etc. If they had defined small by price or computational cost, I think they would have seen an increase in small-model usage.
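A small sketch of how the bucket can flip depending on which parameter count you use. The thresholds (15B and 70B) and both models are assumptions made up for illustration, not the report's actual definitions or real models.

```python
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    total_params_b: float   # all parameters, in billions
    active_params_b: float  # parameters used per token (equals total for dense models)

def size_bucket(params_b: float, small_max: float = 15.0, medium_max: float = 70.0) -> str:
    """Bucket by parameter count; the cutoffs are assumed, not taken from the report."""
    if params_b < small_max:
        return "small"
    if params_b < medium_max:
        return "medium"
    return "large"

# Hypothetical models for illustration only.
models = [
    Model("hypothetical-dense-8b", total_params_b=8, active_params_b=8),
    Model("hypothetical-moe-100b-a10b", total_params_b=100, active_params_b=10),
]
for m in models:
    print(m.name, "| by total:", size_bucket(m.total_params_b),
          "| by active:", size_bucket(m.active_params_b))
```

The hypothetical MoE model counts as large by total parameters but small by active parameters, even though its per-token compute is closer to the dense 8B model.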
I like seeing stats like this, but I find it very concerning that OpenRouter has no qualms about inspecting its users' data.
Even if you assume that the classifier respects anonymity, if I pay for inference, I expect it to be a closed tube with my privacy respected.
If it were at least for "safety" checks, I wouldn't like it but I would almost understand; here it is just so they can have "marketing data".
Imagine (and given the state of the world, it might come soon) that WhatsApp or Telegram inspected all the messages you send in order to publish reports like:
- 20% of our users speak about their health issues
This is the inevitable evil of the man in the middle. OpenRouter by definition decrypts your traffic to route it to the provider (OpenAI, Anthropic). Technically, they can read everything.
The problem is that for the Enterprise segment, this is a showstopper. No bank or hospital will route data through an aggregator that openly states it classifies prompts via a Google API (even if only a sample of them). This confirms that OpenRouter remains a tool for indie hackers and researchers, not for serious B2B.
Why imagine? The world already functions exactly like that. Talk on Tg as if every chat is summarized every 24 hours and every month (with a cheap LLM first, then with strong ones if signals are found) and reported to all kinds of interested intel agencies.
Same for OpenRouter. Everything that leaves your device in plaintext = public. Period. No hopes.
Alex here from OpenRouter. We take privacy really seriously, and how we manage prompts and completions is described in our terms of service: https://openrouter.ai/terms.
We don’t retain any customer prompts or completions by default. As others here mentioned, you can opt in for a 1% discount. Prompt classification is performed using a zero-data-retention and zero-training service, just like our own.
Very interesting how Singapore ranks 2nd in terms of token volume. I wonder if this is potentially Chinese usage via VPN, or if Singaporean consumers and firms are dominating in AI adoption.
Also interesting how the 'roleplaying' category is so dominant, makes me wonder if Google's classifier sees a system prompt with "Act as a X" and classifies that as roleplay vs the specific industry the roleplay was intended to serve.
Almost certainly VPN traffic. Most major LLMs block both China and Hong Kong (surprisingly, not the other way around), so Singapore ends up being the fastest nearby endpoint that isn't restricted.
Or maybe it’s just strange classification. I see a lot of prompts on the internet looking like “act as a senior xxx expert with over 15 years of industry experience and answer the following: [insert simple question]”
I hope those are not classified as “roleplaying”. The “roleplay” here is just a trick to get a better answer from the model, often in a professional setting that has nothing to do with creative writing or NSFW stuff.
> I guess that also selects for people who would use openrouter
It definitely does. OpenRouter is pretty popular among roleplayers and creative writers due to having a wide variety of models available, sometimes providing free access to quality models such as DeepSeek, and lacking any sort of rules against generating "adult" content.
OpenRouter has an apps tab. If you look at the free, non-coding models, some of the apps featured are: janitor.ai, sillytavern, chub.ai. I'd never heard of them but people seem to be burning millions of tokens enjoying them.
If you rely on AI to write most of your code (instead of using it like Stackoverflow), Claude Code/OpenAI Codex subscriptions are cheaper than buying tokens. So those users are not on OpenRouter.
I'm not surprised. Roleplay means endless sessions with huge context (character history, world, previous dialogues). On commercial APIs (OpenAI/Anthropic), that long context costs a fortune. On OpenRouter, many OSS models, especially via providers like DeepInfra or Fireworks, cost pennies or are even free, like some Free-tier models. The RP community is very price-sensitive, so they migrate en masse to cheap OSS models via aggregators. It skews the stats but highlights a real niche for cheap inference.
That also stuck out for me, I was wondering if it was video games using openrouter for uptime / inference switching, video games would use a lot of tokens generating dialogue for a few programmer's villages.
I'm not surprised at all. The HN crowd thinks LLMs are mostly used for engineering because they live in a multi-layer bubble. Real people in the real world do all kinds of shit with LLMs that isn't productivity or even work related.
> The noticeable spike [~20 percentage points] in May in the figure above [tool invocations] was largely attributable to one sizable account whose activity briefly lifted overall volumes.
The fact that one account can have such a noticeable effect on token usage is kind of insane. And also raises the question of how much token usage is coming from just one or five or ten sizeable accounts.
It is quite interesting to ponder these usage statistics, isn't it?
According to their charts they're at a throughput of something like 7T tok/week total now. At $1/Mtok, that's $7M per week. Less than half a billion per year. How much is that compared to the total inference market? And then again, their throughput went up something like 20x in one year, so who knows what's to come...
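The arithmetic, spelled out as a quick sketch. The 7T tokens/week figure is the rough reading of the charts from the comment above, and the $1 per million tokens blended price is purely an assumption, so the result is an order-of-magnitude estimate of token spend, not anyone's actual revenue.

```python
TOKENS_PER_WEEK = 7e12     # ~7T tokens/week, as read off the charts (approximate)
PRICE_PER_MTOK_USD = 1.0   # assumed blended price in dollars per million tokens

def token_spend_usd(tokens: float, price_per_mtok: float) -> float:
    """Dollar value of a token volume at a flat price per million tokens."""
    return tokens / 1e6 * price_per_mtok

weekly = token_spend_usd(TOKENS_PER_WEEK, PRICE_PER_MTOK_USD)
yearly = weekly * 52
print(f"weekly: ${weekly / 1e6:.0f}M, yearly: ${yearly / 1e6:.0f}M")
```

At the assumed price this comes to $7M per week and roughly $364M per year, consistent with "less than half a billion".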
The 'Glass slipper' idea makes sense to me; people have a bunch of different ideas to try on AIs, and try it as new models come out, and once a model does it well they stick with it for a while.
This is interesting, but I found it moderately disturbing that they spend a LOT of effort up front talking about how they don’t have any access to the prompts or responses. And then they reveal that they did actually have access to the text and they spend 80% of the rest of the paper analyzing the content.
I worry that OpenRouter's Apps leaderboard incentivizes tools (e.g. Cline/Kilo) to burn through tokens to climb the ranks, meanwhile penalizing being context-efficient.
The open weight model data is very interesting. I missed the release of Minimax M2. The benchmarks seem insanely impressive for its size. I would suspect benchmaxing but why would people be using it if it wasn’t useful?
The 4x growth in prompt length is a fundamental shift. We've quickly moved from "Q&A" mode to "upload full context and analyze" mode.
This completely changes infrastructure requirements: KV-caching becomes a necessity, and prefill time becomes a critical metric, often more important than generation speed. That's exactly why models with cheap long context (Gemini, DeepSeek) are winning the race against "smarter" but expensive models. Inference economics are now dictated by context length
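For a sense of scale, here is a rough sketch of how KV-cache memory grows with context length, using the usual estimate of 2 (keys and values) x layers x KV heads x head dimension x bytes per element x tokens. The model shape below (32 layers, 8 KV heads, head dimension 128, 16-bit cache) is a hypothetical configuration chosen for illustration, not any particular model.

```python
def kv_cache_bytes(seq_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Approximate KV-cache size for one sequence (assumed, hypothetical model shape)."""
    # Factor 2: one tensor for keys and one for values, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

for seq_len in (4_096, 16_384):
    gib = kv_cache_bytes(seq_len) / 2**30
    print(f"{seq_len} tokens: {gib:.2f} GiB per sequence")
```

With this shape the cache grows linearly, so a 4x longer prompt means 4x the cache per request (0.5 GiB to 2 GiB here), before counting the extra prefill compute.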
These are fantastic insights! I work in the legaltech space, so something to keep in mind is that the legal space is very sensitive to data storage and security (apart from this of course: https://alexschapiro.com/security/vulnerability/2025/12/02/f...). So models hosted in e.g. Azure, or on-prem deployments, are more common. I have friends in the health space and it's a similar story there. Finance (banking especially) is the same. Hence why those categories look more or less constant over time, and have the smallest contributions in this study.
Here is the thing: they made good enough open weight models available and affordable, then found that people used them more than before. I am not trying to diminish the value here but I don’t think this is the headline.
> The metric reflects the proportion of all tokens served by reasoning models, not the share of "reasoning tokens" within model outputs.
I'd be interested in a clarification on the reasoning vs non-reasoning metric.
Does this mean the reasoning total is (input + reasoning + output) tokens? Or is it just (input + output).
Obviously the reasoning tokens would add a ton to the overall count. So it would be interesting to see it on an apples to apples comparison with non reasoning models.
As would models that are overly verbose. My experience is that Claude tends to do more than is asked for (e.g. immediately move on to creating tests and documentation) while other models like Gemini tend to be more concise in what they do.
I'm out of time but "reasoning input tokens" from fortune 5000 engineers sounds like a lobotomized LSD dream, would you care to elaborate on how you distinguish between reasoning and non-reasoning? vs "question on duty"?
It was (is?) free with eg. opencode -- so, open-source coding agent + free sota model, it's hard to resist. That said, grok fast is fast, but not that great when compared to the other top tier models.
It's a 1.7 trillion token free model. Why wouldn't you try it?
I've been testing free models for coding hobby projects after I burnt through way too many expensive tokens on Replit and Claude. Grok wasn't great, kept getting into loops for me. I had better results using KAT coder on opencode (also free).
Kilo Code lets people use Grok Code Fast 1 for free, using OpenRouter as the provider. And Grok 4.1 Fast was completely free directly on OpenRouter for some time after its release.
So yeah, their statistics are inflated quite a bit, since most of that usage was not paid for, or at least not by the end user.
They have SOTA models from OpenAI and Anthropic and Google and you can access them at a 5.5% premium. What you get is the ability to seamlessly switch between them. And also when one is down you can instantly switch to another. Whether that is valuable to you or not is use case dependent. But it isn’t without value.
What it does have I think is a problem that TaskRabbit had: you can hire a house cleaner through TR but once you find a good one you can just work directly with them and save the middleman fee. So OR is great for experimenting with a ton of models to see what is the cheapest one that still performs the tasks you need but then you no longer need OR unless it is for reliability.
Statistical significance comes mostly from N (number of samples) and the variance on the dimension you're trying to measure[^1]. If the variance is high, you'll need a higher N. If the variance is low, a lower N will do. The percentage of the population is not relevant (N = 1000 might be significant, and it doesn't matter whether that is 1% or 30% of the population).
[^1] This is a simplification. I should say that it depends on the standard error of your statistic, i.e., the thing you're trying to measure (if you're estimating the max of a population, that's going to require more samples than if you're estimating the mean). This standard error, in turn, will depend on the standard deviation of the dimension you're measuring. For example, if you're estimating the mean height, the relevant quantity is the standard deviation of height in the population.
For example, even 300 truly random people are enough to correctly ascertain the distribution of the population for some measurement (say, some personality feature).
Because the accuracy of an estimated quantity mostly depends on the size of the sample, not on the size of the population [1]. This does require assumptions like somewhat homogenous population and normal distributions etc. However, these assumptions often hold.
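A quick way to see this is the textbook margin of error for an estimated proportion, with and without the finite population correction. This is a sketch under the usual assumptions (simple random sampling, normal approximation, worst-case p = 0.5). The sample and population sizes are arbitrary illustrative values, not OpenRouter's actual numbers.

```python
import math

def margin_of_error(n: int, p: float = 0.5, z: float = 1.96, population=None) -> float:
    """Half-width of an approximate 95% confidence interval for a proportion."""
    se = math.sqrt(p * (1 - p) / n)
    if population is not None:
        # Finite population correction: close to 1 whenever n is much smaller than the population.
        se *= math.sqrt((population - n) / (population - 1))
    return z * se

# The margin shrinks with the sample size n...
for n in (300, 1_000, 10_000):
    print(f"n={n}: +/- {margin_of_error(n) * 100:.1f} percentage points")

# ...but barely moves when the population changes by a factor of 1000.
for population in (100_000, 100_000_000):
    moe = margin_of_error(1_000, population=population)
    print(f"n=1000 out of {population}: +/- {moe * 100:.2f} percentage points")
```

So a 0.25% sample can be perfectly adequate if 0.25% of the traffic is still a large absolute number of prompts. Whether the sample is actually random is a separate question.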
I am a person who wants to maintain a distance from the AI-hype train, but seeing a chart like this [1], I can't help but think that we are nowhere near the peak. The weekly token consumption keeps on rising, and it's already in trillions, and this ignores a lot of consumption happening directly through APIs.
Nvidia could keep delivering record-breaking numbers, and we may well see multiple companies hit six, seven, or even eight trillion dollars in market cap within a couple of years. While I am skeptical of claims like AI will make programming obsolete, it’s clear that adoption is still going like crazy and it's hard to anticipate when the plateau happens.
Problem is that most of that growth is in models being underpriced. We don't know what the demand curve looks like when tokens are priced to cover the full operating costs of the companies making them.
When it's as cheap as 5 cents per million tokens I don't see "trillions" as being a particularly large number. Even at the most expensive level ($120/1M for 5 Pro), 100 trillion tokens is only about $12 billion.
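Spelling out that range as a quick sketch: the two prices are the ones quoted above ($0.05 and $120 per million tokens), and applying a single flat price to all 100 trillion tokens is a simplifying assumption.

```python
def total_cost_usd(tokens: float, price_per_mtok: float) -> float:
    """Cost of a token volume at a flat price per million tokens."""
    return tokens / 1e6 * price_per_mtok

TOKENS = 100e12  # 100 trillion tokens

for label, price in [("cheap end, $0.05/Mtok", 0.05), ("expensive end, $120/Mtok", 120.0)]:
    print(f"{label}: ${total_cost_usd(TOKENS, price):,.0f}")
```

That is about $5 million at the cheap end and $12 billion at the expensive end, so the same token count spans more than three orders of magnitude in dollars.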
lukev|2 months ago
I do question this finding:
> the small model category as a whole is seeing its share of usage decline.
It's important to remember that this data is from OpenRouter... an API service. Small models are exactly those that can be self-hosted.
It could be the case that total small model usage has actually grown, but people are self-hosting rather than using an API. OpenRouter would not be in a position to determine this.
maikakz|2 months ago
veunes|2 months ago
mzl|2 months ago
YetAnotherNick|2 months ago
greatgib|2 months ago
Even if you assume that the classifier respects anonymity, if I pay for inference, I expect it to be a closed tube with my privacy respected. If it were at least for "safety" checks, I wouldn't like it but I would almost understand; here it is just so they can have "marketing data".
Imagine (and given the state of the world, it might come soon) that WhatsApp or Telegram inspected all the messages you send in order to publish reports like:
- 20% of our users speak about their health issues
- 30% of messages are about annoying coworkers
- 15% are messages comparing dick sizes
stingraycharles|2 months ago
veunes|2 months ago
est|2 months ago
Even better: they send all the data to GoogleTagClassifier, which means Google now has a copy of the sample as well.
homakov|2 months ago
heliumtera|2 months ago
Lol hahaha
xanderatallah|2 months ago
majdalsado|2 months ago
olalonde|2 months ago
syspec|2 months ago
I'm pretty surprised by that, but I guess that also selects for people who would use openrouter
IMTDb|2 months ago
bakugo|2 months ago
djfergus|2 months ago
raincole|2 months ago
veunes|2 months ago
ceroxylon|2 months ago
cess11|2 months ago
lm28469|2 months ago
m0rde|2 months ago
nhaehnle|2 months ago
trebligdivad|2 months ago
skywhopper|2 months ago
charcircuit|2 months ago
I'm not seeing that. All I'm seeing is them analyzing metadata.
paulirish|2 months ago
https://openrouter.ai/rankings#apps
sosodev|2 months ago
pestaa|2 months ago
veunes|2 months ago
armcat|2 months ago
IgorPartola|2 months ago
themanmaran|2 months ago
ribosometronome|2 months ago
reeeli|2 months ago
adidoit|2 months ago
Most of the high-volume enterprise use cases go through their cloud providers (e.g., Azure).
What we have here is mostly from smaller players. Good data but obviously a subset of the inference universe.
asadm|2 months ago
btbuildem|2 months ago
djfergus|2 months ago
bakugo|2 months ago
joshuamcginnis|2 months ago
nextworddev|2 months ago
All this data confirms that OpenRouter’s enterprise ambitions will fail. It’s a nice product for running Chinese models tho
IgorPartola|2 months ago
swyx|2 months ago
meander_water|2 months ago
> OpenRouter performs internal categorization on a random sample comprising approximately 0.25% of all prompts
How can you arrive at any conclusion with such a small random sample size?
hoppoli|2 months ago
piskov|2 months ago
For example, even 300 truly random people are enough to correctly ascertain the distribution of the population for some measurement (say, some personality feature).
That’s the basis of all polls and what have you
abdullahkhalids|2 months ago
[1] https://stats.stackexchange.com/questions/166/how-do-you-dec...
jfrbfbreudh|2 months ago
shubhamjain|2 months ago
[1]: https://openrouter.ai/state-of-ai#open-vs_-closed-source-mod...
mike_hearn|2 months ago
Also, growth seems to be linear, not exponential.
hattmall|2 months ago
adamraudonis|2 months ago
typs|2 months ago