> Some coding agents (Shelley included!) refuse to return a large tool output back to the agent after some threshold. This is a mistake: it's going to read the whole file, and it may as well do it in one call rather than five.
disagree with this: IMO the primary reason that these still need to exist is for when the agent messes up (e.g. reads a file that is too large, like a bundle file), or when you run a grep command in a large codebase and end up hitting way too many files, overloading context.
Otherwise lots of interesting stuff in this article! Having a precise calculator was very useful for figuring out how many things we should be putting into an agent loop to get a cost optimum (and not just a performance optimum) for our tasks, which is an area that's been pretty underserved.
I think that's reasonable, but then the agent should have the ability to override it on the next call. Even if it requires the agent to have read the file once or something.
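That override could be wired up very simply. A minimal sketch of a hypothetical harness-side file-read tool (the threshold, function name, and `force_full` flag are all my assumptions, not from any real harness): the first read is truncated, and a later call may request the full contents.

```python
# Hypothetical harness-side read tool: truncate large output on first sight,
# but allow the agent to override on a later call once it has seen the file.
TRUNCATE_AT = 10_000  # characters; an assumed threshold

_seen_files: set[str] = set()

def read_file(path: str, text: str, force_full: bool = False) -> str:
    """Return file text, truncated on first sight unless force_full is set."""
    if force_full and path in _seen_files:
        # The agent has already seen the truncated version; honor the override.
        return text
    _seen_files.add(path)
    if len(text) <= TRUNCATE_AT:
        return text
    return (text[:TRUNCATE_AT]
            + f"\n[truncated {len(text) - TRUNCATE_AT} chars; "
              "re-call with force_full=true for the rest]")
```

The point of gating `force_full` on a prior read is exactly the "even if it requires the agent to have read the file once" idea: the guardrail still catches the first accidental read of a bundle file.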
In the absence of that you end up with what several of the harnesses ended up doing, where an agent will use a million tool calls to very slowly read a file in 200-line chunks. I think they _might_ have fixed it now (or my agent harness might be fixing it for me), but Codex used to do this and it made it unbelievably slow.
> when you run a grep command in a large codebase and end up hitting way too many files, overloading context.
On the other hand, I despise that it automatically pipes things through output-limiting commands like `grep` with a filter, `head`, `tail`, etc. I would much rather it try to read the full grep output and then decide to filter down from there if the output is too large -- that's exactly what I do when I follow the same workflow I told it to do.
Why? Because piping through output-limiting commands can hide the scope of the "problem" I'm looking at. I'd rather see that scope first so I can decide whether I need to change from a tactical view/approach to a strategic view/approach. It would be handy if the agents could do the same thing -- and I suppose they could if I were a little more explicit about it in my tool/prompt.
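One way to give the agent that scope-first behavior is a search tool that, when results blow the budget, returns per-file match counts rather than a `head`-style slice. This is a minimal sketch under my own assumptions (the budget, the in-memory `files` dict, and the output format are all illustrative), not any real harness's tool:

```python
# Grep-like tool that reports the *scope* of an oversized result set
# (per-file match counts) instead of silently truncating it.
import re

MAX_LINES = 50  # assumed output budget

def scoped_grep(pattern: str, files: dict[str, str]) -> str:
    rx = re.compile(pattern)
    hits = {name: [ln for ln in text.splitlines() if rx.search(ln)]
            for name, text in files.items()}
    hits = {k: v for k, v in hits.items() if v}
    total = sum(len(v) for v in hits.values())
    if total <= MAX_LINES:
        # Small enough: return the actual matching lines.
        return "\n".join(f"{name}:{ln}"
                         for name, lines in hits.items() for ln in lines)
    # Too big: show the shape of the problem, not a truncated slice of it.
    summary = "\n".join(f"{name}: {len(lines)} matches"
                        for name, lines in sorted(hits.items()))
    return f"{total} matches across {len(hits)} files:\n{summary}"
```

The summary costs a handful of tokens but lets the agent decide, like a human would, whether to narrow the pattern or switch to a more strategic approach.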
> By 50,000 tokens, your conversation’s costs are probably being dominated by cache reads.
Yeah, it's a well-known problem. Every AI company is working on ways to deal with it, one way or another, with clever data center design, and/or clever hardware and software engineering, and/or with clever algorithmic improvements, and/or with clever "agentic recursive LLM" workflows. Anything that actually works is treated like a priceless trade secret. Nothing that can put competitors at a disadvantage will get published any time soon.
There are academics who have been working on it too, most notably Tri Dao and Albert Gu, the key people behind FlashAttention and SSMs like Mamba. There are also lots of ideas out there for compressing the KV cache. No idea if any of them work. I also saw this recently on HN: https://news.ycombinator.com/item?id=46886265 . No idea if it works but the authors are credible. Agentic recursive LLMs look most promising to me right now. See https://arxiv.org/abs/2512.24601 for an intro to them.
What do you think about RLMs? At first blush it looks like subagents with some sprinkles on top, but people who have become more adept with it seem to show it can keep context scaling sublinear very effectively.
The quadratic curve makes sense, but honestly what kills us more is the review cost - AI generates code fast, but then you're stuck reading every line because it might've missed some edge case or broken something three layers deep. We burn more time auditing AI output than we save on writing it, and that compounds. The API costs are predictable, at least.
> AI generates code fast but then you're stuck reading every line because it might've missed some edge case or broken something three layers deep
If the abstraction that the code uses is "right", there will be hardly any edge cases, and hardly anything to break three layers deep.
Even though I am clearly an AI-hater, for this very specific problem I don't see the root cause in these AI models, but in the programmers who don't care about code quality and thus fail to brutally reject code that is not of exceptional quality.
> then you're stuck reading every line because it might've missed some edge case or broken something
This is what tests are for. Humans famously write crap code. They read it and assume they know what's going on, but actually they don't. Then they modify a line of code that looks like it should work, and it breaks 10 things. Tests are there to catch when it breaks so you can go back and fix it.
Agents are supposed to run tests as part of their coding loops, modifying the code until the tests pass. Of course reward hacking means the AI might modify the test to 'just pass' to get around this. So the tests need to be protected from the AI (in their own repo, a commit/merge filter, or whatever you want) and curated by humans.

Initial creation by the AI based on user stories, but test modifications go through a PR process and are scrutinized. You should have many kinds of tests (unit, integration, end-to-end, regression, etc), and you can have different levels of scrutiny (maybe the AI can modify unit tests on the fly, and in PRs you only look at the test modifications to ensure they're sane). You can also have a different agent with a different prompt do a pre-review to focus only on looking for reward hacks.
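The merge-filter idea above can be sketched in a few lines. The path prefixes here are purely illustrative (every repo will carve this up differently); the point is that the gate runs outside the agent's reach, on the list of paths its diff touched:

```python
# Merge-gate sketch: flag any agent-authored change to protected test paths
# so it must go through a human-reviewed PR instead of landing directly.
PROTECTED = ("tests/integration/", "tests/e2e/", "tests/regression/")
# tests/unit/ is deliberately absent: maybe the AI can modify those freely.

def check_diff(changed_paths: list[str]) -> list[str]:
    """Return the changed paths that require human review before merge."""
    return [p for p in changed_paths if p.startswith(PROTECTED)]
```

A CI job would call this with `git diff --name-only` output and fail the merge when the returned list is non-empty, which is one concrete form of "tests protected from the AI".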
> AI generates code fast but then you're stuck reading every line because it might've missed some edge case or broken something three layers deep
I imagine that in the future this will be tackled with a heavily test-driven approach and tight regulation of what the agent can and cannot touch. So frequent small PRs over big ones. Limit folder access to only the folders that need changing. Let it build the project. If it doesn't build, no PR submissions allowed. If a single test fails, no PR submissions allowed. And the tests will likely be the first, if not the main, focus in LLM PRs.
I use the term "LLM" and not "AI" because I notice that people have started attributing LLM-related issues (like ripping off copyrighted material, excessive usage of natural resources, etc.) to AI in general, which is damaging for the future of AI.
What surprises me is that this obvious inefficiency isn't competed out of the market. I.e., this is clearly such a suboptimal use of time, and yet lots of companies do it and don't get competed out by the ones that don't.
I disagree. I used to spend most of my time writing code, fixing syntax, thinking through how to structure the code, looking up documentation on how to use a library.
Now I first discuss with an AI Agent or ChatGPT to write a thorough spec before handing it off to an agent to code it. I don’t read every line. Instead, I thoroughly test the outcome.
Bugs that the AI agent would write, I would have also written. An example is unexpected data that doesn't match expectations. Can't fault the AI for those bugs.
I also find that the AI writes more bug-free code than I did. It handles cases that I wouldn't have thought of. It uses best practices more often than I did.
Maybe I was a bad dev before LLMs but I find myself producing better quality applications much quicker.
I'm not sure, but I think cached-read costs are not the most accurately priced. If you consider your costs to be what you pay when consuming an API endpoint, then the answer will be 50k tokens, sure. But if you consider how much it costs the provider, cached tokens probably carry a way higher margin than (the probably negative margin of) input and output inference tokens.
Most caching is done without hints from the application at this point, but I think some APIs are starting to take hints or explicit controls for keeping state associated with specific input tokens in memory, so these costs will go down. In essence you don't really reprocess the input tokens at inference. If you own the hardware, it's quite trivial to infer one output token at a time with no additional cost: if you have 50k input tokens and you generate 1 output token, it's not like you have to "re-infer" the 50k input tokens before you output the second token.
To put it in simple terms, the time it takes to generate the Millionth output token is the same as the first output token.
This is relevant in an application I'm working on where I check the logprobs and don't always choose the most likely token (for example by implementing a custom logit_bias mechanism client-side), so you can infer 1 output token at a time. This is not quite possible with most APIs, but if you control the hardware and use (virtually) zero-cost cached tokens, you can do it.
So, bottom line: cached input tokens are almost virtually free naturally (unless you hold them for a long period of time); the price of cached-input APIs is probably due to the lack of API negotiation as to which inputs you want to cache. As APIs and self-hosted solutions evolve, we will likely see the cost of cached inputs massively drop to almost 0. With efficient application programming, the only accounting should be for output tokens and system prompts. Your output tokens shouldn't be charged again as inputs, at least not more than once.
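Whatever the provider's true marginal cost, the billed numbers are easy to check. Here's a toy model of an agent conversation under illustrative per-token prices (the rates are my assumptions, roughly in the ballpark of current big-lab APIs, not any specific provider's): each turn re-reads the whole history at the cache-read rate, so that term grows quadratically with turn count and eventually dominates, matching the article's "dominated by cache reads" claim.

```python
# Toy conversation-cost model with illustrative (assumed) per-token prices.
CACHE_READ = 0.30 / 1e6   # $/token for cached input (assumed rate)
INPUT      = 3.00 / 1e6   # $/token for fresh input (assumed rate)
OUTPUT     = 15.00 / 1e6  # $/token for output (assumed rate)

def conversation_cost(turns: int, new_in: int = 500, out: int = 500):
    """Split total cost into cache-read spend vs. fresh input/output spend."""
    context = 0          # tokens of accumulated history
    cache_reads = 0.0    # grows ~quadratically: sum over turns of context size
    fresh = 0.0          # grows linearly: each turn's new tokens
    for _ in range(turns):
        cache_reads += context * CACHE_READ   # re-read everything so far
        fresh += new_in * INPUT + out * OUTPUT
        context += new_in + out
    return cache_reads, fresh

reads, fresh = conversation_cost(turns=100)  # ~100k tokens of history
```

With these assumed rates, by 100 turns the cache-read line item (~$1.49) already exceeds all fresh input and output combined (~$0.90), even though cached tokens are billed at a tenth of the fresh-input rate.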
While some efficiencies could be gained from better client-server negotiation, the cost will never be 0. It isn't 0 even in "lab conditions", so it can't be 0 at scale. There are a few misconceptions in your post.
> the time it takes to generate the Millionth output token is the same as the first output token.
This is not true, even if you have the KV cache "hot" in VRAM. That's just not how transformers work.
> cached input tokens are almost virtually free naturally
No, they are not in practice. There are pure engineering considerations here: how do you route requests, when do you evict the KV cache, where do you evict it to (RAM/NVMe), how long do you keep it, etc. At the scale of OpenAI/Google/Anthropic these are not easy tasks, and the cost is definitely not 0.
Think about a normal session. A user might prompt something, wait for the result, re-prompt (you hit "hot" cache) and then go for a coffee. They come back 5 minutes later. You can't keep that in "hot" cache. Now you have to route the next message in that thread to a) a place where you have free "slots"; b) a place that can load the kv cache from "cold" storage and c) a place that has enough "room" to handle a possible max ctx request. These are not easy things to do in practice, at scale.
Now consider 100k users doing basically this, all day long. This is not free and can't become free.
GPU VRAM has an opportunity cost, so caching is never free. If that RAM is being used to hold KV caches in the hope that they'll be useful in future, but you lose that bet and you never hit that cache, you lost money that could have been used for other purposes.
Even if caching itself were free, I think making it cost nothing at the API level is not a great idea, considering that LLM attention is currently more expensive with more tokens in context.
Making caching free would price "100000 token cache, 1000 read, 1000 write" the same as "0 token cache, 1000 read, 1000 write", whereas the first one might cost more compute to run. I might be wrong at the scale of the effect here though.
> To put it in simple terms, the time it takes to generate the Millionth output token is the same as the first output token.
This is wrong. Current models still use some full attention layers AFAIK, and their computational cost grows linearly (per token) with the token number.
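A quick back-of-the-envelope check of that claim: under full attention, producing token i requires attending over the i positions before it, so per-token cost grows linearly with position, and total decode cost over n tokens grows with n², even with a perfectly hot KV cache. A minimal sketch (unit "cost" per attended pair is an abstraction, not real FLOP counts):

```python
# Per-token attention cost grows linearly with position; the total is
# therefore quadratic in sequence length, KV cache or not.
def attention_cost(n_tokens: int) -> int:
    # Token i attends over i prior positions; sum_{i=1..n} i = n(n+1)/2.
    return sum(i for i in range(1, n_tokens + 1))
```

Doubling the sequence roughly quadruples total attention work: `attention_cost(2000) / attention_cost(1000)` is just under 4, which is why the millionth output token cannot cost the same as the first.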
Nice article. I think a key part of the conversation is getting people to start thinking in terms of evals [1] and observability, but it's been quite tough to combat the hype of "but X magic product just solves what you mentioned as a concern for you".
You'd think cost is an easy talking point to help people care but the starting points for people are so heterogeneous that it's tough to show them they can take control of this measurement themselves.
I say the latter because the article is a point in time, and if they don't have recurrent observation around this, some aspects may radically change depending on the black-box implementations of the integrations they depend on (or even the pricing strategies).
What I've learned running multi-agent workflows:
>use the expensive models for planning/design and the cheaper models for implementation
>stick with small/tightly scoped requests
>clear the context window often and let the AGENTS.md files control the basics
There’s something of a paradox there. Reduce the context window and work on smaller/tightly scoped requests? Isn’t the whole value proposition that I can work much faster? To do that, I naturally try to describe what I want at a higher, vaguer level.
Depends on which cache you mean. The KV cache gets read on every token generated, but the prompt cache (which is what incurs the cache-read cost) is read at the start of each request in the conversation.
In my experience building agent pipelines, the real cost explosion happens in tool-call chains - each hop multiplies tokens in ways that are hard to anticipate. Adding structured logging per step helped me identify and cut the worst offenders.
TFA is talking about being quadratic in dollar cost as the conversation goes on, not quadratic in time complexity as n gets larger.
Edit to add: I see you are a new account and all your comments thus far are of a similar format, which seems highly suspicious. In the unlikely event you are a human, please read the hacker news guidelines https://news.ycombinator.com/newsguidelines.html
> Instead of feeding 500 lines of tool output back into the next prompt
Applies for everything with LLMs.
Somewhere along the way, it seems like most people got the idea that "more text == better understanding", whereas reality seems to be the opposite: the fewer tokens you can give the LLM, with only the absolute essentials, the better.
The trick is to find the balance, but the "more == better" assumption many users seem to operate under seems to be making things worse, not better.
Obviously you don't have to throw the data away: if the initial summary was missing some important detail, the agent can ask for additional information from a subthread/task/tool call.
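That summarize-but-keep pattern is easy to sketch: the harness stores the full tool output server-side and hands the agent a short summary plus a handle it can redeem later. All names here (`summarize_tool_output`, `fetch_detail`, the in-memory store) are hypothetical, not any real agent framework's API:

```python
# Sketch: return a summary plus a handle; keep the full tool output
# out of the prompt until the agent actually asks for it.
import uuid

_store: dict[str, str] = {}

def summarize_tool_output(full_output: str, max_chars: int = 200) -> dict:
    """Stash the full output; give the agent a preview and a retrieval key."""
    handle = str(uuid.uuid4())
    _store[handle] = full_output
    return {"summary": full_output[:max_chars],
            "handle": handle,
            "note": "call fetch_detail(handle) for the full output"}

def fetch_detail(handle: str) -> str:
    """Redeem a handle for the complete, unabridged tool output."""
    return _store[handle]
```

The conversation only ever carries the ~200-character summary plus a handle; the 500 lines of raw output are a single extra tool call away when (and only when) a detail turns out to matter.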
[1] https://ai-evals.io/
vatsachak|14 days ago
LLMs will have to eventually cross this hurdle before they become our replacements