Context Rot: How increasing input tokens impacts LLM performance
260 points | kellyhongsn | 7 months ago | research.trychroma.com
TLDR: Model performance is non-uniform across context lengths, including state-of-the-art GPT-4.1, Claude 4, Gemini 2.5, and Qwen3 models.
This highlights the need for context engineering. Whether relevant information is present in a model’s context is not all that matters; what matters more is how that information is presented.
Here is the complete open-source codebase to replicate our results: https://github.com/chroma-core/context-rot
posnet | 7 months ago
Especially with Gemini Pro when providing long-form textual references: providing many documents in a single context window gives worse answers than having it summarize the documents first, asking a question about the summaries only, and then providing the full text of the sub-documents on request (RAG-style, or just a simple agent loop).
Similarly, I've personally noticed that Claude Code with Opus or Sonnet gets worse the more compactions happen. It's unclear to me whether the summary itself gets worse, or whether the context window ends up with a higher percentage of less relevant data, but even clearing the context and asking it to re-read the relevant files (even if they were mentioned and summarized in the compaction) gives better results.
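The summarize-first, fetch-on-demand loop described above can be sketched roughly as follows; `llm` stands in for any chat-completion call, and the prompt strings and `NEED:<name>` convention are illustrative assumptions, not a real API:

```python
def summarize_then_answer(llm, question, documents):
    """Two-stage retrieval loop: summaries first, full text only on request."""
    # Stage 1: summarize each document individually, keeping each call's context small.
    summaries = {name: llm(f"Summarize:\n{text}") for name, text in documents.items()}

    # Stage 2: ask the question against the summaries only.
    overview = "\n".join(f"[{name}] {s}" for name, s in summaries.items())
    answer = llm(
        f"Question: {question}\nSummaries:\n{overview}\n"
        "If a summary is insufficient, reply NEED:<name>."
    )

    # Stage 3 (simple agent loop): provide full text only for documents the model asks for.
    while answer.startswith("NEED:"):
        name = answer.split(":", 1)[1].strip()
        answer = llm(f"Question: {question}\nFull text of {name}:\n{documents[name]}")
    return answer
```

The point is that no single call ever sees all documents at once, which is exactly the failure mode the comment describes.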
zwaps | 7 months ago
Long story short: Context engineering is still king, RAG is not dead
irskep | 7 months ago
The thing that would signal context rot is when you approach the auto-compact threshold. Am I thinking about this right?
lukev | 7 months ago
It's actually even more significant than is easy to benchmark (though I'm glad this paper has done so).
Truly useful LLM applications live at the boundaries of what the model can do. That is, attending to some aspect of the context that might be several logical "hops" away from the actual question or task.
I suspect that the context rot problem gets much worse for these more complex tasks... in fact, exponentially so for each logical "hop" which is required to answer successfully. Each hop compounds the "attention difficulty" which is increased by long/distracting contexts.
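As a toy illustration of that compounding (with made-up per-hop accuracies, not numbers from the paper): if a long, distracting context drops per-hop retrieval accuracy from 0.98 to 0.90, a five-hop task falls from roughly 90% to roughly 59% end-to-end:

```python
def multi_hop_success(per_hop_accuracy, hops):
    """Toy model: each hop must independently succeed, so end-to-end
    success decays exponentially in the number of hops."""
    return per_hop_accuracy ** hops

short_ctx = multi_hop_success(0.98, 5)  # ~0.904
long_ctx = multi_hop_success(0.90, 5)   # ~0.590
```

Small per-hop degradation turns into large end-to-end degradation, which is the "exponentially worse" intuition above.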
milchek | 7 months ago
The best results seem to come from clear, explicit instructions and an up-front plan for a discrete change or feature, with the relevant files to edit dragged into the context prompt.
Workaccount2 | 7 months ago
Instead, I'll have a good session going, but then the model fumbles for 20k tokens and the session is heavily rotted. Let me cut that stretch out!
snickerdoodle12 | 7 months ago
LLMs-as-a-service don't offer this because it makes it trivial to bypass their censoring.
boesboes | 7 months ago
I'm sure it's all my poor prompting and context, but it really seems like Claude has lost 30 IQ points in the last few weeks.
SketchySeaBeast | 7 months ago
Does this not feel like gaslighting we've all now internalized?
blixt | 7 months ago
One paper that stood out to me a while back was Many-Shot In-Context Learning[1] which showed large positive jumps in performance from filling the context with examples.
As always, it's important to test on one's own problem to know how the LLM's behavior changes with different context contents and lengths; I wouldn't assume a longer context is always worse.
[1] https://arxiv.org/pdf/2404.11018
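For the unfamiliar, many-shot ICL just means packing the prompt with labeled examples ahead of the query; a minimal sketch (the formatting is my own illustration, not taken from the paper):

```python
def build_many_shot_prompt(examples, query):
    """Build a many-shot prompt: every (input, output) pair becomes a
    demonstration, followed by the query with an open Output slot."""
    shots = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    return f"{shots}\n\nInput: {query}\nOutput:"
```

The paper's result is that scaling `examples` into the hundreds can keep improving performance, which is the counterpoint to "longer context is always worse".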
orbital-decay | 7 months ago
ICL is a phenomenon separate from long-context performance degradation; the two can coexist, similarly to how lost-in-the-middle affects the performance of in-context examples at different positions just as it affects everything else.
zwaps | 7 months ago
Media literacy disclaimer: Chroma is a vectorDB company.
magicalhippo | 7 months ago
I've noticed this issue as well with smaller local models that have relatively long contexts, say an 8B model with a 128k context.
I imagined they performed special recall training for these long-context models, but the results seem... not so great.
jpcompartir | 7 months ago
My hunch would be that even if we had a lot more annotated examples of reasoning and retrieval over 10,000+ tokens, the architectures we have today would still be limited.
psadri | 7 months ago
https://www.notion.so/LLM-Context-Engineering-21b814d6a64980...
Some of these are in use in an in-house AI chat application that has a heavy emphasis on tool calls.
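One common technique in tool-heavy chats, guessed at here rather than taken from the linked list, is to keep only the most recent tool results verbatim and truncate older ones to a stub so they stop crowding the context:

```python
def prune_tool_results(messages, keep_last=2, stub="[tool output elided]"):
    """Replace the content of all but the last `keep_last` tool messages
    with a short stub, leaving user/assistant turns untouched."""
    tool_indices = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    old = set(tool_indices[:-keep_last]) if keep_last else set(tool_indices)
    return [
        {**m, "content": stub} if i in old else m
        for i, m in enumerate(messages)
    ]
```

This keeps the conversational thread intact while bounding how much stale tool output accumulates in the window.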
namibj | 7 months ago
It may be that dimension-starved pretrained transformer models rely heavily on context being correctly "tagged" in all relevant aspects the instant it's inserted into the KV cache, e.g. requiring negation to be prefixed to a fact rather than allowing post-fix negation. The common LLM chat case is telling the model it just spewed hallucinated/wrong claims, and hoping this will help rather than hurt downstream performance as the chat continues. There the negation is very delayed, so it is not present in most of the tokens that encode the hallucinated claims in the KV cache; for lack of sufficient positional precision, due to insufficient dimensionality [0], the transformer can't retroactively attribute the "that was wrong" claim to the hallucination tokens in a retrievable manner.
The result, of course, is the behavior we experience: hallucinations are corrected by editing the message that triggered them to include discouraging words, as otherwise the thread becomes near-useless from the hallucination context pollution.
I do wonder whether we have figured out how to do this more scalably than just naively raising the query dimension to get (back?) closer to sequence length.
[0]: https://arxiv.org/abs/2002.07028
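The workaround described, editing the triggering message rather than appending a correction, can be sketched as a rewind over an OpenAI-style message list (the format and function name are illustrative):

```python
def rewind_and_edit(history, bad_reply_index, amended_user_content):
    """Drop the hallucinated reply and everything after it, then replace
    the user message that triggered it with an amended version, so the
    bad tokens never remain in context."""
    trimmed = history[:bad_reply_index]
    trimmed[-1] = {"role": "user", "content": amended_user_content}
    return trimmed
```

Contrast with appending "that was wrong": here the hallucinated tokens are gone entirely, instead of sitting in the KV cache with a delayed negation.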