nielstron|13 days ago
Regarding the 4% improvement for human-written AGENTS.md: this would indeed be huge if it were a _consistent_ improvement. However, on Sonnet 4.5, for example, performance _drops_ by over 2%. Qwen3 benefits most, and GPT-5.2 improves by 1-2%.
The LLM-generated prompts follow the coding agent recommendations. We also show an ablation over different prompt types, and none have consistently better performance.
But ultimately I agree with your post. In fact, we do recommend writing a good AGENTS.md manually and in a targeted way. This is emphasized, for example, at the end of our abstract and conclusion.
vidarh|13 days ago
My use of CLAUDE.md is to get Claude to avoid making stupid mistakes that will require subsequent refactoring or cleanup passes.
Performance is not a consideration.
If anything, beyond CLAUDE.md I add agent harnesses that often increase the time and tokens used many times over, because my time is more expensive than the agents'.
_joel|13 days ago
[1] https://github.com/gsd-build/get-shit-done
yorwba|13 days ago
sdenton4|13 days ago
bee_rider|13 days ago
regularfry|13 days ago
Ok, so that's interesting in itself. Apologies if you go into this in the paper (I haven't had time to read it yet), but does this tell us something about the models themselves? Is there a benchmark lurking here? It feels like this is revealing something about the training, but I'm not sure exactly what.
nielstron|13 days ago
deaux|13 days ago
> The LLM-generated prompts follow the coding agent recommendations. We also show an ablation over different prompt types, and none have consistently better performance.
I think the LLM-generated AGENTS.md files that coding agents recommend are, almost without exception, really bad, because an AGENTS.md, to perform well, needs to point out the _non_-obvious. Every single LLM-generated AGENTS.md I've seen - including ones from certain vendors who at one point shipped automatic AGENTS.md generation out of the box - wrote about the obvious things! The literal opposite of what you want. Indeed a complete and utter waste of tokens that does nothing but induce context rot.
I believe this is because creating a good one takes a massive amount of resources and some engineering for any non-trivial codebase. You'd need multiple full-context iterations and a large number of thinking tokens.
On top of that, and I've said this elsewhere, most of the best stuff to put in AGENTS.md is things that can't be inferred from the repo: answers to questions like "Is this intentional?", "Why is this the case?" and so on. Obviously, neither the LLM nor a new-to-the-project human could know these things or add them to the file. And the gains from this are also hard to capture with your performance metric, because they're not really about solving issues; they're often about direction, or about the how rather than the what.
As for the extra tokens, the right AGENTS.md can save lots of tokens, but it requires thinking hard about what goes in it. Which system/business logic would take the agent 5 different file reads to properly understand, but could be summarized in 3 sentences?
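A rough way to put numbers on that trade-off, as a back-of-the-envelope sketch rather than anything measured: it assumes the AGENTS.md summary is loaded in every session, and that without it the agent would re-read the relevant files in each session that actually needs the knowledge. The function names and all token counts below are hypothetical.

```python
def break_even_usage_rate(summary_tokens: int, exploration_tokens: int) -> float:
    """Fraction of sessions that must need the knowledge for the summary to pay off.

    summary_tokens: cost of the AGENTS.md summary, paid in every session (assumption).
    exploration_tokens: cost of the file reads the summary replaces, paid only
        in sessions that need the knowledge (assumption).
    """
    if summary_tokens < 0 or exploration_tokens <= 0:
        raise ValueError("token counts must be positive")
    return summary_tokens / exploration_tokens


def net_tokens_saved(summary_tokens: int, exploration_tokens: int,
                     sessions: int, usage_rate: float) -> float:
    """Tokens saved (positive) or wasted (negative) over a number of sessions."""
    always_paid = summary_tokens * sessions
    avoided = exploration_tokens * sessions * usage_rate
    return avoided - always_paid


if __name__ == "__main__":
    # Hypothetical numbers: a 3-sentence summary vs. 5 file reads.
    summary = 100          # tokens for the summary (made up)
    exploration = 5 * 1500 # 5 files at roughly 1500 tokens each (made up)

    rate = break_even_usage_rate(summary, exploration)
    print(f"summary pays off if more than {rate:.1%} of sessions need it")
    print("net tokens saved over 200 sessions at 10% usage:",
          net_tokens_saved(summary, exploration, sessions=200, usage_rate=0.1))
```

Under these assumptions a short summary of expensive-to-discover logic pays for itself even if only a small share of sessions touch it, while a long write-up of things the agent could find in one file read does not. The model ignores any effect on output quality, which is the part the comment above argues matters most.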
nielstron|13 days ago
Note that by "different prompt types" I mean different types of meta-prompts used to generate the AGENTS.md. All of these are quite useless. Some additional experiments not in the paper showed that other automated approaches are also useless ("memory"-creating methods, broadly speaking).