I read the study. I think it does the opposite of what the authors suggest - it's actually vouching for good AGENTS.md files.
> Surprisingly, we observe that developer-provided files only marginally improve performance compared to omitting them entirely (an increase of 4% on average), while LLM-generated context files have a small negative effect on agent performance (a decrease of 3% on average).
This "surprisingly", and the framing seems misplaced.
For the developer-made ones: 4% improvement is massive! 4% improvement from a simple markdown file means it's a must-have.
> while LLM-generated context files have a small negative effect on agent performance (a decrease of 3% on average)
This should really be "while the prompts used to generate AGENTS files in our dataset...". It's a proxy for the prompts; who knows whether files generated from a better prompt would show an improvement.
The biggest use case for AGENTS.md files is domain knowledge that the model is not aware of and cannot instantly infer from the project. That is gained slowly over time, from seeing the agents struggle due to this deficiency. It's exactly the kind of thing that is very common in closed-source codebases, yet incredibly rare in public GitHub projects that have an AGENTS.md file - the huge majority of which are recent, small, vibecoded projects centered around LLMs. If 4% gains are seen on the latter kind of project, which will have very mixed AGENTS file quality in the first place, then for bigger projects with high-quality .md's they're invaluable when working with agents.
Regarding the 4% improvement for human written AGENTS.md: this would be huge indeed if it were a _consistent_ improvement. However, for example on Sonnet 4.5, performance _drops_ by over 2%. Qwen3 benefits most and GPT-5.2 improves by 1-2%.
The LLM-generated prompts follow the coding agent recommendations. We also show an ablation over different prompt types, and none have consistently better performance.
But ultimately I agree with your post. In fact we do recommend writing good AGENTS.md files, manually and in a targeted way. This is emphasized, for example, at the end of our abstract and conclusion.
In Theory There Is No Difference Between Theory and Practice, While In Practice There Is.
In large projects, having a specific AGENTS.md makes the difference between the agent spending half of its context window searching for the right commands, navigating the repo, understanding what is what, etc., and being extremely useful. The larger the repository, the more things it needs to be aware of and the more important the AGENTS.md is. At least that's what I have observed in practice.
> The biggest usecase for AGENTS.md files is domain knowledge that the model is not aware of and cannot instantly infer from the project. That is gained slowly over time from seeing the agents struggle due to this deficiency.
This. I have Claude write about the codebase because I get tired of it grepping files constantly. I'd rather it just know “these files are for x, these files have y methods”, and I even have it break down larger files so they fit in the context window several times over.
Funnily enough this makes it easier for humans to parse.
This reads a lot like bargaining stage. If agentic AI makes me a 10 times more productive developer, surely a 4% improvement is barely worth the token cost.
Well said. And it's potentially a 7% swing when you think about it — +4% with good human-written context vs. -3% with LLM-generated noise. That's a significant delta from just the quality of the information.
The real value is exactly what you described: the tribal knowledge, the "we tried X and it broke because Y", the constraints that live in someone's head and nowhere in the code. LLM-generated files miss this because the LLM is just restating what it can already see. Of course that doesn't help.
Honestly, the more research papers I read, the more suspicious I become. This "surprisingly" and other hyperbole is just there to make reviewers think the authors did something interesting/exciting. But the more "surprises" there are in a paper, the more suspicious of it I am. Such hyperbole ought at best to be ignored; at worst, the exact opposite of the claim needs to be examined.
It seems like the best students/people eventually end up doing CS research in their spare time while working as engineers. This is not the case for many other disciplines, where you need e.g. a lab to do research. But in CS, you can just do it from your basement, all you need is a laptop.
This is why I only add information to AGENTS.md when the agent has failed at a task. Then, once I've added the information, I revert the desired changes, re-run the task, and see if the output has improved. That way, I can have more confidence that AGENTS.md has actually improved coding agent success, at least with the given model and agent harness.
I do not do this for all repos, but I do it for the repos where I know that other developers will attempt very similar tasks, and I want them to be successful.
That's a sensible approach, but it still won't give you 100% confidence. These tools produce different output even when given the same context and prompt. You can't really be certain that the output difference is due to isolating any single variable.
Agree. I've also found that a rule-discovery approach like this performs better. It's like teaching a student: they probably already perform well on some task, and if we feed in an extra rule they're already well versed in, it can hinder their creativity.
When an agent just plows ahead with a wrong interpretation or understanding of something, I like to ask them why they didn't stop to ask for clarification.
Just a few days ago, while refactoring minor stuff, I had an agent replace all sqlite-related code in that codebase with MariaDB-based code.
Asked why that happened, the answer was that there was a confusion about MariaDB vs. sqlite because the code in question is dealing with, among other things, MariaDB Docker containers. So the word MariaDB pops up a few times in code and comments.
I then asked if there is anything I could do to prevent misinterpretations from producing wild results like this.
So I got the advice to put an instruction in AGENTS.md that would urge agents to ask for clarification before proceeding.
But I didn't add it. Out of the 25 lines of my AGENTS.md, many are already variations of that.
The first three:
- Do not try to fill gaps in your knowledge with overzealous assumptions.
- When in doubt: Slow down, double-check context, and only touch what was explicitly asked for.
- If a task seems to require extra changes, pause and ask before proceeding.
If these are not enough to prevent stuff like that, I don't know what could.
Are agents actually capable of answering why they did things? An LLM can review the previous context, add your question about why it did something, and then use next token prediction to generate an answer. But is that answer actually why the agent did what it did?
Isn't that question a category error? The "why" the agent did that is that it was the token that best matched the probability distribution of the context and the most recent output (modulo a bit of randomness). The response to that question will, again, be the tokens that best match the probability distribution of the context (now including the "why?" question and the previous failed attempt).
Just this morning I ran across an even narrower case of how AGENTS.md (in this case with GPT-5.3 Codex) can be completely ignored, even when filled with explicit instructions.
I have a line there that says Codex should never use Node APIs where Bun APIs exist for the same thing. Routinely, Claude Code and now Codex would ignore this.
I just replaced that rule with a TypeScript-compiler-powered AST based deterministic rule. Now the agent can attempt to commit code with banned Node API usage and the pre-commit script will fail, so it is forced to get it right.
I've found myself migrating more and more of my AGENTS.md instructions to compiler-based checks like these - where possible. I feel as though this shouldn't be needed if the models were good, but it seems to be and I guess the deterministic nature of these checks is better than relying on the LLM's questionable respect of the rules.
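The commenter's actual script isn't shown, but the technique is easy to sketch with the TypeScript compiler API (this assumes the `typescript` package is installed; the banned-module list and file names are illustrative, not the commenter's real rule set):

```typescript
// Minimal sketch of an AST-based "no Node APIs where Bun equivalents exist"
// check. The banned list here is illustrative only.
import * as ts from "typescript";

const BANNED = new Set(["fs", "node:fs", "child_process", "node:child_process"]);

function findBannedImports(fileName: string, source: string): string[] {
  // setParentNodes=true so node.getStart() can resolve positions.
  const sf = ts.createSourceFile(fileName, source, ts.ScriptTarget.Latest, true);
  const hits: string[] = [];
  const visit = (node: ts.Node): void => {
    // Catches `import ... from "fs"`-style declarations.
    if (ts.isImportDeclaration(node) && ts.isStringLiteral(node.moduleSpecifier)) {
      if (BANNED.has(node.moduleSpecifier.text)) {
        const { line } = sf.getLineAndCharacterOfPosition(node.getStart());
        hits.push(`${fileName}:${line + 1}: banned import "${node.moduleSpecifier.text}"`);
      }
    }
    ts.forEachChild(node, visit);
  };
  visit(sf);
  return hits;
}

// Example input: one banned Node import, one allowed Bun import.
const sample = 'import { readFileSync } from "node:fs";\nimport { serve } from "bun";\n';
const hits = findBannedImports("sample.ts", sample);
console.log(hits);
```

A pre-commit hook would run this over the staged files and exit non-zero whenever `hits` is non-empty, which is what makes the rule deterministic in a way an AGENTS.md line is not.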
I really hate that the anthropomorphizing of these systems has successfully taken hold in people's brains. Asking it why it did something is completely useless because you aren't interrogating a person with a memory or a rationale, you’re querying a statistical model that is spitting out a justification for a past state it no longer occupies.
Even the "thinking" blocks in newer models are an illusion. There is no functional difference between the text in a thought block and the final answer. To the model, they are just more tokens in a linear sequence. It isn't "thinking" before it speaks, the "thought" is the speech.
Treating those thoughts as internal reflection of some kind is a category error. There is no "privileged" layer of reasoning happening in the silicon that then gets translated into the thought block. It’s a specialized output where the model is forced to show its work because that process of feeding its own generated strings back into its context window statistically increases the probability of a correct result. The chatbot providers just package this in a neat little window to make the model's "thinking" part of the gimmick.
I also wouldn't be surprised if asking it things like this was actually counterproductive, though here I'm going off vibes. The logic is that by asking, you're poisoning the context, similar to how, if you try to generate an image by saying "it should not have a crocodile in it", you get a crocodile in the image. By asking the model why it did something wrong, it treats the wrong thing as ground truth, and every future generation has that snippet in context, nudging the output so that the wrong thing itself influences it to keep doing the wrong thing more and more.
It seems like LLMs in general still have a very hard time with the concepts of "doubt" and "uncertainty". In the early days this was very visible in the form of hallucinations, but it feels like they fixed that mostly by having better internal fact-checking. The underlying problem of treating assumptions as truth is still there, just hidden better.
My personal experience is that it’s worthwhile to put instructions, user-manual style, into the context. These are things like:
- How to build.
- How to run tests.
- How to work around the incredible crappiness of the codex-rs sandbox.
I also like to put in basic style-guide things like “the minimum Python version is 3.12.” Sadly I seem to also need “if you find yourself writing TypeVar, think again” because (unscientifically) it seems that putting the actual keyword that the agent should try not to use makes it more likely to remember the instructions.
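Pulled together, such a file might look something like this (the build commands are placeholders, not the commenter's actual setup):

```markdown
# AGENTS.md

## Build & test
- Build: `make build`
- Run tests: `make test`

## Style
- The minimum Python version is 3.12: prefer PEP 695 `type` aliases over `TypeVar`.
- If you find yourself writing `TypeVar`, think again.
```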
I also try to avoid negative instructions. No scientific proof, just a feeling, same as you: "do not delete the tmp file" too often leads to the tmp file being deleted.
I have also felt like these kinds of efforts at instructions and agent files were worthwhile, but I am increasingly of the opinion that such feelings are self-delusion: seeing what I expect to see, aided by a tool that always agrees with my (or its own) take on utility. The AGENTS.md file looks like it'd work, it looks how you'd expect, but then it fails over and over. And the process of tweaking it is pleasant chatting, full of supportive supposed insights and solutions, which means hours of fiddling with meta-documentation without clear rewards, because adherence is only ever partial.
The paper's conclusions align with my personal experiments at managing a small knowledge base with LLM rules. The application of rules was inconsistent, the execution of them fickle, and fundamental changes in processing would happen from week to week as the model usage was tweaked. But rule tweaking always felt good. The LLM said it would work better, the LLM said it had read and understood the instructions, and the LLM said it would apply them… I felt like I understood how best to deliver data to the LLMs, only to see recurrent failures.
LLMs lie. They have no idea, no data, and no insight into specific areas, but they'll produce pleasant, reality-adjacent fiction. Since chatting is seductive, and our sense of time is affected by talking, I think the normal time-versus-productivity sense gets pulled even further out of whack. Devs are notoriously bad at estimating where their time goes, and long feedback loops filled with phone time and slow-ass conversation don't help.
Quite a surprising result: “across multiple coding agents and LLMs, we find that context files tend to reduce task success rates compared to providing no repository context, while also increasing inference cost by over 20%.”
Yesterday, while adding some nitpicks to a CLAUDE.md/AGENTS.md file, I thought "this file could be renamed CONTRIBUTING.md and be done with it".
Maybe I'm wrong, but it sure feels like we might soon drop all of this extra cruft for more rational practices.
Exactly, it's the same documentation any contributor would need, just actually up-to-date and pared down to the essentials because it's "tested" continuously. If I were starting out on a new codebase, AGENTS.md is the first place I'd look to get my bearings.
This paper validates what we've been building toward. The core issue isn't the idea of context files — it's that prose is the wrong format for structured facts.
AI crushes structured data like package.json but struggles with free-form markdown. Two developers describe the same repo completely differently. There's no schema, no validation, no scoring.
Our paper on CERN's Zenodo proposes FAF — a structured YAML format (IANA-registered as application/vnd.faf+yaml) that replaces prose with validated fields. One .faf file generates native outputs for CLAUDE.md, AGENTS.md, .cursorrules, and GEMINI.md. The instruction files stay — they just sit on top of a structured foundation instead of floating independently.
LLMs are generally bad at writing non-noisy prompts and instructions. It's better to have it write instructions post hoc. For instance, I paste this prompt into the end of most conversations:
If there’s a nugget of knowledge learned at any point in this conversation (not limited to the most recent exchange), please tersely update AGENTS.md so future agents can access it. If nothing durable was learned, no changes are needed. Do not add memories just to add memories.
Update AGENTS.md **only** if you learned a durable, generalizable lesson about how to work in this repo (e.g., a principle, process, debugging heuristic, or coding convention). Do **not** add bug- or component-specific notes (for example, “set .foo color in bar.css”) unless they reflect a broader rule.
If the lesson cannot be stated without referencing a specific selector or file, skip the memory and make no changes. Keep it to **one short bullet** under an appropriate existing section, or add a new short section only if absolutely necessary.
It rarely creates rules, but when it does, the rules it adds positively affect behavior. This works very well.
Another common mistake is to have very long AGENTS.md files. The file should not be long. If it's longer than 200 lines, you're certainly doing it wrong.
> If nothing durable was learned, no changes are needed.
Off topic, but oh my god if you don't do this, it will always do the thing you conditionally requested it to do. Not sure what to call this but it's my one big annoyance with LLMs.
It's like going to a sub shop and asking for just a tiny bit of extra mayo and they heap it on.
I'd be interested to see results with Opus 4.6 or 4.5
Also, I bet the quality of these docs varies widely across both human- and AI-generated ones. Good AGENTS.md files should have progressive disclosure, so only the items required by the task are pulled in (e.g. for DB schema related topics, see such and such a file).
Then there's the choice of pulling things into AGENTS.md vs. skills, which the article doesn't explore.
I do feel for the authors, since the article already feels old. The models and tooling around them are changing very quickly.
Agree that progressive disclosure is fantastic, but
> (e.g. for DB schema related topics, see such and such a file).
Rather than doing this, put another AGENTS.md file in a DB-related subfolder. It will be automatically pulled into context when the agent reads any files in that folder. This is supported out of the box by any agent worth its salt, including OpenCode and CC.
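For instance, a layout like this (names illustrative) keeps DB-specific guidance out of the root file:

```
repo/
├── AGENTS.md        # repo-wide conventions: build, test, style
└── db/
    ├── AGENTS.md    # migration tool, schema conventions
    └── schema.sql
```

The nested file only enters context once the agent reads files under db/, so the root file stays short.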
IMO static instructions referring an LLM to other files are an anti-pattern, at least with current models. This is a flaw of the skills spec, which refers to creating a "references" folder and such. I think initial skills demos from Anthropic also showed this. This doesn't work.
Progressive disclosure is good for reducing context usage but it also reduces the benefit of token caching. It might be a toss-up, given this research result.
Any well-maintained project should already have a CONTRIBUTING.md that has good information for both humans and agents.
Sometimes I actually start my sessions like this "please read the contributing.md file to understand how to build/test this project before making any code changes"
I only put things when the LLM gets something wrong and I need to correct it. Like “no, we create db migrations using this tool” kind of corrections. So far it made them behave correctly in those situations.
Their definition of context excludes prescriptive specs/requirements files. They are only talking about a file that summarizes what exists in the codebase, which is information that's otherwise discoverable by the agent through CLI (ripgrep, etc), and it's been trained to do that as efficiently as possible.
Also important to note that human-written context did help according to them, if only a little bit.
Effectively what they're saying is that inputting an LLM generated summary of the codebase didn't help the agent. Which isn't that surprising.
I find it surprising. The piece of code I'm working on is about 10k LoC to define the basic structures and functionality and I found Claude Code would systematically spend significant time and tokens exploring it to add even basic functionality. Part of the issue is this deals with a problem domain LLMs don't seem to be very well trained on, so they have to take it all in, they don't seem to know what to look for in advance.
I went through a couple of iterations of the CLAUDE.md file, first describing the problem domain and library intent (that helped target search better as it had keywords to go by; note a domain-trained human would know these in advance from the three words that comprise the library folder name) and finally adding a concise per-function doc of all the most frequently used bits. I find I can launch CC on a simple task now, without it spending minutes reading the codebase before getting started.
Hey, a paper author here :)
I agree: if you know LLMs well, it shouldn't be too surprising that autogenerated context files don't help - yet this is the default recommendation from major AI companies, which is what we wanted to scrutinize.
> Their definition of context excludes prescriptive specs/requirements files.
Can you explain a bit what you mean here? If the context file specifies a desired behavior, we do check whether the LLM follows it, and this seems generally to work (Section 4.3).
- Don't state the obvious: I wouldn't hand a senior human dev a copy of "Clean Code" before every ticket and expect them to work faster.
- File vs. Prompt is a false dichotomy: The paper treats "Context Files" as a separate entity, but technically, an AGENTS.md is just a system prompt injection. The mechanism is identical. The study isn't proving that "files are bad," it's proving that "context stuffing" is bad. Whether I paste the rules manually or load them via a file, the transformer sees the same tokens.
- Latent vs. Inferable Knowledge: This is the key missing variable. If I remove context files, my agents fail at tasks requiring specific process knowledge - like enforcing strict TDD or using internal wrapper APIs that aren't obvious from public docs. The agent can't "guess" our security protocols or architectural constraints. That's not a performance drag; it's a requirement. The paper seems to conflate "adding noise" with "adding constraints."
each role owns specific files. no overlap means zero merge conflicts across 1800+ autonomous PRs. planning happens in `.sys/plans/{role}/` as written contracts before execution starts. time is the mutex.
AGENTS.md defines the vision. agents read the gap between vision and reality, then pull toward it. no manager, no orchestration.
agents ship features autonomously. 90% of PRs are zero human in the loop. the one pain point is refactors. cross-cutting changes don't map cleanly to single-role ownership
AGENTS.md works when it encodes constraints that eliminate coordination. if it's just a roadmap, it won't help much.
I'd take any paper like this with a grain of salt. I imagine what holds true for models in time period X could be drastically different given just a little more time.
Doesn't mean it's not worth studying this kind of stuff, but this conclusion is already so "old" that it's hard to say it's valid anymore with the latest batch of models.
I use AGENTS.md daily for my personal AI setup. The biggest win is giving the agent project-specific context — things like deployment targets, coding conventions, and what not to do. Without it, the agent makes generic assumptions that waste time.
In my experience AGENTS.md files only save a bit of time, they don't meaningfully improve success. Agents are smart enough to figure stuff out on their own, but you can save a few tool calls and a bit of context by telling them how to build your project or what directories do what rather than letting it stumble its way there.
What is the purpose of an AGENTS.md file when there are so many different models? Which model or version of the model is the file written for? So much depends on assumptions here. It only makes sense when you know exactly which model you are writing for. No wonder the impact is 'all over the place'.
Many of the practices in this field are mostly based on feelings and wishful thinking, rather than any demonstrable benefit. Part of the problem is that the tools are practically nondeterministic, and their results can't be compared reliably.
The other part is fueled by brand recognition and promotion, since everyone wants to make their own contribution with the least amount of effort, and coming up with silly Markdown formats is an easy way to do that.
EDIT: It's amusing how sensitive the blue-pilled crowd is when confronted with reality. :)
If I understand the paper correctly, the researchers found that AGENTS.md context files caused the LLMs to burn through more tokens as they parsed and followed the instructions, but they did not find a large change in the success rate (defined by "the PR passes the existing unit tests in the repo").
What wasn't measured, probably because it's almost impossible to quantify, was the quality of the code produced. Did the context files help the LLMs produce code that matched the style of the rest of the project? Did the code produced end up reasonably maintainable in the long run, or was it slop that increased long-term tech debt? These are important questions, but as they are extremely difficult to assign numbers to and measure in an automated way, the paper didn't attempt to answer them.
The only thing I use CLAUDE.md for is explaining the purpose and general high level design principles of the project so I don't have to waste my time reiterating this every time I clear the context. Things like this is a file manager, the deliverable must always be a zipapp, Wayland will never be supported.
I added these to that file because otherwise I will have to tell claude these things myself, repeatedly. But the science says... Respectfully, blow it out your ass.
Research has shown that most earlier "techniques" to get better LLM response no longer work and are actively harmful with modern models. I'm so grateful that there's actual studies and papers about this and that they keep coming out. Software developers are super cargo culty and will do whatever the next guy does (and that includes doing whatever is suggested in research papers)
Software developers don't have to be cargo-culty... if they're working on systems that are well-documented or are open-source (or at least source-available) so that you can actually dig in to find out how the system works.
But with LLMs, the internals are not well-documented, most are not open-source (and even if the model and weights are open-source, it's impossible for a human to read a grid of numbers and understand exactly how it will change its output for a given input), and there's also an element of randomness inherent to how the LLM behaves.
Given that fact, it's not surprising to find that developers trying to use LLMs end up adding certain inputs out of what amounts to superstition ("it seems to work better when I tell it to think before coding, so let's add that instruction and hopefully it'll help avoid bad code" but there's very little way to be sure that it did anything). It honestly reminds me of gambling fallacies, e.g. tabletop RPG players who have their "lucky" die that they bring out for important rolls. There's insufficient input to be sure that this line, which you add to all your prompts by putting it in AGENTS.md, is doing anything — but it makes you feel better to have it in there.
(None of which is intended as a criticism, BTW: that's just what you have to do when using an opaque, partly-random tool).
Most of these AI-guiding "techniques" seem more like reading into tea leaves to me than anything actually useful.
Even with the latest and greatest (because I know people will reflexively immediately jump down my throat if I don't specify that, yes, I've used Opus 4.6 and Gemini 3 Pro etc. etc. etc. etc., I have access to all of the models by way of work and use them regularly), my experience has been that it's basically a crapshoot that it'll listen to a single one of these files, especially in the long run with large chats. The amount of times I still have to tell these things to not generate React in my Vue codebase that has literally not a single line of JSX anywhere and instructions in every single possible file I can put it in to NOT GENERATE FUCKING REACT CODE makes me want to blow my brains out every time it happens. In fact it happened to me today with the supposed super intelligence known as Opus 4.6 that has 18 trillion TB of context or whatever in a fresh chat when I asked for a quick snippet I needed to experiment with.
I'm not even paying for this crap (work is) and I still feel scammed approximately half the time, and can't help but think all of these suggestions are just ways to inflate token usage and to move you into the usage limit territory faster.
Claude/Opus 4.6: "Can you add a console.log in foo XYZ?" No problem: x agents and close to a million tokens used to add one line of code.
Gemini 3: "Can you review commit A (the console.log one)?" "You have made the most significant change in your 200kloc codebase; this key change will give you great insight into your software."
Codex: "I have reviewed your change; you are missing tests and integration tests."
But I fully agree; overall I feel there are a lot of tea-leaf readers online and on LinkedIn.
What are you putting in the file? When I’ve looked at them they just looked like a second readme file without the promotional material in a typical GitHub readme.
deaux|12 days ago
> Surprisingly, we observe that developer-provided files only marginally improve performance compared to omitting them entirely (an increase of 4% on average), while LLM- generated context files have a small negative effect on agent performance (a decrease of 3% on average).
This "surprisingly", and the framing seems misplaced.
For the developer-made ones: 4% improvement is massive! 4% improvement from a simple markdown file means it's a must-have.
> while LLM- generated context files have a small negative effect on agent performance (a decrease of 3% on average)
This should really be "while the prompts used to generate AGENTS files in our dataset..". It's a proxy for prompts, who knows if the ones generated through a better prompt show improvement.
The biggest usecase for AGENTS.md files is domain knowledge that the model is not aware of and cannot instantly infer from the project. That is gained slowly over time from seeing the agents struggle due to this deficiency. Exactly the kind of thing very common in closed-source, yet incredibly rare in public Github projects that have an AGENTS.md file - the huge majority of which are recent small vibecoded projects centered around LLMs. If 4% gains are seen on the latter kind of project, which will have a very mixed quality of AGENTS files in the first place, then for bigger projects with high-quality .md's they're invaluable when working with agents.
nielstron|12 days ago
Regarding the 4% improvement for human written AGENTS.md: this would be huge indeed if it were a _consistent_ improvement. However, for example on Sonnet 4.5, performance _drops_ by over 2%. Qwen3 benefits most and GPT-5.2 improves by 1-2%.
The LLM-generated prompts follow the coding agent recommendations. We also show an ablation over different prompt types, and none have consistently better performance.
But ultimately I agree with your post. In fact we do recommend writing good AGENTS.md, manually and targetedly. This is emphasized for example at the end of our abstract and conclusion.
SerCe|12 days ago
In large projects, having a specific AGENTS.md makes the difference between the agent spending half of its context window searching for the right commands, navigating the repo, understanding what is what, etc., and being extremely useful. The larger the repository, the more things it needs to be aware of and the more important the AGENTS.md is. At least that's what I have observed in practice.
giancarlostoro|12 days ago
This. I have Claude write about the codebase because I get tired of it grepping files constantly. I rather it just know “these files are for x, these files have y methods” and I even have it breakdown larger files so it fits the entire context window several times over.
Funnily enough this makes it easier for humans to parse.
bootsmann|12 days ago
wolfejam|4 days ago
The real value is exactly what you described: the tribal knowledge, the "we tried X and it broke because Y", the constraints that live in someone's head and nowhere in the code. LLM-generated files miss this because the LLM is just restating what it can already see. Of course that doesn't help.
zero_k|12 days ago
It seems like the best students/people eventually end up doing CS research in their spare time while working as engineers. This is not the case for many other disciplines, where you need e.g. a lab to do research. But in CS, you can just do it from your basement, all you need is a laptop.
pgt|12 days ago
pamelafox|13 days ago
I do not do this for all repos, but I do it for the repos where I know that other developers will attempt very similar tasks, and I want them to be successful.
viraptor|12 days ago
imiric|12 days ago
averrous|13 days ago
avhception|12 days ago
I then asked if there is anything I could do to prevent misinterpretations from producing wild results like this. So I got the advice to put an instruction in AGENTS.md that would urge agents to ask for clarification before proceeding. But I didn't add it. Out of the 25 lines of my AGENTS.md, many are already variations of that. The first three:
- Do not try to fill gaps in your knowledge with overzealous assumptions.
- When in doubt: Slow down, double-check context, and only touch what was explicitly asked for.
- If a task seems to require extra changes, pause and ask before proceeding.
If these are not enough to prevent stuff like that, I don't know what could.
tomashubelbauer|12 days ago
I have a line there that says Codex should never use Node APIs where Bun APIs exist for the same thing. Routinely, Claude Code and now Codex would ignore this.
I just replaced that rule with a TypeScript-compiler-powered AST based deterministic rule. Now the agent can attempt to commit code with banned Node API usage and the pre-commit script will fail, so it is forced to get it right.
I've found myself migrating more and more of my AGENTS.md instructions to compiler-based checks like these - where possible. I feel as though this shouldn't be needed if the models were good, but it seems to be and I guess the deterministic nature of these checks is better than relying on the LLM's questionable respect of the rules.
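A minimal sketch of that kind of deterministic check, assuming the `typescript` compiler package is available; the `BANNED` list, function name, and file name are illustrative, not the commenter's actual rule set:

```typescript
// Hypothetical pre-commit check: parse a file with the TypeScript
// compiler API and flag imports of Node built-ins that have Bun
// equivalents. A hook can run this over staged files and exit
// non-zero on any hit.
import * as ts from "typescript";

// Illustrative ban list; a real rule set would be project-specific.
const BANNED = new Set(["fs", "node:fs", "path", "node:path"]);

export function findBannedImports(source: string): string[] {
  const sf = ts.createSourceFile(
    "check.ts", // virtual file name, only used for diagnostics
    source,
    ts.ScriptTarget.Latest,
    /* setParentNodes */ true
  );
  const hits: string[] = [];
  const visit = (node: ts.Node): void => {
    // Top-level and nested `import ... from "<specifier>"` declarations.
    if (
      ts.isImportDeclaration(node) &&
      ts.isStringLiteral(node.moduleSpecifier) &&
      BANNED.has(node.moduleSpecifier.text)
    ) {
      hits.push(node.moduleSpecifier.text);
    }
    ts.forEachChild(node, visit);
  };
  visit(sf);
  return hits;
}
```

Unlike an AGENTS.md instruction, the check either passes or fails, so the agent gets a concrete error to react to instead of a rule it may silently ignore.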
geraneum|12 days ago
You may want to ask the next LLM versions the same question after they feed this paper through training.
sensanaty|12 days ago
Even the "thinking" blocks in newer models are an illusion. There is no functional difference between the text in a thought block and the final answer. To the model, they are just more tokens in a linear sequence. It isn't "thinking" before it speaks, the "thought" is the speech.
Treating those thoughts as internal reflection of some kind is a category error. There is no "privileged" layer of reasoning happening in the silicon that then gets translated into the thought block. It’s a specialized output where the model is forced to show its work because that process of feeding its own generated strings back into its context window statistically increases the probability of a correct result. The chatbot providers just package this in a neat little window to make the model's "thinking" part of the gimmick.
I also wouldn't be surprised if asking it stuff like this was actually counterproductive, but here I'm going off vibes. The logic being that by asking, you're poisoning the context, similar to how if you try to generate an image by saying "It should not have a crocodile in the image", it will put a crocodile into the image. By asking it why it did something wrong, it'll treat that as ground truth, and all future generation will have that snippet in context, nudging the output so that the wrong thing itself influences it to keep doing the wrong thing more and more.
delaminator|12 days ago
"You're absolutely correct. I should have checked my skills before doing that. I'll make sure I do it in the future."
amluto|13 days ago
- How to build.
- How to run tests.
- How to work around the incredible crappiness of the codex-rs sandbox.
I also like to put in basic style-guide things like “the minimum Python version is 3.12.” Sadly I seem to also need “if you find yourself writing TypeVar, think again” because (unscientifically) it seems that putting the actual keyword that the agent should try not to use makes it more likely to remember the instructions.
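Condensed into a file, that checklist might look something like this (commands and rule wording are illustrative, not the parent's actual file):

```markdown
## Build & test
- Build: `make build`
- Run the tests with `pytest -q` before every commit.

## Sandbox
- The sandbox blocks network access; vendored deps live in `third_party/`.

## Style
- The minimum Python version is 3.12; use modern syntax (e.g. PEP 695
  generics) rather than `typing` backports.
- If you find yourself writing `TypeVar`, think again.
```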
bonesss|12 days ago
The paper's conclusions align with my personal experiments at managing a small knowledge base with LLM rules. The application of rules was inconsistent, the execution of them fickle, and fundamental changes in processing would happen from week to week as the model usage was tweaked. But rule tweaking always felt good. The LLM said it would work better, and the LLM said it had read and understood the instructions, and the LLM said it would apply them… I felt like I understood how best to deliver data to the LLMs, only to see recurrent failures.
LLMs lie. They have no idea, no data, and no insights into specific areas, but they'll make pleasant reality-adjacent fiction. Since chatting is seductive, and our time sense is impacted by talking, I think the normal time-versus-productivity sense is pulled even further out of whack. Devs are notoriously bad at estimating where they're using time, and long feedback loops filled with phone time and slow-ass conversation don't help.
tartakovsky|13 days ago
Languages == Python only
Libraries (um looks like other LLM generated libraries -- I mean definitely not pure human: like Ragas, FastMCP, etc)
So seems like a highly skewed sample and who knows what can / can't be generalized. Does make for a compelling research paper though!
rmnclmnt|13 days ago
Maybe I’m wrong, but it sure feels like we might soon drop all of this extra cruft for more rational practices.
wolfejam|4 days ago
AI crushes structured data like package.json but struggles with free-form markdown. Two developers describe the same repo completely differently. There's no schema, no validation, no scoring.
Our paper on CERN's Zenodo proposes FAF — a structured YAML format (IANA-registered as application/vnd.faf+yaml) that replaces prose with validated fields. One .faf file generates native outputs for CLAUDE.md, AGENTS.md, .cursorrules, and GEMINI.md. The instruction files stay — they just sit on top of a structured foundation instead of floating independently.
Paper: https://zenodo.org/records/18251362
prodigycorp|12 days ago
Another common mistake is to have very long AGENTS.md files. The file should not be long. If it's longer than 200 lines, you're certainly doing it wrong.
joquarky|12 days ago
Off topic, but oh my god if you don't do this, it will always do the thing you conditionally requested it to do. Not sure what to call this but it's my one big annoyance with LLMs.
It's like going to a sub shop and asking for just a tiny bit of extra mayo and they heap it on.
pajtai|13 days ago
Also, I bet the quality of these docs vary widely across both human and AI generated ones. Good Agents.md files should have progressive disclosure so only the items required by the task are pulled in (e.g. for DB schema related topics, see such and such a file).
Then there's the choice of pulling things into Agents.md vs skills which the article doesn't explore.
I do feel for the authors, since the article already feels old. The models and tooling around them are changing very quickly.
deaux|12 days ago
> (e.g. for DB schema related topics, see such and such a file).
Rather than doing this, put another AGENTS.md file in a DB-related subfolder. It will be automatically pulled into context when the agent reads any files in that folder. This is supported out of the box by any agent worth its salt, including OpenCode and CC.
IMO static instructions referring an LLM to other files are an anti-pattern, at least with current models. This is a flaw of the skills spec, which refers to creating a "references" folder and such. I think initial skills demos from Anthropic also showed this. This doesn't work.
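The nested setup described above looks something like this (folder names are illustrative):

```
repo/
├── AGENTS.md        # project-wide conventions
└── db/
    ├── AGENTS.md    # schema and migration notes, pulled into context
    │                # automatically when the agent touches files here
    └── migrations/
```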
kkapelon|12 days ago
Any well-maintained project should already have a CONTRIBUTING.md that has good information for both humans and agents.
Sometimes I actually start my sessions like this "please read the contributing.md file to understand how to build/test this project before making any code changes"
benreesman|12 days ago
Think of the agent app store people's children man, it would be a sad Christmas.
energy123|13 days ago
Also important to note that human-written context did help according to them, if only a little bit.
Effectively what they're saying is that inputting an LLM generated summary of the codebase didn't help the agent. Which isn't that surprising.
MITSardine|12 days ago
I went through a couple of iterations of the CLAUDE.md file, first describing the problem domain and library intent (that helped target search better as it had keywords to go by; note a domain-trained human would know these in advance from the three words that comprise the library folder name) and finally adding a concise per-function doc of all the most frequently used bits. I find I can launch CC on a simple task now, without it spending minutes reading the codebase before getting started.
nielstron|13 days ago
> Their definition of context excludes prescriptive specs/requirements files.
Can you explain a bit what you mean here? If the context file specifies a desired behavior, we do check whether the LLM follows it, and this seems generally to work (Section 4.3).
fwystup|11 days ago
- Don't state the obvious: I wouldn't hand a senior human dev a copy of "Clean Code" before every ticket and expect them to work faster.
- File vs. Prompt is a false dichotomy: The paper treats "Context Files" as a separate entity, but technically, an AGENTS.md is just a system prompt injection. The mechanism is identical. The study isn't proving that "files are bad," it's proving that "context stuffing" is bad. Whether I paste the rules manually or load them via a file, the transformer sees the same tokens.
- Latent vs. Inferable Knowledge: This is the key missing variable. If I remove context files, my agents fail at tasks requiring specific process knowledge - like enforcing strict TDD or using internal wrapper APIs that aren't obvious from public docs. The agent can't "guess" our security protocols or architectural constraints. That's not a performance drag; it's a requirement. The paper seems to conflate "adding noise" with "adding constraints."
GBintz|12 days ago
each role owns specific files. no overlap means zero merge conflicts across 1800+ autonomous PRs. planning happens in `.sys/plans/{role}/` as written contracts before execution starts. time is the mutex.
AGENTS.md defines the vision. agents read the gap between vision and reality, then pull toward it. no manager, no orchestration.
we wrote about it here: https://agnt.one/blog/black-hole-architecture
agents ship features autonomously. 90% of PRs are zero human in the loop. the one pain point is refactors. cross-cutting changes don't map cleanly to single-role ownership
AGENTS.md works when it encodes constraints that eliminate coordination. if it's just a roadmap, it won't help much.
bavell|12 days ago
"The system does not assign tasks.
It defines gravity."
Helios looks cool though!
theLiminator|13 days ago
Doesn't mean it's not worth studying this kind of stuff, but this conclusion is already so "old" that it's hard to say it's valid anymore with the latest batch of models.
4b11b4|12 days ago
https://github.com/ash-project/usage_rules
imiric|13 days ago
The other part is fueled by brand recognition and promotion, since everyone wants to make their own contribution with the least amount of effort, and coming up with silly Markdown formats is an easy way to do that.
EDIT: It's amusing how sensitive the blue-pilled crowd is when confronted with reality. :)
rmunn|12 days ago
What wasn't measured, probably because it's almost impossible to quantify, was the quality of the code produced. Did the context files help the LLMs produce code that matched the style of the rest of the project? Did the code produced end up reasonably maintainable in the long run, or was it slop that increased long-term tech debt? These are important questions, but as they are extremely difficult to assign numbers to and measure in an automated way, the paper didn't attempt to answer them.
mikkupikku|12 days ago
I added these to that file because otherwise I will have to tell claude these things myself, repeatedly. But the science says... Respectfully, blow it out your ass.
rmunn|12 days ago
But with LLMs, the internals are not well-documented, most are not open-source (and even if the model and weights are open-source, it's impossible for a human to read a grid of numbers and understand exactly how it will change its output for a given input), and there's also an element of randomness inherent to how the LLM behaves.
Given that fact, it's not surprising to find that developers trying to use LLMs end up adding certain inputs out of what amounts to superstition ("it seems to work better when I tell it to think before coding, so let's add that instruction and hopefully it'll help avoid bad code" but there's very little way to be sure that it did anything). It honestly reminds me of gambling fallacies, e.g. tabletop RPG players who have their "lucky" die that they bring out for important rolls. There's insufficient input to be sure that this line, which you add to all your prompts by putting it in AGENTS.md, is doing anything — but it makes you feel better to have it in there.
(None of which is intended as a criticism, BTW: that's just what you have to do when using an opaque, partly-random tool).
sensanaty|12 days ago
Even with the latest and greatest (because I know people will reflexively immediately jump down my throat if I don't specify that, yes, I've used Opus 4.6 and Gemini 3 Pro etc. etc. etc. etc., I have access to all of the models by way of work and use them regularly), my experience has been that it's basically a crapshoot that it'll listen to a single one of these files, especially in the long run with large chats. The amount of times I still have to tell these things to not generate React in my Vue codebase that has literally not a single line of JSX anywhere and instructions in every single possible file I can put it in to NOT GENERATE FUCKING REACT CODE makes me want to blow my brains out every time it happens. In fact it happened to me today with the supposed super intelligence known as Opus 4.6 that has 18 trillion TB of context or whatever in a fresh chat when I asked for a quick snippet I needed to experiment with.
I'm not even paying for this crap (work is) and I still feel scammed approximately half the time, and can't help but think all of these suggestions are just ways to inflate token usage and to move you into the usage limit territory faster.
Foobar8568|12 days ago
No problem: X agents and hundreds of thousands, close to a million, tokens used to add a line of code.
Gemini 3, asked to review commit A (the console.log one): "You have made the most significant change in your 200kloc code base; this key change will allow you to get great insight into your software."
Codex: "I have reviewed your change; you are missing tests and integration tests."
But I fully agree; overall I feel there are a lot of tea-leaf readers online and on LinkedIn.