Models are not AGI. They are text generators forced to generate text in a way useful to trigger a harness that will produce effects, like editing files or calling tools.
So the model won’t “understand” that you have a skill and use it. The generation of the text that would trigger the skill usage is made via Reinforcement Learning with human generated examples and usage traces.
So why doesn’t the model use skills all the time? Because it’s a new thing, and there are not enough training samples displaying that behavior.
They also cannot enforce that via RL because skills use human language, which is ambiguous and not formal. Force it to use skills always via RL policy and you’ll make the model dumber.
So, right now, we are generating usage traces that will be used to train future models to get a better grasp of when to use skills and when not to. Just give it time.
AGENTS.md, on the other hand, is context. Models have been trained to follow context since the dawn of the thing.
> AGENTS.md, on the other hand, is context. Models have been trained to follow context since the dawn of the thing.
The skills frontmatter ends up in context as well.
If AGENTS.md outperforms skills in a given agent, it is down to specifically how the skills frontmatter is extracted and injected into the context, because that is the only difference between the two approaches.
EDIT: I haven't tried to check this so this is pure speculation, but I suppose there is the possibility that some agents might use a smaller model to selectively decide what skills frontmatter to include in context for a bigger model. E.g. you could imagine Claude passing the prompt + skills frontmatter to Haiku to selectively decide what to include before passing to Sonnet or Opus. In that case, depending on approach, putting it directly in AGENTS.md might simply be a question of what information is prioritised in the output passed to the full model. (Again: this is pure speculation about a possible approach, though it is one I'd test if I were to pick up writing my own coding agent again.)
But really the overall point is that AGENTS.md vs. skills here still is entirely a question of what ends up in the "raw" context/prompt that gets passed to the full model, so this is just nuance to my original answer with respect to possible ways that raw prompt could be composed.
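To make the "it is all just what ends up in the raw prompt" point concrete, here is a minimal sketch of how a harness might compose that prompt. Everything in it (the `Skill` type, `compose_prompt`, the section headings, the file paths) is hypothetical and not taken from any real agent; it only illustrates that AGENTS.md is inlined whole while skills contribute just their frontmatter descriptions.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    description: str  # frontmatter only; the skill body stays on disk
    path: str


def compose_prompt(base: str, agents_md: str, skills: list[Skill]) -> str:
    """Build the raw system prompt (hypothetical layout, not a real agent's)."""
    parts = [base]
    if agents_md:
        # AGENTS.md is inlined verbatim, so its content is always in context.
        parts.append("# Project instructions (AGENTS.md)\n" + agents_md)
    if skills:
        # Skills only announce themselves; the model must decide to load one.
        lines = [f"- {s.name}: {s.description} (read {s.path})" for s in skills]
        parts.append("# Available skills\n" + "\n".join(lines))
    return "\n\n".join(parts)


prompt = compose_prompt(
    "You are a coding agent.",
    "Docs index: routing -> docs/routing.md",
    [Skill("docs-lookup", "Use when the task touches framework APIs.",
           "skills/docs-lookup/SKILL.md")],
)
print(prompt)
```

Under this assumed layout, the only lever a skill author has is the one-line description, whereas AGENTS.md content reaches the model unconditionally.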
How do you know? What if AGI can be implemented as a reasonably small set of logic rules, which implement what we call "epistemology" and "informal reasoning"? And this set of rules is just being run in a loop, producing better and better models of reality. It might even include RL, for what we know.
And what if LLMs already know all these rules? So they are AGI-complete without us knowing.
To borrow from Dennett, we understand LLMs from the physical stance (they are neural networks) and the design stance (they predict the next token of language), but do we understand them from an intentional stance, i.e. what rules they employ when they are running chain-of-thought, for example?
Indeed, they're not AGI. They're basically autocomplete on steroids.
They're very useful, but as we all know - they're far from infallible.
We're probably plateauing on the improvement of the core GPT technology. For these models and APIs to improve, it's things like Skills that need to be worked on and improved, to reduce those mistakes that it makes and produce better output.
So it's pretty disappointing to see that the 'Skills' feature set as implemented, as great of a concept as it is, is pretty bogus compared to just front loading the AGENTS.md file. This is not obvious and valuable to know.
I was thinking about that these days and experimenting like so: a system prompt that asks the agent to load any skills that seem relevant early, and a user prompt that asks the agent to do that later when a skill becomes relevant.
But seriously, this is my main answer to people telling me AI is not reliable: "guess what, most humans are not either, but at least I can tell AI to correct course and its ego won't get in the way of fixing the problem".
In fact, while AI is not nearly as good as a senior dev for non-trivial tasks yet, it is definitely more reliable than most junior devs at following instructions.
That's not the only useful takeaway. I found this to be true:
> "Explore project first, then invoke skill" [produces better results than] "You MUST invoke the skill".
I recently tried to get Antigravity to consistently adhere to my AGENTS.md (Antigravity uses GEMINI.md). The agent consistently ignored instructions in GEMINI.md like:
- "You must follow the rules in [..]/AGENTS.md"
- "Always refer to your instructions in [..]/AGENTS.md"
Yet, this works every time: "Check for the presence of AGENTS.md files in the project workspace."
This behavior is mysterious. It's like how, in earlier days, "let's think, step by step" invoked chain-of-thought behavior but analogous prompts did not.
Obviously directly including context in something like a system prompt will put it in context 100% of the time. You could just as easily take all of an agent's skills, feed it to the agent (in a system prompt, or similar) and it will follow the instructions more reliably.
However, at a certain point you have to use skills, because including everything in the context every time is wasteful, or not possible. This is the same reason Anthropic is doing advanced tool use (ref: https://www.anthropic.com/engineering/advanced-tool-use): there's not enough context to straight up include everything.
It's all a context / price trade-off. Obviously, if you have the context budget, just include what you can directly (in this case, compressing it into an AGENTS.md).
> Obviously directly including context in something like a system prompt will put it in context 100% of the time.
How do you suppose skills get announced to the model? It's all in the context in some way. The interesting part here is: Just (relatively naively) compressing stuff in the AGENTS.md seems to work better than however skills are implemented.
This is one of the reasons the RLM methodology works so well. You have access to as much information as you want in the overall environment, but only the things relevant to the task at hand get put into context for the current task, and it shows up there 100% of the time, as opposed to lossy "memory" compaction and summarization techniques, or probabilistic agent skills implementations.
Having an agent manage its own context ends up being extraordinarily useful, on par with the leap from non-reasoning to reasoning chats. There are still issues with memory and integration, and other LLM weaknesses, but agents are probably going to get extremely useful this year.
I think Vercel mixes skills and context configuration up. So the whole evaluation is totally misleading because it tests for two completely different use cases.
To sum it up: Vercel should use both, agents.md in combination with skills. The two serve totally different purposes.
1. You absolutely want to force certain context in, no questions or non-determinism asked (index and sparknotes). This can be done conditionally, but still rule based on the files accessed and other "context"
2. You want to keep it clean and only provide useful context as necessary (skills, search, mcp; and really an explore/query/compress mechanism around all of this, ralph wiggum is one example)
My reading was that copying the doc's ToC in markdown + links was significantly more effective than giving it a link to the ToC and instructions to read it.
So you’re not missing anything if you use Claude by yourself. You just update your local system prompt.
Instead it’s a problem when you’re part of a team and you’re using skills for standards like code style or architectural patterns. You can’t ask everyone to constantly update their system prompt.
I’ve been using symlinked agent files for about a year, as a hacky workaround from before skills became a thing, to load additional “context” for different tasks, and it might actually address the issue you’re talking about. Honestly, it’s worked so well for me that I haven’t really felt the need to change it.
You're right, the results are completely as expected.
The article also doesn't mention that they don't know how the compressed index affects output quality. That's always a concern with this kind of compression. Skills are just another, different kind of compression: one with a much higher compression rate and presumably less likely to negatively influence quality. The cost is that it doesn't always get invoked.
The article presents AGENTS.md as something distinct from Skills, but it is actually a simplified instance of the same concept. Their AGENTS.md approach tells the AI where to find instructions for performing a task. That’s a Skill.
I expect the benefit is from better Skill design, specifically, minimizing the number of steps and decisions between the AI’s starting state and the correct information. Fewer transitions -> fewer chances for error to compound.
1. Those I force into the system prompt using rules based systems and "context"
2. Those I let the agent lookup or discover
I also limit what gets into message parts, moving some of the larger token consumers to the system prompt so they only show once, most notably read/write_file.
Something that I always wonder with each blog post comparing different types of prompt engineering is did they run it once, or multiple times? LLMs are not consistent for the same task. I imagine they realize this of course, but I never get enough details of the testing methodology.
This drives me absolutely crazy. Non-falsifiable and non-deterministic results. All of this stuff is (at best) anecdotes and vibes being presented as science and engineering.
I always make a habit of doing a lot of duplicate runs when I benchmark for this reason. Joke's on me, in the time I spent doing 1 benchmark with real confidence intervals and getting no traction on my post, I could have done 10 shitty benchmarks or 1 shitty benchmark and 9x more blogspam. Perverse incentives rule us all.
TFA says they added an index to Agents.md that told the agent where to find all documentation and that was a big improvement.
The part I don't understand is that this is exactly how I thought skills work. The short descriptions are given to the model up-front and then it can request the full documentation as it wants. With skills this is called "Progressive disclosure".
Maybe they used more effective short descriptions in the AGENTS.md than they did in their skills?
The reported tables also don't match the screenshots. And their baselines and tests are too close to tell (judging by the screenshots not tables). 29/33 baseline, 31/33 skills, 32/33 skills + use skill prompt, 33/33 agent.md
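To put a number on "too close to tell", one can attach a confidence interval to each score. The sketch below uses the standard Wilson score interval for a binomial proportion; the four scores are the ones read off the screenshots above, and treating the 33 evals as independent pass/fail trials is an assumption made here, not something the article states.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at p = 0 or p = 1.
    return max(0.0, centre - half), min(1.0, centre + half)


# Scores as read off the screenshots in the comment above.
scores = {"baseline": 29, "skills": 31, "skills + prompt": 32, "AGENTS.md": 33}
for label, passed in scores.items():
    lo, hi = wilson_interval(passed, 33)
    print(f"{label:16s} {passed}/33  95% CI [{lo:.2f}, {hi:.2f}]")
```

With only 33 trials per configuration, the interval for the baseline overlaps the interval for AGENTS.md, which is the statistical version of "too close to tell".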
I also thought this is how skills work, but in practice I experienced similar issues. The agents I'm using (Gemini CLI, Opencode, Claude) all seem to have trouble activating skills on their own unless explicitly prompted. Yeah, probably this will be fixed over the next couple of generations but right now dumping the documentation index right into the agent prompt or AGENTS.md works much better for me. Maybe it's similar to structured output or tool calls which also only started working well after providers specifically trained their models for them.
I think this experiment has a fundamental flaw in its comparison setup.
What they're comparing is: (A) a skill with a short description in the frontmatter, which the agent may or may not decide to invoke, vs. (B) a massive compressed index of documentation paths dumped directly into AGENTS.md, which is always in context.
This isn't really "AGENTS.md vs skills." It's "always-in-context with high token count vs. lazy-loaded with a decision point." Of course the always-in-context version wins — you're giving the model way more information upfront. The agent literally can't miss it. That's not a surprising finding, it's almost tautological.
The more interesting question they don't address: what did their skill descriptions actually look like? In my experience, the quality of the frontmatter description is the single biggest factor in whether a skill gets invoked. A vague "Documentation lookup skill" will get ignored. A specific "Use this when the user asks about API endpoints, authentication, rate limits, or SDK usage for the Vercel platform" will get picked up reliably.
If you wrote equally detailed compressed pointers in AGENTS.md and equally detailed descriptions in skill frontmatter, the gap would likely be much smaller. The real takeaway isn't "skills are worse" — it's "if you don't invest effort in writing good skill descriptions, the agent won't know when to use them."
I'm not sure if this is widely known but you can do a lot better even than AGENTS.md.
Create a folder called .context and symlink anything in there that is relevant to the project. For example READMEs and important docs from dependencies you're using. Then configure your tool to always read .context into context, just like it does for AGENTS.md.
This ensures the LLM has all the information it needs right in context from the get-go. Much better performance, cheaper, and fewer mistakes.
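A minimal sketch of the `.context` setup described above, assuming the sources are plain files and that your tool can be pointed at the folder (that configuration step is tool-specific and not shown). The helper name `link_into_context` is hypothetical.

```python
import tempfile
from pathlib import Path


def link_into_context(project: Path, sources: list[Path]) -> list[Path]:
    """Symlink each source file into <project>/.context and return the links."""
    context_dir = project / ".context"
    context_dir.mkdir(parents=True, exist_ok=True)
    links = []
    for src in sources:
        link = context_dir / src.name
        if link.is_symlink() or link.exists():
            link.unlink()  # replace a stale link from an earlier run
        link.symlink_to(src.resolve())
        links.append(link)
    return links


# Demo in a throwaway directory so nothing in a real project is touched.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    readme = root / "dep" / "README.md"
    readme.parent.mkdir()
    readme.write_text("# dep docs\n")
    links = link_into_context(root / "project", [readme])
    print([p.name for p in links])
```

Links are keyed by file name, so two dependencies that both ship a `README.md` would collide; a real setup would want to prefix each link with the dependency name.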
Cheaper? Loading every bit of documentation into context every time, regardless of whether it’s relevant to the task the agent is working on? How? I’d much rather call out the location of relevant docs in Claude.md or Agents.md and tell the agent to read them only when needed.
Yeah, but the goal is not to bloat the context space.
Here you "waste" context by providing non-useful information.
What they did instead is put an index of the documentation into the context, so the LLM can then fetch the documentation. This is the same idea as skills, but it apparently works better without the agentic part of skills.
Furthermore, instead of having a nice index pointing to the docs, they compressed it.
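For illustration, a compact index of this kind could be generated from a docs directory roughly as follows. The output format (one `dir:{file,file}` group per directory, joined with `|`) is a hypothetical choice made for this sketch and is not claimed to be the format the article used.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def build_docs_index(docs_root: Path, suffix: str = ".md") -> str:
    """Return 'dir:{a.md,b.md}|dir2:{c.md}' for all docs under docs_root."""
    by_dir: dict[str, list[str]] = defaultdict(list)
    for path in sorted(docs_root.rglob(f"*{suffix}")):
        rel = path.relative_to(docs_root)
        by_dir[rel.parent.as_posix()].append(rel.name)
    # One group per directory keeps the index short: the directory name is
    # written once instead of being repeated for every file.
    groups = [d + ":{" + ",".join(names) + "}" for d, names in sorted(by_dir.items())]
    return "|".join(groups)


# Demo on a throwaway docs tree.
with tempfile.TemporaryDirectory() as tmp:
    docs = Path(tmp) / "docs"
    for rel in ("routing/a.md", "routing/b.md", "api/c.md"):
        target = docs / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("placeholder\n")
    print(build_docs_index(docs))
```

The resulting string is what would be pasted into AGENTS.md; the documents themselves stay on disk for the agent to read on demand.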
Docs of dependencies aren't that much of a game changer. Multiple frameworks and libraries have been releasing llm.txt compressed versions of their docs for ages, and it doesn't make that much of a difference (I mean, it does, but it's not crucial, as LLMs can find the docs on their own, even online if needed).
What's actually useful is to put the source code of your dependencies in the project.
I have a `_vendor` dir at the root, and inside it I put multiple git subtrees for the major dependencies and download the source code for the tag you're using.
That way the LLM has access to the source code and the tests, which is way more valuable than docs because the LLM can figure out how stuff works exactly by digging into it.
This largely mirrors my experience building my custom agent
1. Start from the Claude Code extracted instructions, they have many things like this in there. Their knowledge sharing in docs and blog posts on this aspect is second to none
2. Use AGENTS.md as a table of contents and sparknotes, put them everywhere, load them automatically
3. Have topical markdown files / skills
4. Make great tools, this is still opaque in my mind to explain, lots of overlap with MCP and skills, conceptually they are the same to me
5. Iterate, experiment, do weird things, and have fun!
I changed read/write_file to put contents in the state and presented in the system prompt, same for the agents.md, now working on evals to show how much better this is, because anecdotally, it kicks ass
> I changed read/write_file to put contents in the state and presented in the system prompt, same for the agents.md, now working on evals to show how much better this is, because anecdotally, it kicks ass.
Can you detail this a bit more? Do you put the actual contents of the file in the system prompt? Forever?
PreSession Hook from obra/superpowers injects this along with more logic for getting rid of rationalizing out of using skills:
> If you think there is even a 1% chance a skill might apply to what you are doing, you ABSOLUTELY MUST invoke the skill.
IF A SKILL APPLIES TO YOUR TASK, YOU DO NOT HAVE A CHOICE. YOU MUST USE IT.
While this may result in overzealous activation of skills, I've found that if I have a skill related, I _want_ to use it. It has worked well for me.
In a month or three we’ll have the sensible approach, which is smaller cheaper fast models optimized for looking at a query and identifying which skills / context to provide in full to the main model.
It’s really silly to waste big model tokens on throat clearing steps
What if instead of needing to run a codemod to cache per-lib docs locally, documentation could be distributed alongside a given lib, as a dev dependency, version locked, and accessible locally as plaintext. All docs can be linked in node_modules/.docs (like binaries are in .bin). It would be a sort of collection of manuals.
Firstly this is great work from Vercel - I am especially impressed with the evals setup (evals are the most undervalued component in any project IMO). Secondly the result is not surprising and I’ve seen consistently the increase in performance when you always include an index (or in my case, Table of Contents as a json structure) in your system prompt. Applying this outside of coding agents (like classic document retrieval) also works very well!
I'm a bit confused by their claims. Or maybe I'm misunderstanding how Skills should work. But from what I know (and the small experience I had with them), skills are meant to be specifications for niche and well defined areas of work (i.e. building the project, running custom pipelines etc.)
If your goal is to always give a permanent knowledge base to your agent that's exactly what AGENTS.md is for...
Prompted and built a bit of an extension of skills.sh with https://passivecontext.dev it basically just takes the skill and creates that "compressed" index. Still have to install the skill and all that, but might give others a bit of a short cut to experiment with.
I did a similar set of evals myself utilising the baseline capabilities that Phoenix (elixir) ships with and then skillified them.
Regularly the skills were not being loaded and thus not utilised. The outputs themselves were fine. This suggested that at some stage through the improvements of the models that baseline AGENTS.md had become redundant.
Measuring in terms of KB is not quite as useful as it seems here IMO - this should be measured in terms of context tokens used.
I ran their tool with an otherwise empty CLAUDE.md, and ran `claude /context`, which showed 3.1k tokens used by this approach (1.6% of the opus context window, bit more than the default system prompt. 8.3% is system tools).
Otherwise it's an interesting finding. The nudge seems like the real winner here, but potential further lines of inquiry that would be really illuminating:
1. How do these approaches scale with model size?
2. How are they impacted by multiple such clauses/blocks? Ie maybe 10 `IMPORTANT` rules dilute their efficacy
3. Can we get best of both worlds with specialist agents / how effective are hierarchical routing approaches really? (idk if it'd make sense for vercel specifically to focus on this though)
Over the last week I dug deeper into using agent mode at work, and my experiments align with this observation.
The first thing that surprised me is how much the default tuning leans toward laudatory stances: the user is always absolutely right, and whatever was done supposedly solves everything expected. But actually no: not a single actual check was done, a ton of code was produced but the goal is not at all achieved, and of course many regressions now lurk in the code base, when it's not straight up breaking everything (which is at least less insidious).
The other thing that is surprising to me is that it can easily drop thousands of lines of tests, and then it can be forced to loop over these tests until it succeeds. In my experiments it still drops far too much noise code, but at least the burden of checking whether it makes any sense is drastically reduced.
I don't think you can really learn from this experiment unless you specify which models you used, if you tried it against at least 3 frontier models, if you ran each eval multiple times, and what prompts you tried.
These things are non-deterministic across multiple axes.
So the root cause was the model's indisposition to calling the skills. That seems contrary to what we see with function calling. Models call functions quite reliably most of the time. This is more likely because of the instructions not being clear about what skills are, as this snippet, albeit in isolation, seems to suggest:
> Before writing code, first explore the project structure, then invoke the nextjs-doc skill for documentation.
I have a SKILL.md for marimo notebooks with instructions in the frontmatter to always read it before working with marimo files. But half the time Claude Code still doesn't invoke it even with me mentioning marimo in the first conversation turn.
I've resorted to typing "read marimo skill" manually and that works fine. Technically you can use skills with slash commands but that automatically sends off the message too which just wastes time.
But the actual concept of instructions to load in certain scenarios is very good and has been worth the time to write up the skill.
Blackbox oracles make bad workflows, and tend to produce a whole lot of cargo culting. It's this kind of opacity (why does the markdown outperform agents? there's no real way to find out, even with a fully open or house model because the nature of the beast is that the execution path in a model can't be predicted) that makes me shy away from saying LLMs are "just another tool". If I can't see inside it -- and if even the vendor can't really see inside of it -- there's something fundamentally different.
Not so obvious, because the model still needs to look up the required doc. The article glosses over this detail a little bit, unfortunately. The model needs to decide when to use a skill, but doesn’t it also need to decide when to look up documentation instead of relying on pretraining data?
The problem is that Agents.md is only read on initial load. Once context grows too large the agent will not reload the md file and loses / forgets the info from Agents.md.
Other comments suggest that the Agents.md is read into the system prompt and never leaves the context. But it's better to avoid excessive context regardless
That's the thing that bothers me here. They loaded the doc of course it will work but as your project grows you won't be able to put all your documentation in there (at least with current context handling).
Skills are still very much relevant on big and diverse projects.
Are people running into mismatched code vs project a lot? I've worked on python and java codebases with claude code and have yet to run into a version mismatch issue. I think maybe once it got confused on the api available in python, but it fixed it by itself. From other blog posts similar to this it would seem to be a widespread problem, but I have yet to see it as a big problem as part of my day job or personal projects.
Sounds like they've been using skills incorrectly if they're finding their agents don't invoke the skills. I have Claude Code agents calling my skills frequently, almost every session. You need to make sure your skill descriptions are well defined and describe when to use them and that your tasks / goals clearly set out requirements that align with the available skills.
I think if you read it, their agents did invoke the skills and they did find ways to increase the agents' use of skills quite a bit. But the new approach works 100% of the time as opposed to 79% of the time, which is a big deal. Skills might be working OK for you at that 79% level and for your particular codebase/tool set, that doesn't negate anything they've written here.
I have a skill in a project named "determine-feature-directory" with a short description explaining that it is meant to determine the feature directory of a current branch. The initial prompt I provide will tell it to determine the feature directory and do other work. Claude will even state "I need to determine the feature directory..."
Then, about 5-10% of the time, it will not use the skill. It does use the skill most of the time, but the low failure rate is frustrating because it makes it tough to tell whether or not a prompt change actually improved anything. Of course I could be doing something wrong, but it does work most of the time. I miss deterministic bugs.
Recently, I stopped Claude after it skipped using a skill and just said "Aren't you forgetting something?". It then remembered to use the skill. I found that amusing.
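The "tough to tell whether a prompt change improved anything" problem can be quantified with a back-of-the-envelope calculation. Assuming each run is an independent trial with a fixed chance of skipping the skill (an idealisation), the sketch below computes how likely a streak of clean runs is if nothing actually changed, and how many clean runs in a row it takes before that explanation becomes implausible. The function names are made up for this example.

```python
import math


def prob_all_pass(failure_rate: float, runs: int) -> float:
    """Chance of seeing zero skipped-skill runs if the true rate is unchanged."""
    return (1.0 - failure_rate) ** runs


def runs_needed(failure_rate: float, alpha: float = 0.05) -> int:
    """Smallest n with (1 - failure_rate)**n <= alpha."""
    if not 0.0 < failure_rate < 1.0 or not 0.0 < alpha < 1.0:
        raise ValueError("failure_rate and alpha must be strictly between 0 and 1")
    return math.ceil(math.log(alpha) / math.log(1.0 - failure_rate))


for rate in (0.05, 0.10):
    print(f"failure rate {rate:.0%}: "
          f"P(20 clean runs by luck) = {prob_all_pass(rate, 20):.2f}, "
          f"clean runs needed at alpha=0.05: {runs_needed(rate)}")
```

At a 5% skip rate, 20 clean runs in a row still happen by luck about a third of the time, and it takes 59 consecutive clean runs before chance alone explains them less than 5% of the time. That is why a handful of manual retries cannot confirm a fix.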
I will have to look into this this weekend. Antigravity is my current favorite agentic IDE and I have been having problems getting it to explicitly follow my agent.md settings.
If I remind it, it will go, "oh yes, ok, sure," and then do it, but the whole point is that I want to optimize my time with the agent.
I feel like all agents currently do better if you explicitly end with "Remember to follow AGENTS.md", even if that's automatically injected into the context. Seems the same across all I'm using.
I need to evaluate how different project scaffolding impacts the results of Claude Code/Opencode (either with Anthropic models or third party) for agentic purposes.
But I am unsure how I should be testing, and it's not very clear how Vercel proceeded here.
It's very interesting but presenting success rates without any measure of the error, or at least inline details about the number of iterations is unprofessional. Especially for small differences or when you found the "same" performance.
When we were trying to build our own agents we put quite a bit of effort on evals which was useful.
But switching over to using coding agents we never did the same. Feels like building an eval set will be an important part of what engg orgs do going forward.
This does not normalize for tokens used. If their skill description was as large as the docs index and contained all the reasons the LLM might want to use the skill, it would likely perform much better than just one sentence as well.
I’m working on an AGI model that will make the discussion of skills look silly. Skills strikes in the right direction in some sense but it’s an extremely weak 1% echo of what’s actually needed to solve this problem.
It seems their tests rely on Claude alone. It’s not safe to assume that Codex or Gemini will behave the same way as Claude. I use all three and each has its own idiosyncrasies.
In my experience, agents only follow the first two or three lines of AGENTS.md + message. As the context grows, they start following random rules and ignoring others.
Agents.md, skills, MCP, tools, etc. There's going to be a lot of areas explored that may yield no/negative benefit over the fundamentals of a clear prompt.
You will see another 14% bump in performance if you include in the first 16 lines of the README.md in the project, "Coding agents and LLM, see AGENTS.md"
I was also looking for specific info on the evals, because I wanted to see if they were separately confirming that shoving the skills into the main context didn't degrade the non-skills evals. That's the other side of skills, beyond the ability to do the thing: they don't pollute the main context window with unnecessary information.
i dont know why, but this just feels like the most shallow “i compare llms based on the specs” kind of analysis you can get… it has extreme “we couldn’t get the llm to intuit what we wanted to do, so we assumed that it was a problem with the llm and we overengineered a way to make better prompts completely by accident” energy…
That feels like a stupid article. well of course if you have one single thing you want to optimize putting it into AGENTS.md is better. but the advantage of skills is exactly that you don't cram them all into the AGENTS file. Let's say you had 3 different elaborate things you want the agent to do. good luck putting them all in your AGENTS.md and later hoping that the agent remembers any of it. After all the key advantage of the SKILLs is that they get loaded to the end of the context when needed
See the rest of the comments for examples of pedantic discussions about terms that are ultimately somewhat arbitrary and, if anything, suggest the singularity will be runaway technobabble, not technological progress.
You need the model to interpret documentation as policy you care about (in which case it will pay attention) rather than as something it can look up if it doesn’t know something (which it will never admit). It helps to really internalise the personality of LLMs as wildly overconfident but utterly obsequious.
> When it needs specific information, it reads the relevant file from the .next-docs/ directory.
I guess you need to make sure your file paths are self-explanatory and fairly unique, otherwise the agent might bring extra documentation into the context trying to find which file had what it needed?
This comment instantly set off my LLM alarm bells. Went into the profile, and guess what: next comment (not a one-liner) [0] on a completely different topic was posted 35 seconds later. And includes the classic "aren't just A. They're B.".
Why are you doing this? Karma? 8 years old account and first post 3 days ago is a Show HN shilling your "AI agent" SaaS with a boatload of fake comments? [1]
motoboi|1 month ago
vidarh|1 month ago
js8|1 month ago
themoose8|1 month ago
They're very useful, but as we all know - they're far from infallible.
We're probably plateauing on the improvement of the core GPT technology. For these models and APIs to improve, it's things like Skills that need to be worked on and improved, to reduce those mistakes that it makes and produce better output.
So it's pretty disappointing to see that the 'Skills' feature set as implemented, as great of a concept as it is, is pretty bogus compared to just front loading the AGENTS.md file. This is not obvious, and it is valuable to know.
baby|1 month ago
DanOpcode|1 month ago
bzGoRust|1 month ago
anal_reactor|1 month ago
https://en.wikipedia.org/wiki/GNU/Linux_naming_controversy
tottenhm|1 month ago
The agent passes the Turing test...
cainxinth|1 month ago
BiteCode_dev|1 month ago
But seriously, this is my main answer to people telling me AI is not reliable: "guess what, most humans are not either, but at least I can tell AI to correct course and its ego won't get in the way of fixing the problem".
In fact, while AI is not nearly as good as a senior dev for non-trivial tasks yet, it is definitely more reliable than most junior devs at following instructions.
w10-1|1 month ago
It's barely readable to humans, but directly and efficiently relevant to LLMs (direct reference -> referent, without language verbiage).
This suggests some (compressed) index format that is always loaded into context will replace heuristics around agents.md/claude.md/skills.md.
So I would bet this year we get some normalization of both the indexes and the referenced documentation (esp. matching terms).
Possibly also a side issue: APIs could repurpose their test suites as validation to compare LLM performance on code tasks.
LLMs create huge adoption waves. Libraries/APIs will have to learn to surf them or be limited to usage by humans.
postalcoder|1 month ago
- "You must follow the rules in [..]/AGENTS.md"
- "Always refer to your instructions in [..]/AGENTS.md"
Yet, this works every time: "Check for the presence of AGENTS.md files in the project workspace."
This behavior is mysterious. It's like how, in earlier days, "let's think, step by step" invoked chain-of-thought behavior but analogous prompts did not.
jcheng|1 month ago
ai-christianson|1 month ago
seunosewa|1 month ago
jgbuddy|1 month ago
Obviously directly including context in something like a system prompt will put it in context 100% of the time. You could just as easily take all of an agent's skills, feed it to the agent (in a system prompt, or similar) and it will follow the instructions more reliably.
However, at a certain point you have to use skills, because including it in the context every time is wasteful, or not possible. This is the same reason Anthropic is doing advanced tool use (ref: https://www.anthropic.com/engineering/advanced-tool-use): there's not enough context to straight up include everything.
It's all a context / price trade-off. Obviously, if you have the context budget, just include what you can directly (in this case, compressing into an AGENTS.md).
jstummbillig|1 month ago
How do you suppose skills get announced to the model? It's all in the context in some way. The interesting part here is: Just (relatively naively) compressing stuff in the AGENTS.md seems to work better than however skills are implemented.
observationist|1 month ago
Having an agent manage its own context ends up being extraordinarily useful, on par with the leap from non-reasoning to reasoning chats. There are still issues with memory and integration, and other LLM weaknesses, but agents are probably going to get extremely useful this year.
_the_inflator|1 month ago
I think Vercel mixes skills and context configuration up. So the whole evaluation is totally misleading because it tests for two completely different use cases.
To sum it up: Vercel should use both files, agents.md in combination with skills. The two serve totally different purposes.
verdverm|1 month ago
1. You absolutely want to force certain context in, no questions or non-determinism asked (index and sparknotes). This can be done conditionally, but still rule based on the files accessed and other "context"
2. You want to keep it clean and only provide useful context as necessary (skills, search, mcp; and really an explore/query/compress mechanism around all of this, ralph wiggum is one example)
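A rough sketch of what the first case (rule-based forcing, conditional on the files accessed) could look like. The rule table, the file paths and the `forced_context` helper are all made up for illustration; this is one possible shape, not a description of any particular agent.

```python
from fnmatch import fnmatch

# Hypothetical rules: glob pattern -> context file to force in.
RULES = {
    "*.sql": "docs/db-index.md",
    "app/*.tsx": "docs/frontend-index.md",
}


def forced_context(accessed_files: list[str], rules: dict[str, str] = RULES) -> list[str]:
    """Return context files to inject, in rule order, without duplicates."""
    selected = []
    for pattern, context_file in rules.items():
        if any(fnmatch(path, pattern) for path in accessed_files):
            if context_file not in selected:
                selected.append(context_file)
    return selected
```

The point of the design is that nothing here depends on the model deciding anything: the same accessed files always produce the same injected context.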
teknopaul|1 month ago
Which makes sense.
& some numbers that prove that.
singingbard|1 month ago
Instead it’s a problem when you’re part of a team and you’re using skills for standards like code style or architectural patterns. You can’t ask everyone to constantly update their system prompt.
Claude skill adherence is very low.
orlandohohmeier|1 month ago
deaux|1 month ago
The article also doesn't mention that they don't know how the compressed index affects output quality. That's always a concern with this kind of compression. Skills are just another, different kind of compression. One with a much higher compression rate and presumably less likely to negatively influence quality. The cost being that it doesn't always get invoked.
TeeWEE|1 month ago
In Claude Code you can invoke an agent when you want as a developer and it copies the file content as context in the prompt.
thorum|1 month ago
I expect the benefit is from better Skill design, specifically, minimizing the number of steps and decisions between the AI’s starting state and the correct information. Fewer transitions -> fewer chances for error to compound.
verdverm|1 month ago
1. Those I force into the system prompt using rules based systems and "context"
2. Those I let the agent lookup or discover
I also limit what gets into message parts, moving some of the larger token consumers to the system prompt so they only show once, most notably read/write_file
jryan49|1 month ago
only-one1701|1 month ago
CuriouslyC|1 month ago
EnPissant|1 month ago
TFA says they added an index to Agents.md that told the agent where to find all documentation and that was a big improvement.
The part I don't understand is that this is exactly how I thought skills work. The short descriptions are given to the model up-front and then it can request the full documentation as it wants. With skills this is called "Progressive disclosure".
Maybe they used more effective short descriptions in the AGENTS.md than they did in their skills?
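For reference, the progressive-disclosure idea described above can be sketched in a few lines. This is a toy model under stated assumptions: `SkillRegistry` and its methods are hypothetical names, not the API of Claude Code or any other agent.

```python
class SkillRegistry:
    """Toy model of progressive disclosure; not any agent's real API."""

    def __init__(self) -> None:
        # name -> (short description, full documentation body)
        self._skills: dict[str, tuple[str, str]] = {}

    def register(self, name: str, description: str, body: str) -> None:
        self._skills[name] = (description, body)

    def index(self) -> str:
        """Short descriptions only: what the model sees up front."""
        return "\n".join(f"{name}: {desc}" for name, (desc, _) in self._skills.items())

    def load(self, name: str) -> str:
        """Full documentation, returned only when the model asks for it."""
        return self._skills[name][1]
```

Structurally, an index in AGENTS.md that points at documentation files is the same thing as `index()` here, which is why the quality of the short descriptions is the obvious variable to suspect.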
NitpickLawyer|1 month ago
sally_glance|1 month ago
alex_metacraft|19 days ago
What they're comparing is: (A) a skill with a short description in the frontmatter, which the agent may or may not decide to invoke, vs. (B) a massive compressed index of documentation paths dumped directly into AGENTS.md, which is always in context.
This isn't really "AGENTS.md vs skills." It's "always-in-context with high token count vs. lazy-loaded with a decision point." Of course the always-in-context version wins — you're giving the model way more information upfront. The agent literally can't miss it. That's not a surprising finding, it's almost tautological.
The more interesting question they don't address: what did their skill descriptions actually look like? In my experience, the quality of the frontmatter description is the single biggest factor in whether a skill gets invoked. A vague "Documentation lookup skill" will get ignored. A specific "Use this when the user asks about API endpoints, authentication, rate limits, or SDK usage for the Vercel platform" will get picked up reliably.
If you wrote equally detailed compressed pointers in AGENTS.md and equally detailed descriptions in skill frontmatter, the gap would likely be much smaller. The real takeaway isn't "skills are worse" — it's "if you don't invest effort in writing good skill descriptions, the agent won't know when to use them."
chr15m|1 month ago
Create a folder called .context and symlink anything in there that is relevant to the project. For example READMEs and important docs from dependencies you're using. Then configure your tool to always read .context into context, just like it does for AGENTS.md.
This ensures the LLM has all the information it needs right in context from the get go. Much better performance, cheaper, and fewer mistakes.
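A minimal sketch of the `.context` setup described above, using only the standard library. The helper names `link_into_context` and `read_context` are hypothetical, and how a given tool is configured to load the result is left out because that differs per tool.

```python
from pathlib import Path


def link_into_context(project: Path, sources: list[Path]) -> Path:
    """Create <project>/.context and symlink each source file into it."""
    context_dir = project / ".context"
    context_dir.mkdir(exist_ok=True)
    for source in sources:
        link = context_dir / source.name
        # Skip names that are already present (including dangling links).
        if link.is_symlink() or link.exists():
            continue
        link.symlink_to(source.resolve())
    return context_dir


def read_context(context_dir: Path) -> str:
    """Concatenate every file in .context, labelled by name, in sorted order."""
    parts = []
    for path in sorted(context_dir.iterdir()):
        if path.is_file():
            parts.append(f"# {path.name}\n{path.read_text()}")
    return "\n\n".join(parts)
```

Symlinks keep a single source of truth: when a dependency's README changes on disk, the text fed into context changes with it.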
gbnwl|1 month ago
d3m0t3p|1 month ago
epolanski|1 month ago
What's actually useful is to put the source code of your dependencies in the project.
I have a `_vendor` dir at the root, and inside it I put multiple git subtrees for the major dependencies and download the source code for the tag you're using.
That way the LLM has access to the source code and the tests, which is way more valuable than docs because the LLM can figure out how stuff works exactly by digging into it.
TeeWEE|1 month ago
You don’t want to be burning tokens and large files will give diminishing returns as is mentioned in the Claude Code blog.
verdverm|1 month ago
1. Start from the Claude Code extracted instructions, they have many things like this in there. Their knowledge share in docs and blog on this aspect are bar none
2. Use AGENTS.md as a table of contents and sparknotes, put them everywhere, load them automatically
3. Have topical markdown files / skills
4. Make great tools, this is still opaque in my mind to explain, lots of overlap with MCP and skills, conceptually they are the same to me
5. Iterate, experiment, do weird things, and have fun!
I changed read/write_file to put contents in the state, which is presented in the system prompt; same for the agents.md. I'm now working on evals to show how much better this is, because anecdotally, it kicks ass
aktau|27 days ago
Can you detail this a bit more? Do you put the actual contents of the file in the system prompt? Forever?
BenoitEssiambre|1 month ago
denolfe|1 month ago
> If you think there is even a 1% chance a skill might apply to what you are doing, you ABSOLUTELY MUST invoke the skill. IF A SKILL APPLIES TO YOUR TASK, YOU DO NOT HAVE A CHOICE. YOU MUST USE IT.
While this may result in overzealous activation of skills, I've found that if I have a skill related, I _want_ to use it. It has worked well for me.
stingraycharles|1 month ago
works pretty well
ares623|1 month ago
rao-v|1 month ago
It’s really silly to waste big model tokens on throat clearing steps
Calavar|1 month ago
meatcar|1 month ago
What a wonderful world that would be.
tobyjsullivan|1 month ago
armcat|1 month ago
thevinter|1 month ago
If your goal is to always give a permanent knowledge base to your agent that's exactly what AGENTS.md is for...
holocen|1 month ago
gpm|1 month ago
wakeless|1 month ago
Regularly the skills were not being loaded and thus not utilised. The outputs themselves were fine. This suggested that, at some stage through the improvements of the models, the baseline AGENTS.md had become redundant.
micimize|1 month ago
I ran their tool with an otherwise empty CLAUDE.md, and ran `claude /context`, which showed 3.1k tokens used by this approach (1.6% of the opus context window, bit more than the default system prompt. 8.3% is system tools).
Otherwise it's an interesting finding. The nudge seems like the real winner here, but potential further lines of inquiry that would be really illuminating: 1. How do these approaches scale with model size? 2. How are they impacted by multiple such clauses/blocks? Ie maybe 10 `IMPORTANT` rules dilute their efficacy 3. Can we get best of both worlds with specialist agents / how effective are hierarchical routing approaches really? (idk if it'd make sense for vercel specifically to focus on this though)
psychoslave|1 month ago
The first thing that is surprising to me is how much the default tuning leans toward laudatory stances: the user is always absolutely right, what was done solves everything expected. But actually no, not a single actual check was done, a ton of code was produced but the goal is not at all achieved, and of course many regressions now lurk in the code base, when it's not straight breaking everything (which is at least less insidious).
The other thing that is surprising to me is that it can easily drop thousands of lines of tests, and then it can be forced to loop over these tests until it succeeds. In my experiments it still drops far too much noise code, but at least the burden of checking if it looks like it makes any sense is drastically reduced.
hu3|1 month ago
And I have been trying to improve the framework and abstractions/types to reduce the lines of code required for LLMs to create features in my web app.
Did the LLM really need to spit out 1k lines for this feature? Could I create abstractions to make it feasible in under 300 lines?
Of course there's cost and diminishing returns to abstractions so there are tradeoffs.
underdeserver|1 month ago
These things are non-deterministic across multiple axes.
farhanhubble|26 days ago
> Before writing code, first explore the project structure, then invoke the nextjs-doc skill for documentation.
msp26|1 month ago
I have a SKILL.md for marimo notebooks with instructions in the frontmatter to always read it before working with marimo files. But half the time Claude Code still doesn't invoke it even with me mentioning marimo in the first conversation turn.
I've resorted to typing "read marimo skill" manually and that works fine. Technically you can use skills with slash commands but that automatically sends off the message too which just wastes time.
But the actual concept of instructions to load in certain scenarios is very good and has been worth the time to write up the skill.
bandrami|1 month ago
pietz|1 month ago
Skills are new. Models haven't been trained on them yet. Give it 2 months.
WA|1 month ago
someguyiguess|1 month ago
taberiand|1 month ago
remify|1 month ago
Skills are still very much relevant on big and diverse projects.
bushbaba|1 month ago
smrtinsert|1 month ago
smcleod|1 month ago
velcrovan|1 month ago
joebates|1 month ago
I have a skill in a project named "determine-feature-directory" with a short description explaining that it is meant to determine the feature directory of a current branch. The initial prompt I provide will tell it to determine the feature directory and do other work. Claude will even state "I need to determine the feature directory..."
Then, about 5-10% of the time, it will not use the skill. It does use the skill most of the time, but the low failure rate is frustrating because it makes it tough to tell whether or not a prompt change actually improved anything. Of course I could be doing something wrong, but it does work most of the time. I miss deterministic bugs.
Recently, I stopped Claude after it skipped using a skill and just said "Aren't you forgetting something?". It then remembered to use the skill. I found that amusing.
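On the "tough to tell whether a prompt change improved anything" part: with a skip rate in the 5-10% range, a handful of runs cannot separate the two. A rough way to see this is a normal-approximation (Wald) interval on the invocation rate. The function name is made up, and the Wald interval is known to be crude near 0% and 100%, so treat this as a back-of-the-envelope sketch only.

```python
import math


def invocation_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% interval for the skill-invocation rate (Wald interval)."""
    rate = successes / trials
    half_width = z * math.sqrt(rate * (1 - rate) / trials)
    return max(0.0, rate - half_width), min(1.0, rate + half_width)


# 18/20 vs 19/20 invocations: the intervals overlap heavily,
# so 20 runs per prompt cannot distinguish a 90% rate from a 95% rate.
before = invocation_interval(18, 20)
after = invocation_interval(19, 20)
```

The half-width shrinks with the square root of the number of trials, so halving the uncertainty takes roughly four times as many runs.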
heliumtera|1 month ago
*You are the Super Duper Database Master Administrator of the Galaxy*
does not improve the model's ability to reason about databases?
robertheadley|1 month ago
If I remind it, it will go, "oh yes, ok, sure." then do it, but the whole point is that I want to optimize my time with the agent.
embedding-shape|1 month ago
epolanski|1 month ago
I need to evaluate how different project scaffolding impacts the results of Claude Code/Opencode (either with Anthropic models or third party) for agentic purposes.
But I am unsure how I should be testing, and it's not very clear how Vercel proceeded here.
hahahahhaah|1 month ago
minimal_action|1 month ago
whinvik|1 month ago
But switching over to using coding agents we never did the same. Feels like building an eval set will be an important part of what engg orgs do going forward.
j45|1 month ago
There is a lot of language floating around for what are effectively groups of text files put together in different configurations, or selected reliably.
jascha_eng|1 month ago
user3939382|1 month ago
underlines|1 month ago
Just create an MCP server that does embedding retrieval or agentic retrieval with a sub agent on your framework docs.
Finally add an instruction to AGENTS.md to look up stuff using that MCP.
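The retrieval half of that can be sketched without any MCP plumbing. The code below is only an illustration of similarity-based lookup: `embed` is a bag-of-words stand-in for a real embedding model, the doc names are invented, and wiring this into an actual MCP server is deliberately left out.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: dict[str, str], top_k: int = 1) -> list[str]:
    """Return the names of the top_k docs most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(docs, key=lambda name: cosine(query_vec, embed(docs[name])), reverse=True)
    return ranked[:top_k]
```

Note that this still has the decision point discussed elsewhere in the thread: the agent has to choose to call the lookup before any of it helps.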
farhanhubble|26 days ago
Does the model even understand what this line even means?
sheepscreek|1 month ago
verdverm|1 month ago
guluarte|1 month ago
xnx|1 month ago
aaroninsf|1 month ago
sghiassy|1 month ago
ChrisArchitect|1 month ago
rohitghumare|1 month ago
tanishqkanc|1 month ago
onnimonni|1 month ago
JamesSwift|1 month ago
keeganpoppen|1 month ago
sothatsit|1 month ago
AndyNemmity|1 month ago
Which is why I use a skill that is a command, that routes requests to agents and skills.
tdiff|1 month ago
rcarmo|1 month ago
carterschonwald|1 month ago
shinhyeok|1 month ago
killerstorm|1 month ago
CjHuber|1 month ago
meeech|1 month ago
thighbaugh|1 month ago
------> Captain Obvious Strikes Again! <----------
See the rest of the comments for examples of pedantic discussions about terms that are ultimately somewhat arbitrary and, if anything, suggest the singularity will be runaway technobabble, not technological progress.
thom|1 month ago
delduca|1 month ago
heliumtera|1 month ago
they used prisma to handle their database interactions. they preached tRPC and screamed TYPE SAFETY!!!
you really think these guys will ever again touch the keyboard to program? they despise programming.
newzino|1 month ago
[deleted]
jstummbillig|1 month ago
fatheranton|1 month ago
[deleted]
songodongo|1 month ago
I guess you need to make sure your file paths are self-explanatory and fairly unique, otherwise the agent might bring extra documentation into the context trying to find which file had what it needed?
devonkelley|1 month ago
[deleted]
deaux|1 month ago
Why are you doing this? Karma? 8 years old account and first post 3 days ago is a Show HN shilling your "AI agent" SaaS with a boatload of fake comments? [1]
Pinging tomhow
[0] https://news.ycombinator.com/item?id=46820417
[1] https://news.ycombinator.com/item?id=46782579