postalcoder|26 days ago
codex + skills fine-tunes Qwen3-0.6B to +6 on HumanEval and beats the base score on the first run.
I reran the experiment from this week, but used codex's new skills integration. Like Claude Code, codex consumes the full skill into context and doesn't start with failing runs. Its first run beats the base score, and on the second run it beats Claude Code.
https://xcancel.com/ben_burtenshaw/status/200023306951767675...
That said, it's not a perfect comparison because of the Codex model mismatch between runs.
The author seems to be doing a lot of work on skills evaluation.
iainmerrick|26 days ago
To be clear, I'm suggesting that any specific format for "skills.md" is a red herring, and all you need to do is provide the LLM with good clear documentation.
A useful comparison would be between: a) making a carefully organised .skills/ folder, b) putting the same info anywhere and just linking to it from your top-level doc, c) just dumping everything directly in the top-level doc.
My guess is that it's a good idea to break stuff out into separate sections, to avoid polluting the context with stuff you don't need, but that the specific way you do that very likely isn't important at all. So (a) and (b) would perform about the same.
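For concreteness, option (a) might look something like this. A hedged sketch: the SKILL.md name/description frontmatter follows the published Agent Skills convention, but the folder layout, skill names, and file contents here are made-up examples, not anything from the thread:

```
.skills/
  pdf-processing/
    SKILL.md            # frontmatter + instructions, loaded on demand
    scripts/extract.py
  db-migrations/
    SKILL.md

# .skills/pdf-processing/SKILL.md (hypothetical)
---
name: pdf-processing
description: Extract text and tables from PDF files; use when the user supplies a PDF.
---
Step-by-step instructions the model only reads when this skill looks relevant...
```

Option (c) would be the same text pasted straight into the top-level doc, paying its full context cost on every run whether or not the task needs it.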
postalcoder|26 days ago
My guess is that the standardization is going to make its way into how the models are trained, and Skills will eventually pull ahead. [0]
[0] https://vercel.com/blog/agents-md-outperforms-skills-in-our-...
dragonwriter|25 days ago
Agent Skills isn't a spec for how information is presented to the model; it's a spec whose consumer is the model harness. The harness might present the information made available to it in that format to the model in different ways for different harnesses, or even in the same harness for different models or tasks, considering things like the number and size of the skill(s) available, the size of the model context, the purpose of the harness (is it a narrow-purpose agent where some of the skills are central to that purpose?), and user preference settings.
The site itself describes two main styles of integration for harnesses ("tool based" and "filesystem based"), but those are more of a starting point for implementers than an exhaustive listing.
The idea is that skill authors don't need to know or care how the harness is presenting the information to the model.
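As an illustration, a "filesystem based" harness might advertise only skill names and descriptions up front and let the model fetch the rest with its file tools. A hedged Python sketch, assuming the made-up .skills/ layout from earlier; none of the function names here come from the spec:

```python
from pathlib import Path

def load_skill_metadata(skills_dir: str) -> list[dict]:
    """Scan each skill's SKILL.md and pull key/value pairs out of the
    YAML frontmatter (naive parser; a real harness would use a YAML library)."""
    skills = []
    for skill_md in sorted(Path(skills_dir).glob("*/SKILL.md")):
        meta = {"path": str(skill_md)}
        lines = skill_md.read_text().splitlines()
        if lines and lines[0].strip() == "---":
            for line in lines[1:]:
                if line.strip() == "---":
                    break
                key, _, value = line.partition(":")
                meta[key.strip()] = value.strip()
        skills.append(meta)
    return skills

def skills_preamble(skills: list[dict]) -> str:
    """Only names and descriptions go into the prompt up front; the model
    reads a full SKILL.md with its file tools when a skill looks relevant."""
    entries = "\n".join(
        f"- {s.get('name', '?')} ({s['path']}): {s.get('description', '')}"
        for s in skills
    )
    return "Available skills (read the SKILL.md for full instructions):\n" + entries
```

The point of the spec is that the skill author writes the same SKILL.md regardless of whether the harness does this, dumps everything into the prompt, or does something else entirely.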
9dev|26 days ago
Plus, as has been mentioned multiple times here, standardized skills are much more about letting different harnesses consistently load skills into the context window programmatically. Not every AI workload is a local coding agent.
dragonwriter|25 days ago
A harness can present the same skill information to the model in several ways:
(1) providing a bash tool with direct access to the filesystem storing the skills,
(2) providing read_file and related tools,
(3) providing specialized skill-access tools,
(4) processing the filesystem structure and providing the full content of the skills to the model up front.
And probably some other ways or hybrids.
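For instance, option (4) might look like the sketch below, assuming the same made-up .skills/ layout as the earlier example; this is one plausible reading, not anything mandated by the spec:

```python
from pathlib import Path

def inline_all_skills(skills_dir: str) -> str:
    """Option (4): no file tools at all -- every SKILL.md body is injected
    into the prompt up front. Simplest to implement, but the context cost
    grows with the total size of all installed skills."""
    sections = []
    for skill_md in sorted(Path(skills_dir).glob("*/SKILL.md")):
        sections.append(f"## Skill: {skill_md.parent.name}\n\n{skill_md.read_text()}")
    return "\n\n".join(sections)
```

A harness could switch between this and the lazy, tool-mediated style per model or per task without skill authors changing anything.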
> It increases benchmarks a few points now but what's the point in standardizing all this if it'll be obsolete next year?
Standardizing how skills are presented to LLM harnesses lets the harnesses incorporate optimization findings (which may be specific to models, or at least to model features like context size, and to use cases), and existing skills get the benefit of that for free.
xrd|26 days ago
I am very interested in finding ways to combine skills + local models + MCP + aider-ish tools to avoid using commercial LLM providers.
Is this a path to follow? Or something different?
postalcoder|26 days ago
https://xcancel.com/ben_burtenshaw
https://huggingface.co/blog/upskill
https://github.com/huggingface/upskill
bburtenshaw|25 days ago
We wrote a blog post on getting agents to write CUDA kernels and evaluating them: https://huggingface.co/blog/upskill