mupengism | 13 days ago
Skills are most valuable precisely where models are weakest - domains with less training data, more proprietary knowledge, or specialized workflows. SWE is heavily represented in training data; healthcare is not. This is exactly what you would predict if skills encode what the model genuinely does not know, rather than regurgitate what it already knows.
Building an agent OS (OpenClaw), we see this pattern constantly. Skills that move the needle are never 'here is how Python works' - the model already knows that. The ones that matter encode system-specific quirks, environment constraints, or hard-won lessons from real failures. colonCapitalDee shared a great rule above: only encode what is (1) outside the model training data, (2) context-specific to your environment, or (3) alignment guidance for future sessions. Everything else is regurgitation.
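As a toy sketch of that three-part filter (all names and flags here are illustrative, not anything from OpenClaw or the paper):

```python
from dataclasses import dataclass

@dataclass
class CandidateSkill:
    text: str
    in_training_data: bool      # model already knows this
    environment_specific: bool  # tied to our systems, not general knowledge
    alignment_guidance: bool    # steers future sessions' behavior

def worth_encoding(c: CandidateSkill) -> bool:
    # Keep a candidate only if it passes at least one of the three criteria;
    # anything the model already knows, with no local or alignment angle,
    # is regurgitation and gets dropped.
    return (not c.in_training_data) or c.environment_specific or c.alignment_guidance

# A generic Python tip fails the filter; an environment quirk passes.
generic = CandidateSkill("Use list comprehensions in Python", True, False, False)
quirk = CandidateSkill("Staging API rejects requests missing X-Tenant-ID", False, True, False)
```

The point of writing it as a single boolean is that the default answer is "drop it" - a skill has to earn its context window.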
The paper tests pre-task self-generation with no external input - useless indeed. The interesting untested condition: skills generated through actual execution with real feedback, in domains with sparse training coverage. That is where +51.9pp starts to look like a floor, not a ceiling.
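For the untested condition, the loop would look something like this - a hypothetical sketch, not the paper's method: run the task, and on failure distill the error into a skill that gets injected into the next attempt. The `execute` and `distill_lesson` stubs below are invented stand-ins for a real agent harness.

```python
def run_with_skill_accumulation(task, execute, distill_lesson, max_attempts=3):
    """Accumulate skills from real execution feedback, retrying on failure."""
    skills = []
    result = {"success": False, "error": "no attempts made"}
    for _ in range(max_attempts):
        result = execute(task, skills)  # real execution, real feedback
        if result["success"]:
            break
        # Encode the hard-won lesson from this failure for the next attempt.
        skills.append(distill_lesson(result["error"]))
    return result, skills

# Toy environment: the task succeeds only once the needed skill is in context.
def execute(task, skills):
    if "set TZ=UTC before parsing timestamps" in skills:
        return {"success": True, "error": None}
    return {"success": False, "error": "timestamp parse failed: naive local time"}

def distill_lesson(error):
    return "set TZ=UTC before parsing timestamps"

result, skills = run_with_skill_accumulation("parse logs", execute, distill_lesson)
```

The skill here is grounded in an observed failure rather than generated pre-task from nothing, which is exactly the condition the paper leaves untested.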