top | item 47041939

(no title)

lmeyerov | 13 days ago

We had a measurable shift when we started doing ai-coding loops driven by evals. By definition, the additions make the numbers go up-and-to-the-right. It's the epitomy of "you get what you measure" :)

Chaos Congress talk on this from a couple months ago, jump to the coding loops part: https://media.ccc.de/v/39c3-breaking-bots-cheating-at-blue-t... . The talk focuses mostly on MCPs, but we now use the same flow for Skills.

This kind of experience makes me more hesitant to take on plugin and skill repos lacking evals or equivalent proving measurable quality over what the LLM knows and harness can handle. Generally a small number of things end up mattering majorly, but they end up being pivotal to get right, and the rest is a death by a thousand cuts.

discuss

No comments yet.