gronky_ | 6 months ago
71.2% puts it at 5th, which is 4 points below the leader (4 points is a lot) and just over 1% lower than Anthropic’s own submission for Claude Sonnet 4 - the same model these guys are running.
But the top rated submissions aren’t running production products. They generally have extensive scaffolding or harnesses that were built *specifically for SWE bench*, which kind of defeats the whole purpose of the benchmark.
Take Refact, for example, which is at #2 with 74.4%: they built a ~2k-line framework around their agent specifically for SWE bench (https://github.com/smallcloudai/refact-bench/). It’s pretty elaborate, orchestrating multiple agents, with a debug agent that kicks in if the main agent fails. The debug agent analyzes the failure and feeds insights back to the main agent, which tries again - so it’s effectively multiple attempts per problem.
If the results can be reproduced “out-of-the-box” with their coding agent like they claim, it puts it up there as one of the top 2-3 CLI agents available right now.
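The multi-attempt scaffolding described above can be sketched roughly like this - a minimal, hypothetical loop (these function names are illustrative, not Refact's actual code), where a debug agent turns each failure into hints for the next try:

```python
def solve_with_debug_loop(problem, main_agent, debug_agent, max_attempts=3):
    """Hypothetical sketch of a main-agent/debug-agent retry loop.

    main_agent(problem, hints) -> {"tests_pass": bool, "patch": ..., "log": ...}
    debug_agent(problem, log)  -> a hint string for the next attempt
    """
    hints = []
    for _ in range(max_attempts):
        result = main_agent(problem, hints)
        if result["tests_pass"]:
            return result["patch"]
        # On failure, the debug agent analyzes the transcript and
        # produces guidance that is fed into the next attempt.
        hints.append(debug_agent(problem, result["log"]))
    return None  # all attempts exhausted
```

Each iteration is a full solve attempt, which is why this kind of harness is effectively pass@k dressed up as a single run.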
thinkingtoilet|6 months ago
https://en.wikipedia.org/wiki/Goodhart%27s_law
whymauri|6 months ago
https://huggingface.co/datasets/princeton-nlp/SWE-bench_Veri...
It's up to your retrieval system/model to selectively hunt for relevant context. Here are a few critiques of the benchy:
https://x.com/brhydon/status/1953648884309536958
gronky_|6 months ago
Building multiple attempts into your agent is stretching the rules, even if it's technically acceptable.
ai-christianson|6 months ago
I.e. the agent cannot even see which tests are failing.
It has to fix the issue based on the issue text alone, and fix it in the specific way the unit tests - which it cannot see - expect.
For this reason I find the benchmark a little disconnected from the reality of software engineering.
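The "blind" grading setup can be made concrete with a small sketch. The field names loosely follow SWE-bench's FAIL_TO_PASS / PASS_TO_PASS convention; `run_tests` is a hypothetical helper standing in for the real harness:

```python
def grade(instance, agent, run_tests):
    """Sketch of hidden-test grading: the agent sees only the issue text,
    while scoring runs tests the agent never saw.

    agent(issue_text)              -> a patch
    run_tests(repo, patch, tests)  -> list of failing test names
    """
    patch = agent(instance["issue_text"])  # no tests visible to the agent
    failing = run_tests(
        instance["repo"],
        patch,
        instance["FAIL_TO_PASS"] + instance["PASS_TO_PASS"],
    )
    return not failing  # resolved only if every hidden test passes
```

The agent's patch has to land on exactly the behavior the hidden tests check, which is the disconnect being described: real-world fixes get judged by a test suite you can at least run.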