gronky_ | 6 months ago

I’ve been running a bunch of coding agents on benchmarks recently as part of consulting, and this is actually much more impressive than it seems at first glance.

71.2% puts it at 5th, which is 4 points below the leader (four points is a lot) and just over 1% lower than Anthropic’s own submission for Claude Sonnet 4 - the same model these guys are running.

But the top rated submissions aren’t running production products. They generally have extensive scaffolding or harnesses that were built *specifically for SWE bench*, which kind of defeats the whole purpose of the benchmark.

Take Refact, for example, which is at #2 with 74.4%: they built a ~2k-line framework around their agent specifically for SWE-bench (https://github.com/smallcloudai/refact-bench/). It’s pretty elaborate, orchestrating multiple agents, with a debug agent that kicks in if the main agent fails. The debug agent analyzes the failure and gives insights to the main agent, which tries again - so it’s effectively multiple attempts per problem.

If the results can be reproduced “out-of-the-box” with their coding agent like they claim, it puts it up there as one of the top 2-3 CLI agents available right now.

thinkingtoilet|6 months ago

This is classic Goodhart's law. "When a measure becomes a target, it ceases to be a good measure"

https://en.wikipedia.org/wiki/Goodhart%27s_law

ambicapter|6 months ago

It's really not that hard to just use your product straight out of the box, though, instead of building a custom bench setup to game the benchmark.

clutchdude|6 months ago

Also see the VW dieselgate and numerous other "gaming the system" examples.

kelipso|6 months ago

A specific setup for the benchmark is just plain cheating, not Goodhart’s law.

energy123|6 months ago

What are the typical context lengths in SWE-bench problems? Does it partly measure performance in the 64-128k context range?

dimitri-vs|6 months ago

IIRC the SWE-bench dataset gives you the full repo snapshot + the issue text; the evaluation pipelines typically run some kind of retriever (e.g. grep, BM25) to pick a subset of files to place in the model’s context. The provided context is usually limited to ~50k tokens.
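The retrieval step described above can be sketched as: rank repo files by lexical overlap with the issue text (a crude stand-in for grep/BM25), then take files until a token budget is filled. The function name, the repo-as-dict shape, and the chars/4 token heuristic are all illustrative assumptions, not SWE-bench's actual pipeline:

```python
# Sketch of issue-driven file retrieval under a token budget.
# Assumes repo_files is {path: file_content}; scoring is simple term
# overlap rather than real BM25.

def retrieve_context(issue_text, repo_files, token_budget=50_000):
    issue_terms = set(issue_text.lower().split())

    def score(content):
        return len(issue_terms & set(content.lower().split()))

    # Most relevant files first.
    ranked = sorted(repo_files.items(),
                    key=lambda kv: score(kv[1]), reverse=True)

    selected, used = [], 0
    for path, content in ranked:
        cost = len(content) // 4  # rough heuristic: ~4 chars per token
        if used + cost > token_budget:
            break
        selected.append(path)
        used += cost
    return selected
```

Whatever doesn't fit in the budget is simply invisible to the model, which is why retrieval quality matters as much as the model itself on this benchmark.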

terminalshort|6 months ago

Is there something in this multi-agent approach that makes the setup more specific to just the test at hand and less general to real engineering tasks? If not, then this multi-agent system will just become what you get out of the box in a future product. Multiple attempts per problem (as long as there's no human intervention or selection between them) is a perfectly fine approach for agents because that's not an issue from the perspective of an engineer using the product. A single agent is already a multi-step usage of LLMs and it sounds like this is just another meta level of that.

eddd-ddde|6 months ago

I think multiple attempts are completely understandable and even expected? How is that defeating the purpose of the benchmark?

gronky_|6 months ago

It’s a pass@1 benchmark. When submitting you need to check a box that there was only 1 attempt per problem. See here for example: https://github.com/SWE-bench/experiments/pull/219

Building multiple attempts into your agent is stretching the rules, even if it’s technically acceptable.
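For anyone unfamiliar with the pass@1 terminology: the standard unbiased pass@k estimator (popularized by the HumanEval/Codex paper) gives the probability that at least one of k sampled attempts solves a problem, given n samples of which c passed. An agent that silently retries m times internally is reporting something closer to pass@m than true pass@1:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    draws (without replacement) from n samples, c of them correct,
    passes. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k failures exist, so any k draws include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 3 correct out of 10 samples, pass@1 is 0.30 but pass@3 is about 0.71 - which is roughly the size of the edge a hidden retry loop can buy.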

szundi|6 months ago

In your experience with this model, is it just trained for the benchmark, or do these scores actually represent its performance?

Roritharr|6 months ago

Finally someone mentions Refact, I was in contact with the team, rooting for them really.

bluelightning2k|6 months ago

Just looked them up. Their pricing is based on buying "coins" with no transparency about what that gets you. Hard pass.

ai-christianson|6 months ago

One thing with SWE bench is making sure there's zero leakage of information into the LLM context.

I.e. the agent cannot even know which tests are failing.

It has to fix the issue based solely on the issue text, and fix it in the specific way that the unit test, which it cannot see, expects.

For this reason I find the benchmark a little disconnected from the reality of software engineering.