I’ve been running a bunch of coding agents on benchmarks recently as part of consulting, and this is actually much more impressive than it seems at first glance.
71.2% puts it at 5th, which is 4 points below the leader (four points is a lot) and just over 1 point lower than Anthropic’s own submission for Claude Sonnet 4 - the same model these guys are running.
But the top rated submissions aren’t running production products. They generally have extensive scaffolding or harnesses that were built *specifically for SWE bench*, which kind of defeats the whole purpose of the benchmark.
Take for example Refact, which is at #2 with 74.4%. They built a 2k-line framework around their agent specifically for SWE bench (https://github.com/smallcloudai/refact-bench/). It’s pretty elaborate, orchestrating multiple agents, with a debug agent that kicks in if the main agent fails. The debug agent analyzes the failure and gives insights to the main agent, which tries again, so it’s effectively multiple attempts per problem.
If the results can be reproduced “out-of-the-box” with their coding agent like they claim, it puts it up there as one of the top 2-3 CLI agents available right now.
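To make the "debug agent kicks in, main agent tries again" loop concrete, here is a rough Python sketch of the general pattern. It is not Refact's actual code: `solve_with_debug_retries`, the `Attempt` record, and the toy agents are all hypothetical names made up for illustration, and `check` stands in for whatever pass/fail signal the harness has available (an assumption, since the agent cannot see the benchmark's hidden tests).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Attempt:
    patch: str
    passed: bool
    log: str


def solve_with_debug_retries(
    issue: str,
    main_agent: Callable[[str, List[str]], str],
    debug_agent: Callable[[str, str, str], str],
    check: Callable[[str], Tuple[bool, str]],
    max_attempts: int = 3,
) -> List[Attempt]:
    """Run the main agent, and on failure feed debug insights into the next try."""
    hints: List[str] = []
    attempts: List[Attempt] = []
    for _ in range(max_attempts):
        patch = main_agent(issue, hints)
        passed, log = check(patch)
        attempts.append(Attempt(patch, passed, log))
        if passed:
            break
        # The debug agent only runs after a failed attempt.
        hints.append(debug_agent(issue, patch, log))
    return attempts


# Toy stand-ins (hypothetical) so the loop can be run without any LLM.
def toy_main_agent(issue: str, hints: List[str]) -> str:
    return f"fix-v{len(hints)}"


def toy_debug_agent(issue: str, patch: str, log: str) -> str:
    return f"{patch} failed: {log}"


def toy_check(patch: str) -> Tuple[bool, str]:
    ok = patch == "fix-v1"
    return ok, "ok" if ok else "repro script still fails"


if __name__ == "__main__":
    for a in solve_with_debug_retries("issue text", toy_main_agent, toy_debug_agent, toy_check):
        print(a)
```

Each pass through the loop is a full attempt at the problem, which is why a setup like this amounts to multiple attempts per problem even though no human picks between them.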
Is there something in this multi-agent approach that makes the setup more specific to just the test at hand and less general to real engineering tasks? If not, then this multi-agent system will just become what you get out of the box in a future product. Multiple attempts per problem (as long as there's no human intervention or selection between them) is a perfectly fine approach for agents because that's not an issue from the perspective of an engineer using the product. A single agent is already a multi-step usage of LLMs and it sounds like this is just another meta level of that.
We need some international body to start running these tests… I just can’t trust these numbers any longer. We need a platform for this, something where we can at least get some peer review.
I’m working on this at STAC Research and looking to connect with others interested in helping. Key challenges are ensuring impartiality (and keeping it that way), making benchmarks ungameable, and guaranteeing reproducibility. We’ve done similar work in finance and are now applying the same principles to AI.
I've been using Warp for the past few weeks and it's been incredibly impressive compared to other agentic coding services/platforms. Curious how Qodo stacks up.
When I tried warp I was convinced that was where the industry was going (agents as terminal replacement), but it felt a bit too heavy to me so I haven’t been using it lately. Still think all things will converge on terminal and browser replacement.
I feel like the bash-only SWE Bench Verified (a.k.a. model + mini-swe-agent) is the closest thing to measuring the inherent ability of the model vs. the scaffolding.
If it's really better than Claude Code while using Sonnet 4.0, then I'd pay a monthly fee for it, but only if I can use my Claude subscription the same way Claude Code does.
I do not want to pay API charges or be limited to a fixed number of "credits" per month.
Slick. This applies to the new Qodo Command CLI, yes?
I updated to the latest version last night. Enjoyed seeing the process permission toggle (rwx). Was a refreshing change to keep the security minded folks less in panic with all the agentic coding adoptions :-)
I would be more interested in Qodo's performance on the swe-bench-multilingual benchmark. Swe-bench-verified only includes bugs related to Python breakages.
The best submission on swe-bench-multilingual is Claude 3.7 Sonnet, which solves ~43% of the issues in the dataset.
Does anyone have a benchmark on the effectiveness of using embeddings for mapping bug reports to code files as opposed to extensive grepping as Qodo, Cursor and a number of tools I use do to localize faults?
If Qodo are reading this: please introduce a plan that isn't for teams or enterprise. A "pro" plan for individuals who want more than 250 credits per month.
gronky_|6 months ago
thinkingtoilet|6 months ago
https://en.wikipedia.org/wiki/Goodhart%27s_law
energy123|6 months ago
terminalshort|6 months ago
oblio|6 months ago
eddd-ddde|6 months ago
szundi|6 months ago
Roritharr|6 months ago
ai-christianson|6 months ago
I.e. the agent cannot even know which tests are failing.
It has to both fix the issue based just on the issue text and fix it in the specific way the unit test, which it cannot see, expects.
For this reason I find the benchmark a little disconnected from the reality of software engineering.
khalic|6 months ago
redman25|6 months ago
Another approach might be the LiveBench approach where new tests are released on a regular basis.
jcorco|6 months ago
mupuff1234|6 months ago
I could understand focusing on a niche business use case, but coding is a main focus of the foundation models themselves.
M4R5H4LL|6 months ago
orangebread|6 months ago
lightbendover|6 months ago
itamarcode|6 months ago
I think the next step is getting an official "checked" mark from the SWE bench team.
whymauri|6 months ago
https://github.com/SWE-agent/mini-swe-agent
raylad|6 months ago
esafak|6 months ago
lirantal|6 months ago
zuzuen_1|6 months ago
zuzuen_1|6 months ago
afro88|6 months ago
OldGreenYodaGPT|6 months ago
OldfieldFund|6 months ago
khalic|6 months ago
lightbendover|6 months ago
[deleted]
b0a04gl|6 months ago
[deleted]
rs186|6 months ago
https://news.ycombinator.com/item?id=44833929, my comment https://news.ycombinator.com/item?id=44835939