lieret|7 months ago
In 2024, we developed SWE-bench and SWE-agent at Princeton University and helped kickstart the coding agent revolution.
Back then, LMs were optimized to be great at chatting, but not much else. This meant that agent scaffolds had to get very creative (and complicated) to make LMs perform useful work.
But in 2025, LMs are actively optimized for agentic coding, and we ask:
*What is the simplest coding agent that could still score near SotA on the benchmarks?*
*Turns out, it just requires 100 lines of code!*
And this system still *resolves 65% of all GitHub issues in the SWE-bench verified benchmark* with Sonnet 4 (for comparison, when Anthropic launched Sonnet 4, they reported 70% with their own scaffold that was never made public).
Honestly, we're all pretty stunned ourselves; we've now spent more than a year developing SWE-agent, and would not have thought that such a small system could perform nearly as well.
I'll link to the project below (all open-source, of course). The hello world example is incredibly short & simple (and literally what gave us the 65%). But it is also meant as a serious command-line tool + research project, so we provide a Claude Code-style UI & some utilities on top of that.
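To give a feel for how little machinery this takes, here is a rough sketch of such a loop (not the actual mini-SWE-agent source; the system prompt, the model string, and the "done" convention are simplified for illustration). The whole scaffold is: ask the model for a bash command, run it, feed the output back.

```python
import subprocess
from litellm import completion  # any chat-completion client would work here

SYSTEM_PROMPT = (
    "You are a coding agent working in a git repository. "
    "Reply with exactly ONE bash command per turn and nothing else. "
    "When the task is solved, reply with: echo TASK_DONE"
)

def run_agent(task: str, model: str = "anthropic/claude-sonnet-4-20250514", max_steps: int = 50) -> None:
    """Minimal observe-act loop: the model's entire reply is treated as a bash command."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": task},
    ]
    for _ in range(max_steps):
        # Ask the model for the next action.
        reply = completion(model=model, messages=messages).choices[0].message.content.strip()
        messages.append({"role": "assistant", "content": reply})
        if reply == "echo TASK_DONE":
            break
        # The only "tool": run the command in a shell and feed the output back as the observation.
        result = subprocess.run(reply, shell=True, capture_output=True, text=True, timeout=300)
        observation = f"exit code: {result.returncode}\nstdout:\n{result.stdout}\nstderr:\n{result.stderr}"
        messages.append({"role": "user", "content": observation})
```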
We have some team members from Princeton/Stanford here today, let us know if you have any questions/feedback :)
Oras|7 months ago
Is there an option to learn from mistakes? Most coding agents I've tried, including the Sonnet 4-based one, will make the same mistake again and again in a new chat.
It would be great to have the agent add a memory (even a local one) to avoid repeating mistakes, check for new versions of libraries, and write a list of tasks before execution (similar to Kiro and Trae SOLO).
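A minimal sketch of what such a local memory could look like, assuming the bash-loop scaffold sketched above (this is only an illustration of the suggestion, not an existing mini-SWE-agent feature; the file name and helpers are hypothetical):

```python
from pathlib import Path

# Plain-text "lessons" file kept in the repo, prepended to the system prompt
# so notes about past mistakes survive across chats.
LESSONS_FILE = Path(".agent_lessons.md")

def load_lessons() -> str:
    """Return accumulated lessons, or an empty string on the first run."""
    return LESSONS_FILE.read_text() if LESSONS_FILE.exists() else ""

def save_lesson(lesson: str) -> None:
    """Append one lesson as a bullet point."""
    with LESSONS_FILE.open("a") as f:
        f.write(f"- {lesson}\n")

# Usage inside the agent loop:
#   system = SYSTEM_PROMPT + "\n\nLessons from previous runs:\n" + load_lessons()
# and at the end of a run, ask the model to summarize what went wrong and
# pass that summary to save_lesson().
```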
scottyeager|7 months ago
I was just starting to study coding agent implementation, specifically with tool use. Seeing the insight in the README that `bash` is all a modern LLM needs to solve coding tasks was very interesting, since the trend seems to be solidly toward tools.
Being able to read the entire agent code nearly on a single screen is very instructive and inspiring to start hacking.
One thing I'm curious about is API-calling efficiency. Did you happen to compare request counts or token consumption of the mini agent versus the full-sized one? Is that data generally available for the SWE-bench results?