top | item 44838879

GPT-5 on SWE-bench: Cost and performance deep-dive

4 points | lieret | 6 months ago | mini-swe-agent.com

3 comments


lieret | 6 months ago

We evaluated the new GPT models with a minimal agent on SWE-bench Verified. GPT-5 scores 65%, mini 60%, nano 35%. Still behind Opus 4.1 (68%), on par with Sonnet 4 (65%). But a lot cheaper, especially mini!

Cost is tricky to compare with agents, because agents succeed fast but fail slowly: if an agent doesn't solve an instance, it just keeps trying until it either succeeds or hits a run time limit, so failed runs tend to burn the full budget. And that's (almost) what happens in practice.
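A toy sketch of why this skews average cost (all numbers below are made up for illustration, not the measured costs from the post): a lower solve rate means paying the expensive "fail slowly" path more often.

```python
# Toy model: failed runs cost more than successful ones, because they
# run until they hit the step / run-time limit. Numbers are hypothetical.
def expected_cost(p_success, cost_success, cost_failure):
    """Average cost per instance given a solve rate and per-outcome costs."""
    return p_success * cost_success + (1 - p_success) * cost_failure

# Same per-run costs, different solve rates:
strong = expected_cost(0.65, cost_success=0.20, cost_failure=1.00)
weak = expected_cost(0.35, cost_success=0.20, cost_failure=1.00)
print(f"strong model: ${strong:.2f}/instance")  # failures are rare
print(f"weak model:   ${weak:.2f}/instance")    # failures dominate
```

So two models with identical per-token pricing can end up with very different benchmark bills purely because of their solve rates.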

But even so, it's very clear that

1. GPT-5 is cheaper than Sonnet 4.

2. GPT-5-mini is _incredibly_ cheap for what it provides: you sacrifice only some 5 percentage points, but end up paying maybe 1/5th of the total cost.

All of the code to reproduce our numbers is open source. There's a box at the bottom of the linked post with the exact command to run.

Also very happy to answer questions here!