[I'm one of the co-creators of SWE-bench] The team managed to improve on the already very strong o3 results on SWE-bench, but it's interesting that we're just seeing an improvement of a few percentage points. I wonder if getting to 85% from 75% on Verified is going to take as long as it took to get from 20% to 75%.
Snuggly73|9 months ago
Look at the results from Multi-SWE-bench - https://multi-swe-bench.github.io/#/
SWE-PolyBench - https://amazon-science.github.io/SWE-PolyBench/
Kotlin-bench - https://firebender.com/leaderboard
ofirpress|9 months ago
We also have SWE-bench Multimodal, which adds a twist I haven't seen elsewhere: https://www.swebench.com/multimodal.html