top | item 46682777

(no title)

robbies | 1 month ago

What do you like to use instead? I’ve used the aider leaderboard a couple of times, but it didn’t really stick with me.

swe-REbench is interesting. The "RE" stands for re-testing after the models were launched. They periodically gather new issues from live repos on GitHub, and have a slider where you can see the scores for all issues in a given interval. So if you wait ~2 months after a model launches, you can see how it performs on real-world issues that are new to it.

It's still not as accurate as benchmarking on your own workflows, but it's better than the original benchmark, or any other public benchmark.
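To make the interval idea concrete, here's a rough sketch of what the slider is doing as I understand it: only score issues created inside a chosen date window, e.g. after the model's release. The records, the `resolve_rate` helper, and the release date below are all made up for illustration; this is not the benchmark's actual code or data.

```python
from datetime import date

# Hypothetical records: (issue creation date, whether the model resolved it).
# These values are invented for illustration, not taken from any leaderboard.
results = [
    (date(2025, 1, 10), True),
    (date(2025, 2, 20), True),
    (date(2025, 4, 5), False),
    (date(2025, 4, 18), True),
    (date(2025, 5, 2), False),
]


def resolve_rate(results, start, end):
    """Fraction of issues created in [start, end] that were resolved.

    Returns None when no issue falls inside the window.
    """
    window = [ok for created, ok in results if start <= created <= end]
    if not window:
        return None
    return sum(window) / len(window)


# Assumed release date for a hypothetical model.
model_release = date(2025, 3, 1)

# Score over every issue, including ones the model may have seen in training.
all_time = resolve_rate(results, date.min, date.max)

# Score only over issues created after the release, i.e. new to the model.
post_release = resolve_rate(results, model_release, date.max)

print(f"all issues:   {all_time:.2f}")
print(f"post-release: {post_release:.2f}")
```

The gap between the two numbers is the thing to look at: a big drop on post-release issues suggests the all-time score was helped by issues the model had already seen.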

khimaros|1 month ago

Terminal Bench 2.0