sabareesh | 1 month ago
(Given that IQuestLab published their SWE-Bench Verified trajectory data, I want to be charitable and assume genuine oversight rather than "benchmaxxing". It's probably an easy thing to miss if you are new to benchmarking.)
https://www.reddit.com/r/LocalLLaMA/comments/1q1ura1/iquestl...
ofirpress|1 month ago
If you run SWE-bench evals, just make sure to use the most up-to-date code from our repo and the updated Docker images.
LiamPowell|1 month ago
I don't doubt that it's an oversight. It does, however, say something about the researchers that they didn't look at a single output, where they would have immediately caught this.
domoritz|1 month ago
alyxya|1 month ago
stefan_|1 month ago