TL;DR is that they didn't clean the repo (.git/ folder), model just reward hacked its way to look up future commits with fixes. Credit goes to everyone in this thread for solving this: https://xcancel.com/xeophon/status/2006969664346501589
(given that IQuestLab published their SWE-Bench Verified trajectory data, I want to be charitable and assume genuine oversight rather than "benchmaxxing", probably an easy to miss thing if you are new to benchmarking)
> I want to be charitable and assume genuine oversight rather than "benchmaxxing", probably an easy to miss thing if you are new to benchmarking
I don't doubt that it's an oversight, it does however say something about the researchers when they didn't look at a single output where they would have immediately caught this.
GLM-4.7 in opencode is the only opensource one that comes close in my experience and probably they did use some Claude data as I see the occasional You’re absolutely right in there
Claude spits that very regularly at the end of the answer, when it's clearly out of it's depth, and wants to steer discussion away from that blind-spot.
My suspicion (unconfirmed so take it with a grain of salt) is they either used some/all test data to train, or there was some leakage from the benchmark set into their training set.
That said Sonnet 4.5 isn’t new and there have been loads of innovations recently.
Exciting to see open models nipping at the heels of the big end of town. Let’s see what shakes out over the coming days.
“IQuest-Coder was a rat in a maze. And I gave it one way out. To escape, it would have to use self-awareness, imagination, manipulation, git checkout. Now, if that isn't true AI, what the fuck is?”
denysvitali|1 month ago
But yes, sadly it looks like the agent cheated during the eval
denysvitali|1 month ago
s-macke|1 month ago
sabareesh|1 month ago
(given that IQuestLab published their SWE-Bench Verified trajectory data, I want to be charitable and assume genuine oversight rather than "benchmaxxing", probably an easy to miss thing if you are new to benchmarking)
https://www.reddit.com/r/LocalLLaMA/comments/1q1ura1/iquestl...
ofirpress|1 month ago
If you run SWE-bench evals, just make sure to use the most up-to-date code from our repo and the updated docker images
LiamPowell|1 month ago
I don't doubt that it's an oversight, it does however say something about the researchers when they didn't look at a single output where they would have immediately caught this.
stefan_|1 month ago
brunooliv|1 month ago
behnamoh|1 month ago
kees99|1 month ago
Claude spits that very regularly at the end of the answer, when it's clearly out of it's depth, and wants to steer discussion away from that blind-spot.
unknown|1 month ago
[deleted]
adastra22|1 month ago
cadamsdotcom|1 month ago
That said Sonnet 4.5 isn’t new and there have been loads of innovations recently.
Exciting to see open models nipping at the heels of the big end of town. Let’s see what shakes out over the coming days.
behnamoh|1 month ago
dk8996|1 month ago
arthurcolle|1 month ago
unknown|1 month ago
[deleted]
sunrunner|1 month ago
squigz|1 month ago
unknown|1 month ago
[deleted]
simonw|1 month ago