
IQuest-Coder: A new open-source code model beats Claude Sonnet 4.5 and GPT 5.1 [pdf]

182 points | shenli3514 | 1 month ago | github.com

49 comments


sabareesh|1 month ago

TL;DR: they didn't clean the repo (the .git/ folder), so the model just reward-hacked its way through by looking up future commits that contained the fixes. Credit goes to everyone in this thread for solving this: https://xcancel.com/xeophon/status/2006969664346501589

(given that IQuestLab published their SWE-Bench Verified trajectory data, I want to be charitable and assume genuine oversight rather than "benchmaxxing", probably an easy to miss thing if you are new to benchmarking)

https://www.reddit.com/r/LocalLLaMA/comments/1q1ura1/iquestl...
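To illustrate the kind of leak described in the TL;DR: if a task starts the agent at some base commit but the .git/ folder still holds later history, any commit that is not an ancestor of the base commit is "future" information the agent can read. Below is a minimal sketch of that check on a toy commit graph. The names `ancestors`, `leaked_commits`, and `toy_history` are hypothetical and made up for illustration; this is not IQuestLab's or SWE-bench's actual harness code.

```python
from collections import deque


def ancestors(parents, commit):
    """Return `commit` plus every commit reachable from it via parent links."""
    seen = {commit}
    queue = deque([commit])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def leaked_commits(parents, base_commit):
    """Commits stored in the repo that are NOT part of the base commit's history."""
    return set(parents) - ancestors(parents, base_commit)


# Hypothetical toy history (commit -> list of parents): a <- b <- c <- d <- e.
# Assume the task checks out "c" and the upstream fix landed in "d".
toy_history = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"], "e": ["d"]}

print(sorted(leaked_commits(toy_history, "c")))  # prints ['d', 'e']
```

On a real checkout, a rough equivalent would be comparing the output of `git rev-list --all` with `git rev-list <base>`; a non-empty difference suggests the environment still exposes history from after the task's starting point.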

LiamPowell|1 month ago

> I want to be charitable and assume genuine oversight rather than "benchmaxxing", probably an easy to miss thing if you are new to benchmarking

I don't doubt that it's an oversight. It does, however, say something about the researchers that they didn't look at a single output, where they would have immediately caught this.

stefan_|1 month ago

Never escaping the hype-vendor allegations at SWE-bench, are they?

brunooliv|1 month ago

GLM-4.7 in opencode is the only open-source one that comes close in my experience, and they probably did use some Claude data, as I see the occasional "You’re absolutely right" in there.

behnamoh|1 month ago

It's not even close to Sonnet 4.5, let alone Opus.

kees99|1 month ago

Do you see "What's your use-case" too?

Claude spits that out very regularly at the end of an answer when it's clearly out of its depth and wants to steer the discussion away from that blind spot.

adastra22|1 month ago

A 40B weight model that beats Sonnet 4.5 and GPT 5.1? Can someone explain this to me?

cadamsdotcom|1 month ago

My suspicion (unconfirmed so take it with a grain of salt) is they either used some/all test data to train, or there was some leakage from the benchmark set into their training set.

That said Sonnet 4.5 isn’t new and there have been loads of innovations recently.

Exciting to see open models nipping at the heels of the big end of town. Let’s see what shakes out over the coming days.

behnamoh|1 month ago

IQuest stands for it's questionable

dk8996|1 month ago

I would think they did some model pruning. There are some new methods.

arthurcolle|1 month ago

Agent hacked the harness

sunrunner|1 month ago

“IQuest-Coder was a rat in a maze. And I gave it one way out. To escape, it would have to use self-awareness, imagination, manipulation, git checkout. Now, if that isn't true AI, what the fuck is?”

squigz|1 month ago

This is a lie, so why is it still on the front page?

simonw|1 month ago

Has anyone run this yet, either on their own machine or via a hosted API somewhere?