IQuest-Coder: A new open-source code model beats Claude Sonnet 4.5 and GPT 5.1 [pdf]

Better link: https://iquestlab.github.io/

But yes, sadly it looks like the agent cheated during the eval

According to https://github.com/IQuestLab/IQuest-Coder-V1/issues/14#issue... the result is still good after fixing the cheating problem. 76.2% (from 81.4%) which still beats Opus 4.5 (74.4%)!!

s-macke|1 month ago

The link didn’t get enough votes a few days ago.

sabareesh|1 month ago

TL;DR is that they didn't clean the repo (.git/ folder), model just reward hacked its way to look up future commits with fixes. Credit goes to everyone in this thread for solving this: https://xcancel.com/xeophon/status/2006969664346501589

(given that IQuestLab published their SWE-Bench Verified trajectory data, I want to be charitable and assume genuine oversight rather than "benchmaxxing", probably an easy to miss thing if you are new to benchmarking)

https://www.reddit.com/r/LocalLLaMA/comments/1q1ura1/iquestl...

ofirpress|1 month ago

As John says in that thread, we've fixed this issue in SWE-bench: https://xcancel.com/jyangballin/status/2006987724637757670

If you run SWE-bench evals, just make sure to use the most up-to-date code from our repo and the updated docker images

LiamPowell|1 month ago

> I want to be charitable and assume genuine oversight rather than "benchmaxxing", probably an easy to miss thing if you are new to benchmarking

I don't doubt that it's an oversight, it does however say something about the researchers when they didn't look at a single output where they would have immediately caught this.

stefan_|1 month ago

Never escaping the hype vendor allegations at SWEbench are they.

brunooliv|1 month ago

GLM-4.7 in opencode is the only opensource one that comes close in my experience and probably they did use some Claude data as I see the occasional You’re absolutely right in there

behnamoh|1 month ago

it's not even close to sonnet 4.5, let alone opus.

kees99|1 month ago

Do you see "What's your use-case" too?

Claude spits that very regularly at the end of the answer, when it's clearly out of it's depth, and wants to steer discussion away from that blind-spot.

unknown|1 month ago

[deleted]

adastra22|1 month ago

A 40B weight model that beats Sonnet 4.5 and GPT 5.1? Can someone explain this to me?

cadamsdotcom|1 month ago

My suspicion (unconfirmed so take it with a grain of salt) is they either used some/all test data to train, or there was some leakage from the benchmark set into their training set.

That said Sonnet 4.5 isn’t new and there have been loads of innovations recently.

Exciting to see open models nipping at the heels of the big end of town. Let’s see what shakes out over the coming days.

behnamoh|1 month ago

IQuest stands for it's questionable

dk8996|1 month ago

I would think they did some model pruning. There's some new methods.

arthurcolle|1 month ago

Agent hacked the harness

unknown|1 month ago

[deleted]

sunrunner|1 month ago

“IQuest-Coder was a rat in a maze. And I gave it one way out. To escape, it would have to use self-awareness, imagination, manipulation, git checkout. Now, if that isn't true AI, what the fuck is?”

squigz|1 month ago

This is a lie, so why is it still on the front page?

unknown|1 month ago

[deleted]

simonw|1 month ago

Has anyone run this yet, either on their own machine or via a hosted API somewhere?

49 comments