Aurornis|1 day ago
I've been playing with Qwen3-Coder-Next and the Qwen3.5 models since they were each released.
They are impressive, but they are not performing at Sonnet 4.5 level in my experience.
I have observed that they're configured to be very tenacious. If you carefully constrain the goal with tests they need to pass and frame it in a way that keeps them on track, they will just keep trying things over and over. They'll "solve" a lot of these problems the way a broken clock is right twice a day, but there's a lot of fumbling to get there.
That said, they are impressive for open-source models. It's amazing what you can do self-hosted now. Just don't believe the hype that these are Sonnet 4.5-level models, because you're going to be very disappointed once you get into anything complex.
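For illustration, a minimal sketch of that test-gated loop, assuming a pytest suite as the constraint and a hypothetical run_model() helper standing in for whatever agent or API call applies the model's edits:

    import subprocess

    def run_model(prompt: str) -> None:
        """Hypothetical stand-in: call the model and apply its edits."""
        raise NotImplementedError  # wire up your agent or API here

    def solve_with_tests(task: str, max_attempts: int = 10) -> bool:
        """Retry until the test suite passes or the attempt budget runs out."""
        feedback = ""
        for _ in range(max_attempts):
            run_model(task + feedback)
            result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
            if result.returncode == 0:
                return True  # the constrained goal was reached
            # Feed failures back so the model stays on track on the next try.
            feedback = "\n\nFailing test output:\n" + result.stdout[-4000:]
        return False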
kir-gadjello|23 hours ago
I'm working on a pretty complex Rust codebase right now, with hundreds of integration tests and nontrivial concurrency, and stepfun powers through.
I have no relation to stepfun, and I'm saying this purely out of deep respect for the team that managed to pack this performance into a 196B/11B-active envelope.
Aurornis|7 hours ago
To be clear, I never said they weren't strong or useful. I use them for some small tasks too.
I said they're not equivalent to SOTA models from 6 months ago, which is what is always claimed.
Then it turns into a motte-and-bailey game where that argument is replaced with the easier-to-defend claim that they're useful for open-weights models. I'm not disagreeing with that part. I disagree with the first assertion that they're equivalent to Sonnet 4.5.
lend000|18 hours ago
I like this benchmark, which pits models against one another in competitive environments and seems like it can't really be gamed: https://gertlabs.com
Aurornis|7 hours ago
That's exactly what I said, though. The headline we're commenting under claims they're Sonnet 4.5-level, but they're not.
I don’t disagree that they’re powerful for open models. I’m pointing out that anyone reading these headlines who expects a cheap or local Sonnet 4.5 is going to discover that it’s not true.
wolvoleo|23 hours ago
I bet the cloud ones are doing it a lot more, because they can also affect the runtime side, which the open-source ones can't.
rudhdb773b|22 hours ago
If the tests haven't been published anywhere and are sufficiently different from standard problems, I would think the benchmarks would be robust to intentional over-optimization.
Edit: These look decent and generally match my expectations:
https://www.apex-testing.org/
chaboud|22 hours ago
Goodhart's law shows up in people, in system design, in processor design, in education...
Models are going to be overfit to the tests unless scruples or practical realities intervene. It's a tale as old as machine learning.
spwa4|10 hours ago
But there's a problem with that: the existence of the statistical measure is itself a link between all those individual facts. In other words: if there is ANY causal link between the statistical measure and the events being measured, the measure has become bullshit, because the law of large numbers no longer applies.
So let's put it in practice. Say there's a running contest, and you display the minimum, maximum, and average time of all runners who have had their turns. We all know what happens: the average trends up. And yet that's exactly what statistics guarantees shouldn't happen; with independent runners, a new result should move the average up or down with roughly 50% odds. The reason is that showing the average changes the behavior of the next runner.
This means, of course, that basing a decision on something as trivial as last year's average running time is only mathematically defensible ONCE. The second time, the average is already biased, and you're basing your decision on wrong information.
But of course, not only will most people deny this is the case; it's also how 99.9% of human policy making works. And it's mathematically wrong! Simple, fast... and wrong.
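A toy simulation of this, under the assumed behavior that each runner sees the displayed average and coasts in just above it:

    import random

    def average_time(n_runners=1000, feedback=False, seed=1):
        """Compare independent runners with runners who pace off the display."""
        rng = random.Random(seed)
        times = []
        for _ in range(n_runners):
            if feedback and times:
                avg = sum(times) / len(times)
                # Assumed behavior: the runner sees the average and does just
                # enough, finishing a little slower than the average so far.
                t = rng.gauss(avg + 1.0, 2.0)
            else:
                t = rng.gauss(60.0, 5.0)  # independent draw; true mean 60 min
            times.append(t)
        return sum(times) / len(times)

    print("independent:", round(average_time(), 1))                 # ~60
    print("with feedback:", round(average_time(feedback=True), 1))  # drifts up

With independent draws the average hovers near the true mean; once the displayed number feeds back into behavior, it drifts, precisely because the independence the law of large numbers needs is gone.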
crystal_revenge|22 hours ago
I’ve switched to using Kimi 2.5 for all of my personal usage and am far from disappointed.
Aside from being much cheaper than the big names (yes, I'm not running it locally, but I like that I could), it just works and isn't a sycophant. Nice to get coding problems solved without any "That's a fantastic idea!"/"great point" comments.
At least with Kimi, my understanding is that beating benchmarks was secondary to good developer experience.
amelius|1 day ago
And could quantization maybe partially explain the worse-than-expected results?
TrainedMonkey|23 hours ago
I have two comments of my own to add to that. First, there is problem alignment at play: the benchmarks are mostly self-contained problems with well-defined solutions and specific prompt language, while human tasks are open-ended, with messy prompts and a lot of steering. Second, it would be interesting to test older models on brand-new benchmarks to see how those compare.
Aurornis|23 hours ago
The benchmarks are public. They're guaranteed to be in the training sets by now. So the benchmarks are no longer an indicator of general performance because the specific tasks have been seen before.
> And could quantization maybe partially explain the worse-than-expected results?
You can use the models through various providers on OpenRouter cheaply without quantization.
girvo|23 hours ago
Quantisation doesn't help, but even running full-fat versions of these models through various cloud providers, they still don't match Sonnet in actual agentic coding use, at least in my experience.
noosphr|23 hours ago
The only benchmarks worth anything are dynamic ones that can be scaled up.
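One toy sketch (mine, not an existing harness) of what that can look like: procedurally generated tasks with known ground truth, where a depth parameter scales difficulty and unseeded generation means there is no fixed test set to leak into training data:

    import random

    def make_task(rng, depth):
        """Generate a fresh arithmetic expression and its ground-truth value."""
        if depth == 0:
            n = rng.randint(1, 99)
            return str(n), n
        op = rng.choice(["+", "-", "*"])
        left, lv = make_task(rng, depth - 1)
        right, rv = make_task(rng, depth - 1)
        value = {"+": lv + rv, "-": lv - rv, "*": lv * rv}[op]
        return f"({left} {op} {right})", value

    rng = random.Random()  # unseeded: every run produces unseen instances
    expr, answer = make_task(rng, depth=4)
    print(f"Evaluate: {expr}")        # prompt for the model under test
    print(f"Ground truth: {answer}")  # used to score the model's reply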
baq|14 hours ago
That said, Sonnet 4.5 is not a good model today, March 1st, 2026 (it blew my mind on its release day, September 29th, 2025).
ekianjo|19 hours ago
There is nothing open "source" about them. They are open weights, that's all.