Aurornis | 1 day ago

If you're new to this: all of the open source models are playing benchmark optimization games. Every new open weight model comes with promises of being as good as something SOTA from a few months ago, but then they always disappoint in actual use.

I've been playing with Qwen3-Coder-Next and the Qwen3.5 models since they were each released.

They are impressive, but they are not performing at Sonnet 4.5 level in my experience.

I have observed that they're configured to be very tenacious. If you can carefully constrain the goal with some tests they need to pass and frame it in a way to keep them on track, they will just keep trying things over and over. They'll "solve" a lot of these problems in the way that a broken clock is right twice a day, but there's a lot of fumbling to get there.
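The "constrain with tests and let it keep trying" pattern above can be sketched in a few lines (my illustration, not anyone's actual harness; `run_tests` and `propose_fix` are hypothetical stand-ins for a real test runner and a real model call):

```python
from typing import Callable

def run_until_green(run_tests: Callable[[], bool],
                    propose_fix: Callable[[int], None],
                    max_attempts: int = 10) -> bool:
    """Let a tenacious model retry inside a hard constraint:
    a change is only accepted once the test suite passes."""
    for attempt in range(max_attempts):
        if run_tests():
            return True           # tests pass: stop retrying
        propose_fix(attempt)      # e.g. feed the failing output back to the model
    return run_tests()            # the last attempt may have fixed it
```

In practice `run_tests` would shell out to pytest or `cargo test`, and `propose_fix` would be a model call with the failure log in the prompt.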

That said, they are impressive for open source models. It's amazing what you can do with self-hosted now. Just don't believe the hype that these are Sonnet 4.5 level models because you're going to be very disappointed once you get into anything complex.

kir-gadjello|23 hours ago

Respectfully, from my experience and a few billion tokens consumed, some open source models really are strong and useful. Specifically StepFun-3.5-flash: https://github.com/stepfun-ai/Step-3.5-Flash

I'm working on a pretty complex Rust codebase right now, with hundreds of integration tests and nontrivial concurrency, and stepfun powers through.

I have no relation to stepfun, and I'm saying this purely from deep respect to the team that managed to pack this performance in 196B/11B active envelope.

jasonni|16 hours ago

What coding agent do you use with StepFun-3.5-flash? I just tried it from siliconflow's API with opencode. Tool calling is broken: AI_InvalidResponseDataError: Expected 'function.name' to be a string.

copperx|19 hours ago

Are you using stepfun mostly because it's free, or is it better than other models at some things?

mycall|9 hours ago

TDD is really the delineation between being successful or not when using [local] LLMs.

Aurornis|7 hours ago

> some opensource models really are strong and useful

To be clear I never said they weren’t strong or useful. I use them for some small tasks too.

I said they’re not equivalent to SOTA models from 6 months ago, which is what is always claimed.

Then it turns into a motte-and-bailey game where that argument is replaced with the simpler claim that they're useful for open weights models. I'm not disagreeing with that part. I'm disagreeing with the first assertion, that they're equivalent to Sonnet 4.5.

aappleby|23 hours ago

What are you running that model on?

lend000|18 hours ago

Yes and no. "Last-gen" (like, from 6 months ago) frontier models do still tend to outperform the best open source models. But some models, especially GLM-5, really have captured whatever circuitry drives pattern matching in the models they were trained off of.

I like this benchmark that competes models against one another in competitive environments, which seems like it can't really be gamed: https://gertlabs.com

Aurornis|7 hours ago

> Yes and no. "Last-gen" (like, from 6 months ago) frontier models do still tend to outperform the best open source models

That’s exactly what I said, though. The headline we’re commenting under claims they’re Sonnet 4.5 level but they’re not.

I don’t disagree that they’re powerful for open models. I’m pointing out that anyone reading these headlines who expects a cheap or local Sonnet 4.5 is going to discover that it’s not true.

wolvoleo|23 hours ago

All models are doing that. Not only the open source ones.

I bet the cloud ones are doing it a lot more because they can also affect the runtime side which the open source ones can't.

red75prime|19 hours ago

I wouldn't mind them benchmaxing my queries.

dimgl|21 hours ago

I'm using Qwen 3.5 27b on my 4090 and let me tell you. This is the first time I am seriously blown away by coding performance on a local model. They are almost always unusable. Not this time though...

smahs|10 hours ago

The 27B dense model is probably the best of the 3.5 lot, not in absolute terms but for performance per size. It's also pretty good at prose, which is a rarity for a Qwen.

bibstha|17 hours ago

You don't need the coding version of the model from Qwen? The base 3.5 works?

rudhdb773b|22 hours ago

Are there any up-to-date offline/private agentic coding benchmark leaderboards?

If the tests haven't been published anywhere and are sufficiently different from standard problems, I would think the benchmarks would be robust to intentional over-optimization.

Edit: These look decent and generally match my expectations:

https://www.apex-testing.org/

chaboud|22 hours ago

"When a measure becomes a target, it ceases to be a good measure."

Goodhart's law shows up with people, in system design, in processor design, in education...

Models are going to be over-fit to the tests unless scruples or practical application realities intervene. It's a tale as old as machine learning.

spwa4|10 hours ago

This is because of the forbidden argument in statistics. Any statistic, even something as basic as an average, ONLY works if you can guarantee the independence of the individual facts it measures.

But there's a problem with that: of course the existence of the statistical measure itself is very much a link between all those individual facts. In other words: if there is ANY causal link between the statistical measure and the events measured ... it has now become bullshit (because the law of large numbers doesn't apply anymore).

So let's put it into practice: say there's a running contest, and you display the minimum, maximum and average time of all runners that have had their turns. We all know what happens: of course the result is that the average trends up. And yet that's exactly what statistics guarantees won't happen; the average should go up or down with roughly 50% odds each time a new runner is added. This is because showing the average causes behavior changes in the next runner.

This means, of course, that basing a decision on something as trivial as what the average running time was last year can only be mathematically defensible ONCE. The second time the average is wrong, and you're basing your decision on wrong information.

But of course, not only will most people actually deny this is the case, this is also how 99.9% of human policy making works. And it's mathematically wrong! Simple, fast ... and wrong.
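The runner example can be made concrete with a toy simulation (my sketch, not the commenter's; the numbers are made up). Independent runners keep the average near the true pace; runners who pace against the displayed average make it drift:

```python
import random

def final_average(coupled: bool, n: int = 5000, seed: int = 0) -> float:
    """Toy race: each runner's 'true' time is 100 plus noise.
    If coupled, each runner instead paces against the displayed
    running average (aiming to finish just behind it), so the
    samples are no longer independent and the average drifts up."""
    rng = random.Random(seed)
    times = [100 + rng.gauss(0, 2)]
    for _ in range(n - 1):
        shown_avg = sum(times) / len(times)
        target = shown_avg + 1.0 if coupled else 100.0
        times.append(target + rng.gauss(0, 2))
    return sum(times) / len(times)
```

With independent runners the final average stays near 100; with the feedback loop it ends up several units higher, even though each runner's noise is identical. The displayed statistic itself is the causal link between the samples.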

warpspin|12 hours ago

Hmm, I second this. I haven't compared Qwen3.5 122B yet, but I played around with OpenCode + Qwen3-Coder-Next yesterday and did manual comparisons with Claude Code, and Claude Code is still far ahead in overall felt "intelligence quality".

crystal_revenge|22 hours ago

> they always disappoint in actual use.

I’ve switched to using Kimi 2.5 for all of my personal usage and am far from disappointed.

Aside from being much cheaper than the big names (yes, I’m not running it locally, but like that I could) it just works and isn’t a sycophant. Nice to get coding problems solved without any “That’s a fantastic idea!”/“great point” comments.

At least with Kimi my understanding is that beating benchmarks was a secondary goal to good developer experience.

regularfry|11 hours ago

Just going to echo this. Been using K2.5 in opencode as a switch away from Opus because it was too expensive for the sorts of things I was playing with, and it's been great. There's definitely a bit of learning to get the hang of what sort of prompts to give it and to make sure there's enough documentation in the project for it, but it's remarkably capable once you're in the swing of it.

amelius|1 day ago

Are you saying that the benchmarks are flawed?

And could quantization maybe partially explain the worse than expected results?

TrainedMonkey|23 hours ago

No, what he is saying is that benchmarks are static and there is tremendous reputational and financial pressure to make benchmark number go up. So you add specific problems to training data... The result is that the model is smarter, but the benchmarks overstate the progress. Sure there are problem sets designed to be secret, but keeping secrets is hard given the fraction of planetary resources we are dedicating to making the AI numbers go up.

I have two of my own comments to add to that. The first is that there's a problem-alignment issue at play: the benchmarks are mostly self-contained problems with well-defined solutions and specific prompt language, while human tasks are open-ended, with messy prompts and much steerage. The second is that it would be interesting to test older models on brand-new benchmarks to see how they compare.

Aurornis|23 hours ago

The models outperform on the benchmarks relative to general tasks.

The benchmarks are public. They're guaranteed to be in the training sets by now. So the benchmarks are no longer an indicator of general performance because the specific tasks have been seen before.

> And could quantization maybe explain the worse than expected results?

You can use the models through various providers on OpenRouter cheaply without quantization.

girvo|23 hours ago

Flawed? Possibly, but I think it's more that any kind of benchmark then becomes a target, and is inherently going to be a "lossy" signal as to the model's actual ability in practice.

Quantisation doesn't help, but even running full fat versions of these models through various cloud providers, they still don't match Sonnet in actual agentic coding uses: at least in my experience.

noosphr|23 hours ago

It's not just the open source ones.

The only benchmarks worth anything are dynamic ones which can be scaled up.

ekjhgkejhgk|10 hours ago

I've been trying to self-host these things and have them use tools. Am I right in understanding that it's impossible for these things to use tools from within llama.cpp? Do I need another "thing" to run the models? What exactly is the mechanism by which the models become aware that they're somewhere where they have tools available? So many questions...
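For what it's worth: llama.cpp's `llama-server` exposes an OpenAI-compatible `/v1/chat/completions` endpoint, and it accepts a `tools` array there (you typically start the server with `--jinja` so the model's chat template can render the tool schemas into the prompt; that prompt injection is the whole "mechanism" by which the model learns it has tools). A sketch of the request body, with a made-up `get_weather` tool:

```python
import json

# Hypothetical tool definition in the OpenAI function-calling schema,
# which llama-server's /v1/chat/completions endpoint understands.
payload = {
    "model": "local",
    "messages": [{"role": "user", "content": "Weather in Paris?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",  # made-up example tool
            "description": "Look up current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}

# POST this to http://localhost:8080/v1/chat/completions. If the model
# decides to use the tool, the response contains `tool_calls` entries
# whose arguments your own code must execute and feed back as a
# role="tool" message; llama.cpp only generates the call, it never
# runs anything itself.
body = json.dumps(payload)
```

So you don't strictly need another "thing" to serve the model, but you do need a client loop (an agent like opencode, or your own script) that actually executes the tool calls.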

baq|14 hours ago

they're distilling claude and openai obviously.

that said, sonnet 4.5 is not a good model today, March 1st 2026. (it blew my mind on its release day, September 29th, 2025.)

ekianjo|19 hours ago

> That said, they are impressive for open source models.

there is nothing open "source" about them. They are open weights, that's all.

eurekin|23 hours ago

Very good point. I'm playing with them too and got to the same conclusion.

jackblemming|23 hours ago

Death by KPIs. Management makes it too risky to do anything but benchmaxx. It will be the death of American AI companies too. Eventually, people will notice models aren’t actually getting better and the money will stop flowing. However, this might be a golden age of research as cheap GPUs flood the market and universities have their own clusters.