top | item 44800496

haaz | 7 months ago

It is barely an improvement according to their own benchmarks. Not saying that's a bad thing, but it's not enough for anybody to notice any difference.

waynenilsen|7 months ago

I think it's probably mostly vibes, but that still counts. This is not in the charts:

> Windsurf reports Opus 4.1 delivers a one standard deviation improvement over Opus 4 on their junior developer benchmark, showing roughly the same performance leap as the jump from Sonnet 3.7 to Sonnet 4.

esafak|6 months ago

That is a big improvement.

ttoinou|7 months ago

That's why they named it 4.1 and not 4.5

zamadatix|7 months ago

When it's "that's why they incremented the version by a tenth instead of a half" you know things have really started to slow for the large models.

gloosx|7 months ago

They need to leave some room to release 10 more models. They could crank benchmarks to 100% but then no new model is needed lol? Pretty sure these pretty benchmark graphs are all completely staged marketing numbers, since the models solve the same problems they are trained on; no novel or unseen problems are presented to them.

Topfi|7 months ago

I am still very early, but output-quality-wise, yes, there does not seem to be any noticeable improvement in my limited personal testing suite. What I have noticed though is subjectively better adherence to instructions and documentation provided outside the main prompt, though I have no way to quantify or reliably test that yet. So beyond reliably finding needles in the haystack (which frontier models have done well on lately), Opus 4.1 seems to do better than Opus 4 at following those needles even when not explicitly guided to.

onlyrealcuzzo|7 months ago

I will only add that it's interesting that in the results graphic they simply highlighted Opus 4.1, choosing not to display which models have the best scores, as Opus 4.1 only scored the best on about half of the benchmarks and was worse than Opus 4.0 on at least one measure.

levocardia|7 months ago

"You pay $20/mo for X, and now I'm giving you 1.05*X for the same price." Outrageous!

leetharris|7 months ago

Good! I'm glad they are just giving us small updates. Opus 4 just came out; if you have small improvements, why not just release them? There's no downside for us.

AstroBen|7 months ago

I don't think this could even be called an improvement? It's small enough that it could just be random chance.
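To put a rough number on "could just be random chance", here is a minimal sketch, assuming the benchmark is a set of independent pass/fail tasks. The task count and both scores are hypothetical placeholders, not figures from any published chart. Treating the two scores as independent samples is also an assumption; if both models ran on the same tasks, a paired test would be more appropriate.

```python
import math

def pass_rate_stderr(p: float, n: int) -> float:
    """Binomial standard error of a pass rate p measured on n tasks."""
    return math.sqrt(p * (1.0 - p) / n)

def diff_z_score(p_old: float, p_new: float, n: int) -> float:
    """z-score of the gap between two pass rates, each measured on n tasks.

    Assumes the two measurements are independent (a simplification).
    """
    se = math.sqrt(pass_rate_stderr(p_old, n) ** 2 + pass_rate_stderr(p_new, n) ** 2)
    return (p_new - p_old) / se

# Hypothetical numbers, for illustration only.
n_tasks = 500
old_score, new_score = 0.72, 0.74

z = diff_z_score(old_score, new_score, n_tasks)
print(f"stderr of one score: {pass_rate_stderr(old_score, n_tasks):.3f}")
print(f"z-score of the gap:  {z:.2f}")
print("within noise at the 95% level" if abs(z) < 1.96 else "unlikely to be noise alone")
```

With these made-up inputs, a two-point gap on a few hundred tasks is smaller than the sampling error of the comparison, which is the kind of situation the comment describes.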

j_bum|7 months ago

I’ve always wondered about this actually. My assumption is that they always “pick the best” result from these tests.

Instead, ideally they’d run the benchmark tests many times, and share all of the results so we could make statistical determinations.
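If per-run scores were published, one way to make that statistical determination is a permutation test. The following is a minimal sketch under stated assumptions: the function name `permutation_p_value` and the run scores are hypothetical, invented for illustration, and nothing here reflects how any lab actually reports results.

```python
import random
from statistics import mean

def permutation_p_value(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test for a difference in mean score.

    Shuffles the pooled scores and counts how often a random split
    produces a gap at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(b) - mean(a))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        gap = abs(mean(pooled[len(a):]) - mean(pooled[:len(a)]))
        if gap >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_resamples + 1)

# Hypothetical per-run benchmark scores (percent), for illustration only.
old_runs = [72.1, 73.0, 71.8, 72.6, 72.9]
new_runs = [73.4, 72.8, 74.1, 73.0, 73.7]

p = permutation_p_value(old_runs, new_runs)
print(f"mean old = {mean(old_runs):.2f}, mean new = {mean(new_runs):.2f}, p = {p:.4f}")
```

A small p-value would suggest the gap is unlikely to come from run-to-run variance alone; a single "best of N" number gives no way to check that.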