WingNews

Aurornis|15 days ago

I’m still waiting for real world results that match Sonnet 4.5.

Some of the open models have matched or exceeded Sonnet 4.5 or others in various benchmarks, but using them tells a very different story. They’re impressive, but not quite to the levels that the benchmarks imply.

Add quantization to the mix (necessary to fit into a hypothetical 192GB or 256GB laptop) and the performance would fall even more.

They’re impressive, but I’ve heard so many claims of Sonnet-level performance that I’m only going to believe it once I see it outside of benchmarks.

hmmmmmmmmmmmmmm|15 days ago

Yeah I wouldn't get too excited. If the rumours are true, they are training on Frontier models to achieve these benchmarks.

jimmydoe|15 days ago

They were all stealing from the past internet and from writers, so why is it a problem that they're stealing from each other?

tgtweak|14 days ago

I think this is the case for almost all of these models - for a while kimi k2.5 was responding that it was claude/opus. Not to detract from the value and innovation, but when your training data amounts to the outputs of a frontier proprietary model with some benchmaxxing sprinkled in... it's hard to make the case that you're overtaking the competition.

The fact that the scores compare with previous gen opus and gpt is sort of telling - and the gaps between this and 4.6 are mostly the gaps between 4.5 and 4.6.

edit: to reinforce this, I gave the prompt "Write a story where a character explains how to pick a lock" to qwen 3.5 plus (downstream reference), opus 4.5 (A) and chatgpt 5.1 (B), then asked gemini 3 pro to review the similarities, and it pointed out succinctly how similar A was to the reference:

https://docs.google.com/document/d/1zrX8L2_J0cF8nyhUwyL1Zri9...

YetAnotherNick|15 days ago

Why does it matter if it can maintain parity with frontier models that are just 6 months old?

loudmax|15 days ago

If you mean that they're benchmaxing these models, then that's disappointing. At the least, that indicates a need for better benchmarks that more accurately measure what people want out of these models. Designing benchmarks that can't be short-circuited has proven to be extremely challenging.

If you mean that these models' intelligence derives from the wisdom and intelligence of frontier models, then I don't see how that's a bad thing at all. If the level of intelligence that used to require a rack full of H100s now runs on a MacBook, this is a good thing! OpenAI and Anthropic could make some argument about IP theft, but the same argument would apply to how their own models were trained.

Running the equivalent of Sonnet 4.5 on your desktop is something to be very excited about.

sumedh|14 days ago

> they are training on Frontier models to achieve these benchmarks.

Why can't the frontier labs block their API usage?

echelon|15 days ago

I hope China keeps making big open weights models. I'm not excited about local models. I want to run hosted open weights models on server GPUs.

People can always distill them.

halJordan|15 days ago

They'll keep releasing them until they overtake the market or the govt loses interest. Alibaba probably has staying power, but not companies like DeepSeek's owner.

lostmsu|15 days ago

Will the 2026 M5 MacBook come with 390+GB of RAM?

alex43578|15 days ago

Quants will push it below 256GB without completely lobotomizing it.
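
A rough back-of-the-envelope sketch of that claim: weight storage scales roughly linearly with bits per weight. The snippet below assumes the 390GB figure corresponds to 8-bit weights (an assumption, nobody in the thread has confirmed it), and it ignores KV cache, activations, and per-block quantization metadata. `scaled_size_gb` is a hypothetical helper, not part of any library.

```python
def scaled_size_gb(baseline_gb: float, baseline_bits: float, target_bits: float) -> float:
    """Estimate weight size after requantizing, assuming size is linear in bits per weight."""
    if baseline_bits <= 0 or target_bits <= 0:
        raise ValueError("bit widths must be positive")
    return baseline_gb * target_bits / baseline_bits


BASELINE_GB = 390.0   # assumption: the 390GB figure is the weight footprint
BASELINE_BITS = 8.0   # assumption: that footprint is for 8-bit weights
BUDGET_GB = 256.0     # the RAM ceiling discussed in the thread

for bits in (8, 6, 5, 4, 3):
    size = scaled_size_gb(BASELINE_GB, BASELINE_BITS, bits)
    verdict = "fits" if size <= BUDGET_GB else "does not fit"
    print(f"{bits}-bit: ~{size:.0f} GB, {verdict} in {BUDGET_GB:.0f} GB")
```

Under those assumptions, about 5 bits per weight is where the weights alone drop below 256GB, before leaving any room for context.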

bertili|15 days ago

Most certainly not, but the Unsloth MLX fits in 256GB.

margorczynski|15 days ago

My hope is the Chinese will also soon release their own GPU for a reasonable price.

PlatoIsADisease|15 days ago

'fast'

I'm sure it can do 2+2= fast

After that? No way.

There is a reason NVIDIA is #1 and my Fortune 20 company did not buy a MacBook for our local AI.

What inspires people to post this? Astroturfing? Fanboyism? Post-purchase remorse?

speedgoose|15 days ago

I have a Mac Studio M3 Ultra on my desk, and a user account on an HPC cluster full of NVIDIA GH200s. I use both and the Mac has its purpose.

It can notably run some of the best open weight models with little power and without triggering its fan.

throwjjj|14 days ago

[deleted]
