top | item 46873285


skhameneh | 26 days ago

It’s hard to overstate just how wild this model might be if it performs as claimed. The claim is that it can perform close to Sonnet 4.5 on assisted coding (SWE-bench) while using only 3B active parameters. That is obscenely small for the claimed performance.


Aurornis|26 days ago

I experimented with the Q2 and Q4 quants. First impression is that it's amazing we can run this locally, but it's definitely not at Sonnet 4.5 level at all.

Even on my usual toy coding problems it would get simple things wrong and require some poking to get right.

A few times it got stuck in thinking loops and I had to cancel prompts.

This was using the recommended settings from the unsloth repository. It's always possible that there are some bugs in early implementations that need to be fixed later, but so far I don't see any reason to believe this is actually a Sonnet 4.5 level model.

Kostic|26 days ago

I would not go below Q8 if comparing to Sonnet.

cubefox|26 days ago

> I experimented with the Q2 and Q4 quants.

Of course you get degraded performance with this.
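For a rough sense of why the quant level matters, here is a back-of-the-envelope size sketch. The 30B total parameter count below is hypothetical (the thread only states 3B *active* parameters), and the effective bits-per-weight figures are approximations, since real GGUF quants carry extra scale metadata:

```python
def quant_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate on-disk size of a quantized model in GiB."""
    return n_params * bits_per_weight / 8 / 2**30

# Hypothetical 30B-total-parameter model; effective bits-per-weight
# values are rough averages, not exact GGUF format sizes.
for quant, bits in {"Q2_K": 2.6, "Q4_K_M": 4.8, "Q8_0": 8.5}.items():
    print(f"{quant}: ~{quant_size_gib(30e9, bits):.1f} GiB")
```

The point being that Q2 is roughly a third the size of Q8, and that compression is exactly where quality loss creeps in.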

margalabargala|26 days ago

Wonder where it falls on the Sonnet 3.7/4.0/4.5 continuum.

3.7 was not all that great. 4 was decent for specific things, especially self-contained stuff like tests, but couldn't do a good job with more complex work. 4.5 is now excellent at many things.

If it's around the perf of 3.7, that's interesting but not amazing. If it's around 4, that's useful.

cmrdporcupine|26 days ago

It feels more like Haiku level than Sonnet 4.5 from my playing with it.

cirrusfan|26 days ago

If it sounds too good to be true…

theshrike79|26 days ago

Should be possible with optimised models, just drop all "generic" stuff and focus on coding performance.

There's no reason for a coding model to contain all of ao3 and wikipedia =)

FuckButtons|26 days ago

There have been advances recently (in the last year) in scaling deep RL by a significant amount, and their announcement fits the timeline you'd expect for running enough experiments to figure out how to leverage that in post-training.

Importantly, this isn’t just throwing more data at the problem in an unstructured way. AFAIK companies are gathering as many git histories as they can and doing something along the lines of: get an LLM to checkpoint pull requests, features, etc. and convert those into plausible input prompts, then run deep RL with something that passes the acceptance criteria / tests as the reward signal.

Der_Einzige|26 days ago

It literally always is. HN thought DeepSeek and every version of Kimi would finally dethrone the bigger models from Anthropic, OpenAI, and Google. They're literally always wrong, and the average knowledge of LLMs here is shockingly low.