top | item 47054498

jorl17 | 12 days ago

I ran the same test I ran on Opus 4.6: feeding it my whole personal collection of ~900 poems, which spans ~16 years.

It is a far cry from Opus 4.6.

Opus 4.6 was (is!) a giant leap, the largest since Gemini 2.5 pro. Didn't hallucinate anything and produced honestly mind-blowing analyses of the collection as a whole. It was a clear leap forward.

Sonnet 4.6 feels like an evolution of whatever the previous models were doing. It is marginally better in the sense that it seemed to make fewer mistakes, or less severe ones, but ultimately it made all the usual mistakes (making things up, saying it'll quote a poem and then quoting another, getting time periods mixed up, etc).

My initial experiments with coding leave the same feeling. It is better than previous similar models, but a long distance away from Opus 4.6. And I've really been spoiled by Opus.
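One of the failure modes above (announcing a quote from one poem and then quoting another) can be checked mechanically. The following is a minimal sketch, not the procedure used in this test: it assumes the collection is available as a title-to-text mapping, and `normalize`, `check_quote`, and the sample poems are hypothetical names and data made up for illustration.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse punctuation/whitespace so formatting differences are ignored."""
    return re.sub(r"[\W_]+", " ", text.lower()).strip()


def check_quote(quote: str, claimed_title: str, poems: dict[str, str]) -> str:
    """Classify a quoted passage as 'ok', 'misattributed:<title>', or 'not_found'.

    Matching is exact on whole words after normalization; paraphrases or
    partially altered quotes are reported as 'not_found'.
    """
    needle = normalize(quote)
    if not needle:
        return "not_found"
    padded = f" {needle} "  # pad with spaces so only whole-word matches count

    # First check the poem the model claimed to be quoting.
    if padded in f" {normalize(poems.get(claimed_title, ''))} ":
        return "ok"

    # Otherwise look for the passage anywhere else in the collection.
    for title, body in poems.items():
        if padded in f" {normalize(body)} ":
            return f"misattributed:{title}"
    return "not_found"


# Hypothetical two-poem collection, for illustration only.
sample = {
    "Tide": "The sea returns, as it always does.",
    "Ash": "Nothing burns twice.",
}
print(check_quote("the sea returns", "Tide", sample))
print(check_quote("Nothing burns twice", "Tide", sample))
print(check_quote("a line never written", "Tide", sample))
```

This only catches verbatim misquotes and fabricated lines; mixed-up time periods or invented biographical claims would still need a human read.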

K0balt|12 days ago

Opus 4.6 is outstanding for code and, from the little I have used it outside of that context, for everything else I have tried it on. My productivity with code is at least 3x what I was getting with 5.2, and it can handle entire projects fairly responsibly. It doesn’t patronize the user, and it makes a very strong effort to capture and follow intentions. Unlike with 5.2, I’ve never had to throw out a day’s work that it covertly screwed up by taking shortcuts and just guessing.

renmillar|12 days ago

That last part is a real one though, mine tried to debug a Dockerfile by poking around my local environment outside of Docker today.

linolevan|12 days ago

Oh! Poem guy is back, hey!

I like seeing this analysis on new model releases, any chance you can aggregate your opinions in one place (instead of the hackernews comment sections for these model releases)?

hypercube33|11 days ago

Opus 4.6 has been awful for me and my team. It goes immediately off the rails, jumps to conclusions about wants and asks, and just keeps chugging along forever down whatever path it decides on, and nothing will stop it. 4.5 was awesome and is still our go-to model.

majora2007|11 days ago

That's interesting, 4.6 is finally when AI started to become good in my eyes. I have a very strict plan phase: argue, plan, then partially execute. I like it to do the boilerplate, then I do the hard stuff myself and have it do a once-over at the end.

Although I have had it try to debug something and just get stuck chugging tokens.

1broseidon|11 days ago

I have found this to be true too and I thought I was the only one. Everyone is praising 4.6, and while it’s great at agentic work and tool use, it does not follow instructions as cleanly as 4.5. I also feel like 4.5 was just way more efficient too.

cube2222|11 days ago

This seems to agree with my own previous tests of Sonnet vs Opus (not on this version). If I give them a task with a large list of constraints ("do this, don't do this, make sure of this"), like 20-40, Sonnet will forget half of it, while Opus correctly applies all directives.

My intuition is this is just related to model size / its "working memory", and will likely neither be fixed by training Sonnet with Opus nor by steadily optimizing its agentic capabilities.
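A rough sketch of how a constraint-list test like this could be scored, assuming each directive can be expressed as a programmatic check on the model's output. The names `score_constraints` and the three example constraints are hypothetical and are not taken from the tests described above.

```python
from typing import Callable

# A constraint is a named predicate over the model's output text.
Constraint = Callable[[str], bool]


def score_constraints(
    output: str, constraints: dict[str, Constraint]
) -> tuple[float, list[str]]:
    """Return (fraction of constraints satisfied, names of the failed ones)."""
    failed = [name for name, check in constraints.items() if not check(output)]
    if not constraints:
        return 1.0, failed
    return (len(constraints) - len(failed)) / len(constraints), failed


# Hypothetical directives; a real run would have 20-40 of these.
constraints: dict[str, Constraint] = {
    "no_todo": lambda s: "TODO" not in s,
    "has_docstring": lambda s: '"""' in s,
    "max_5_lines": lambda s: len(s.splitlines()) <= 5,
}

model_output = "def f():\n    # TODO\n    return 1\n"
score, failed = score_constraints(model_output, constraints)
print(f"{score:.2f}", failed)
```

Running the same prompt and checklist against each model gives a comparable adherence rate per model, though it only covers directives that can be checked mechanically.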

versteegen|11 days ago

I'd agree that this effect is probably mainly due to architectural parameters such as the number and dimensions of heads and the hidden dimension, rather than model size (number of parameters) or less training.

Saw something about Sonnet 4.6 having had a greatly increased amount of RL training over 4.5.

jxmesth|11 days ago

I'm curious how this would compare with Codex 5.3. I've heard Codex is actually pretty good, but Opus 4.6 has become synonymous with AI coding because all the big names praise it. I haven't compared them against each other though, so I can't really draw a conclusion.

zarzavat|11 days ago

There are no universals. You have to try it on your particular codebase and see what works for you.

For me, OpenAI is ahead in intelligence, and Anthropic is ahead in alignment. I use both but for different tasks.

Given the pace of change, intuition is somewhat of a liability: what's true today may not be true tomorrow. You have to constantly keep an open mind and try new things.

Listening to influencers is a waste of time.

stingraycharles|12 days ago

Given that Sonnet is the cheaper “workhorse” alternative to Opus, isn’t this expected?

hesgyrxgh|11 days ago

I'm curious if you tried the same prompt with ChatGPT 5.2. Did it not give you a mind-blowing analysis?

Valakas_|11 days ago

Thanks for testing and sharing your results.

slopinthebag|12 days ago

How do you evaluate the analyses?