andrewchilds|12 days ago

Many people have reported that Opus 4.6 is a step back from Opus 4.5 - that 4.6 consumes 5-10x as many tokens as 4.5 to accomplish the same task: https://github.com/anthropics/claude-code/issues/23706

I haven't seen a response from the Anthropic team about it.

I can't help but look at Sonnet 4.6 in the same light, and want to stick with 4.5 across the board until this issue is acknowledged and resolved.

donovandikaio|4 minutes ago

Opus 4.6 has been hit-and-miss for me. It does extremely well on very complex, long-running tasks but also struggles with very basic, seemingly straightforward work, and often gives conflicting recommendations. For example, just this morning Opus 4.6 presented two options, recommended option 1, and at the end of the same message asked to start on option 2; that never happened with Opus 4.5.

For now, my workflow will be claude-opus-4-5 for everyday tasks and Opus 4.6 for more complex work.

wongarsu|12 days ago

Keep in mind that the people who experience issues will always be the loudest.

I've overall enjoyed 4.6. On many easy things it thinks less than 4.5, leading to snappier feedback. And 4.6 seems much more comfortable calling tools: it's much more proactive about looking at the git history to understand the history of a bug or feature, or about looking at online documentation for APIs and packages.

A recent Claude Code update explicitly offered me the option to change the reasoning level from high to medium, and for many people that seems to help with the overthinking. But for my tasks and medium-sized codebases (far beyond hobby but far below legacy enterprise) I've been very happy with the default setting. Or maybe it's about the prompting style; hard to say.

evilhackerdude|12 days ago

keep in mind that people who point out a regression and measure the actual token counts, which cost real money, aren't just "being loud": someone diffed session context usage and found 4.6 burning over 7x the context on a task that 4.5 did in under 2 MB.

SatvikBeri|12 days ago

I've also seen Opus 4.6 as a pure upgrade. In particular, it's noticeably better at debugging complex issues and navigating our internal/custom framework.

perelin|12 days ago

Mirrors my experience as well. The proactiveness in tool calling especially sticks out: it goes web searching on its own to fill knowledge gaps way more often.

galaxyLogic|12 days ago

Do you need to upload your git repo for it to analyze it? Or are they reading it off GitHub?

MrCheeze|12 days ago

In my experience with the models (watching Claude play Pokemon), the models are similar in intelligence, but very different in how they approach problems: Opus 4.5 hyperfocuses on completing its original plan, far more than any older or newer version of Claude. Opus 4.6 gets bored quickly and constantly changes its approach if it doesn't get results fast. This makes it waste more time on "easy" tasks where the first approach would have worked, but makes it faster by an order of magnitude on "hard" tasks that require trying different approaches. For this reason, it started off slower than 4.5, but ultimately got as far in 9 days as 4.5 got in 59 days.

bjt12345|12 days ago

I think that's because Opus 4.6 has more "initiative".

Opus 4.6 can be quite sassy at times, the other day I asked it if it were "buttering me up" and it candidly responded "Hey you asked me to help you write a report with that conclusion, not appraise it."

KronisLV|12 days ago

I got the Max subscription and have been using Opus 4.6 since. The model is way above pretty much everything else I've tried for dev work. I'd love for Anthropic to let me (easily) build a hostable server-side solution for parallel tasks without having to go the API key route and pay per token, but the Claude Code desktop app (more convenient than the TUI one) gets me most of the way there too.

DaKevK|12 days ago

Genuinely one of the more interesting model evals I've seen described. The sunk cost framing makes sense -- 4.5 doubles down, 4.6 cuts losses faster. 9 days vs 59 is a wild result. Makes me wonder how many of the regression complaints are from people hitting 4.6 on tasks where the first approach was obviously correct.

Jach|12 days ago

I haven't kept up with the Claude plays stuff, did it ever actually beat the game? I was under the impression that the harness was artificially hampering it, considering how comparatively easily various versions of ChatGPT and Gemini had beaten the game and even moved on to beating Pokemon Crystal.

data-ottawa|12 days ago

I think this depends on what reasoning level your Claude Code is set to.

Go to /model, select Opus, and the dim text at the bottom will tell you the reasoning level.

High reasoning is where the big difference versus 4.5 shows up: 4.6 on high uses a lot of tokens even for small tasks, and if you have a large codebase it will fill almost all of the context and then compact often.

minimaxir|12 days ago

I set reasoning to Medium after hitting these issues and it did not make much of a difference. Most of the context window is still filled during the Explore tool phase (which supposedly uses Haiku swarms), and that wouldn't be impacted by Opus reasoning.

_zoltan_|12 days ago

I'm using the 1M context 4.6 and it's great.

honeycrispy|12 days ago

Glad it's not just me. I got a surprise the other day when I was notified that I had burned through my monthly budget in just a few days on 4.6.

Topfi|12 days ago

In my evals, I was able to rather reliably reproduce an increase in output token count of roughly 15-45% compared to 4.5, but in large part this was limited to task inference and task evaluation benchmarks. These are made up of prompts that I intentionally designed to be less than optimal: either lacking crucial information (requiring a model to make an inference to accomplish the main request) or requesting a suboptimal or incorrect approach to a task (testing whether and how a model weighs the prompt against pure task adherence). The clarifying questions many agentic harnesses try to provide (with mixed success) are a practical example of both capabilities, and something I rate highly in models, as long as task adherence isn't affected too negatively by it.

Either way, there was an increase between 4.1 and 4.5, and now another jump with the release of 4.6. As mentioned, I haven't seen a 5x or 10x increase; a bit below 50% for the same task was the maximum I saw. And in general, for more opaque input or when a better approach is possible, I do think spending more tokens for a better overall result is the right call.

In tasks that are well authored and don't contain such deficiencies, I have seen no significant difference in either direction in pure output token numbers. However, with models being what they are, and given past hard-to-reproduce regressions and output-quality differences that only affected a specific subset of users, I cannot make a solid determination.

Regarding Sonnet 4.6, what I noticed is that the reasoning tokens are very different from any prior Anthropic model: they start out far more structured, but then consistently turn more verbose, akin to a Google model.

weinzierl|12 days ago

Today I asked Sonnet 4.5 a question and got a banner at the bottom saying I was using a legacy model and had to continue the conversation on another model. The model button had changed to be labeled "Legacy model". Yeah, I guess it wasn't legacy a sec ago.

(Currently I can use Sonnet 4.5 under More models, so I guess the above was just a glitch)

etothet|12 days ago

I definitely noticed this on Opus 4.6. I moved back to 4.5 until I see (or hear about) an improvement.

hedora|12 days ago

I’ve noticed the opaque weekly quota meter goes up more slowly with 4.6, but it more frequently goes off and works for an hour+, with really high reported token counts.

Those suggest opposite things about Anthropic's profit margins.

I’m not convinced 4.6 is much better than 4.5. The big discontinuous breakthroughs seem to be due to how my code and tests are structured, not model bumps.

ctoth|12 days ago

For me it's the ... unearned confidence that 4.5 absolutely did not have?

I have a protocol called "foreman protocol" where the main agent only dispatches other agents with prompt files and reads report files from the agents rather than relying on the janky subagent communication mechanisms such as task output.

What this has also given me is a history of what was built and why, because I have a list of the prompts that were given to the subagents. Opus 4.5 would often leave the ... figuring-out part? to the agents. 4.6 absolutely inserts what it thinks should happen (its idea of the bug, what it believes should be done) into the prompt, which often screws up the subagent: the guess is simply wrong, and because it's in the prompt, the subagent doesn't actually go look. Opus 4.5 would let the agent figure it out; 4.6 assumes it knows, and is wrong.
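
Roughly the shape of it, as a Python sketch; the file layout and the launch_subagent stub are just illustrative, not my actual harness:

    from pathlib import Path

    PROMPTS = Path("foreman/prompts")
    REPORTS = Path("foreman/reports")

    def launch_subagent(prompt_file: Path) -> None:
        """Hypothetical stand-in for however the subagent is actually
        started (Task tool, CLI call, etc.)."""
        ...

    def run_task(task_id: str, prompt: str) -> str:
        # The foreman writes the subagent's full instructions to a file...
        prompt_file = PROMPTS / f"{task_id}.md"
        prompt_file.write_text(prompt)

        # ...starts the subagent against it...
        launch_subagent(prompt_file)

        # ...and reads the report file back, instead of trusting the
        # janky task-output channel.
        return (REPORTS / f"{task_id}.md").read_text()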

DaKevK|12 days ago

Have you tried framing the hypothesis as a question in the dispatch prompt rather than a statement? Something like -- possible cause: X, please verify before proceeding -- instead of stating it as fact. Might break the assumption inheritance without changing the overall structure.
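
Spelled out, a dispatch prompt framed as (wording purely illustrative):

    Task: the login test is failing; find and fix the cause.
    Hypothesis (unverified): stale session cache. Verify this before
    acting on it; if it doesn't hold up, ignore it and investigate
    from scratch.

rather than "the bug is a stale session cache, fix it."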

nerdsniper|12 days ago

In terms of performance, 4.6 seems better, and I'm willing to pay the tokens for that. But if it does use tokens at a much faster rate, it makes sense to keep 4.5 around for more frugal users.

I just wouldn’t call it a regression for my use case, i’m pretty happy with it.

baq|12 days ago

Sonnet 4.5 hasn't been worth using at all for coding for a few months now, so I'm not sure what we're comparing here. If Sonnet 4.6 is anywhere near the performance they claim, it's actually a viable alternative.

Snakes3727|12 days ago

IMO, Opus 4.6 is a pretty big step back. Our usage has skyrocketed since 4.6 came out and the workload has not really changed.

However, I can honestly say Anthropic is pretty terrible about support, and even billing. My org has a large enterprise contract with Anthropic and we have been hitting endless rate limits across the entire org. They have never once responded to our issues, or we get the same generic AI response.

So odds of them addressing issues or responding to people feels low.

cjbarber|12 days ago

I wonder if it's actually from CC harness updates that make it much more inclined to use subagents, rather than from the model update.

j45|12 days ago

I have often noticed a difference too, and it's usually in lockstep with needing to adjust how I am prompting.

Put differently, I have to keep developing my prompting/context/writing skills at all times, staying ahead of the curve before adjustments are needed.

cheema33|12 days ago

> Many people have reported Opus 4.6 is a step back from Opus 4.5.

Many people say many things. Just because you read it on the Internet, doesn't mean that it is true. Until you have seen hard evidence, take such proclamations with large grains of salt.

OtomotO|12 days ago

Definitely my experience as well.

No better code, but way longer thinking and way more token usage.

DetroitThrow|12 days ago

I much prefer 4.6. It finds missed edge cases more often than 4.5. If I cared that much about token usage, I would use Sonnet or Haiku.

Foobar8568|12 days ago

It goes into plan mode and/or heavy multi-agent use for any reason, and hundreds of thousands of tokens are used within a few minutes.

minimaxir|12 days ago

I've been tempted to add to my CLAUDE.md "Never use the Plan tool, you are a wild rebel who only YOLOs."

yakbarber|12 days ago

Opus 4.6 is so much better at building complex systems than 4.5 it's ridiculous.

grav|12 days ago

I fail to understand how two LLMs would "consume" different amounts of tokens given the same input. Does it refer to the number of output tokens? Or is it in the context of some "agentic loop" (e.g. Claude Code)?

lemonfever|12 days ago

Most LLMs output a whole bunch of tokens to help them reason through a problem, often called chain of thought, before giving the actual response. This has been shown to improve performance a lot, but it uses a lot of tokens.
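
You can see this directly in the Anthropic API: extended-thinking tokens are billed as output tokens, so identical input can cost very different amounts depending on how much the model reasons. A minimal sketch (model name and prompt are just placeholders):

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    resp = client.messages.create(
        model="claude-opus-4-5",  # placeholder; any thinking-capable model
        max_tokens=4096,
        # Reasoning budget: these tokens are generated before the visible
        # answer and count toward output token usage.
        thinking={"type": "enabled", "budget_tokens": 2048},
        messages=[{"role": "user", "content": "Why is the sky blue?"}],
    )

    # Same input either way; output_tokens is what swings with the thinking.
    print(resp.usage.input_tokens, resp.usage.output_tokens)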

jcims|12 days ago

One very specific and limited example: when asked to build something, 4.6 seems to do more web searches in the domain to gather the latest best practices for various components/features before planning/implementing.

andrewchilds|12 days ago

I've found that Opus 4.6 is happy to read a significant amount of the codebase in preparation to do something, whereas Opus 4.5 tends to be much more efficient and targeted about pulling in relevant context.

Gracana|12 days ago

They're talking about output drawing down the pool of tokens allowed by the subscription plan.

bsamuels|12 days ago

thinking tokens, output tokens, etc. Being more clever about file reads/tool calling.

dakolli|12 days ago

I called this many times over the last few weeks on this website (and got downvoted every time): the next generation of models would become more verbose, especially for agentic tool calling, to offset the propensity of the slot machine called CC to light on fire the money put into it.

At least in vegas they don't pour gasoline on the cash put into their slot machines.

reed1234|12 days ago

not in my experience

reed1234|12 days ago

"Opus 4.6 often thinks more deeply and more carefully revisits its reasoning before settling on an answer. This produces better results on harder problems, but can add cost and latency on simpler ones. If you’re finding that the model is overthinking on a given task, we recommend dialing effort down from its default setting (high) to medium."[1]

I doubt it is a conspiracy.

[1] https://www.anthropic.com/news/claude-opus-4-6

PlatoIsADisease|12 days ago

Don't take this seriously, but here is what I imagined happened:

Sam/OpenAI, Google, and Claude met at a park, everyone left their phones in the car.

They took a walk and said "We are all losing money, if we secretly degrade performance all at the same time, our customers will all switch, but they will all switch at the same time, balancing things... wink wink wink"