top | item 46037979

llamasushi | 3 months ago

The burying of the lede here is insane. $5/$25 per MTok is a 3x price drop from Opus 4. At that price point, Opus stops being "the model you use for important things" and becomes actually viable for production workloads.
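
Back-of-the-envelope, with made-up token counts (40k in / 4k out for a single agent turn) purely to illustrate the ratio:

```python
# Hypothetical example: cost of one agent turn at the quoted $/MTok rates.
OPUS_45 = {"input": 5.00, "output": 25.00}   # Opus 4.5, $ per million tokens
OPUS_41 = {"input": 15.00, "output": 75.00}  # previous Opus pricing

def turn_cost(prices, tokens_in, tokens_out):
    """Dollar cost of a single request."""
    return (tokens_in * prices["input"] + tokens_out * prices["output"]) / 1_000_000

new = turn_cost(OPUS_45, 40_000, 4_000)
old = turn_cost(OPUS_41, 40_000, 4_000)
print(f"Opus 4.5: ${new:.2f} vs old Opus: ${old:.2f} ({old / new:.0f}x)")
```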

Also notable: they're claiming SOTA prompt injection resistance. The industry has largely given up on solving this problem through training alone, so if the numbers in the system card hold up under adversarial testing, that's legitimately significant for anyone deploying agents with tool access.

The "most aligned model" framing is doing a lot of heavy lifting though. Would love to see third-party red team results.

tekacs|3 months ago

This is also super relevant for everyone who had ditched Claude Code due to limits:

> For Claude and Claude Code users with access to Opus 4.5, we’ve removed Opus-specific caps. For Max and Team Premium users, we’ve increased overall usage limits, meaning you’ll have roughly the same number of Opus tokens as you previously had with Sonnet. We’re updating usage limits to make sure you’re able to use Opus 4.5 for daily work.

tifik|3 months ago

I like that for this brief moment we actually have a competitive market working in favor of consumers. I ditched my Claude subscription in favor of Gemini just last week. It won't be great when we enter the cartel equilibrium.

Aeolun|3 months ago

It’s important to note that with the introduction of Sonnet 4.5 they absolutely cratered the limits, the Opus limits in particular, so this just sort of brings us back closer to the situation we were actually in before.

js4ever|3 months ago

Interesting. I totally stopped using Opus on my Max subscription because it was eating 40% of my weekly quota in less than 2h

TrueDuality|3 months ago

Now THAT is great news

throwaway-aws9|3 months ago

Thanks. I unsubscribed when I busted my weekly limit in a few hours on the Max 20x plan when I had to use Opus over Sonnet. It really feels like they were off by an order of magnitude at some point when limits were introduced.

brianjking|3 months ago

They also reset limits today, which was also quite kind as I was already 11% into my weekly allocation.

astrange|3 months ago

Just avoid using Claude Research, which I assume still instantly eats most of your token limits.

sqs|3 months ago

What's super interesting is that Opus is cheaper all-in than Sonnet for many usage patterns.

Here are some early rough numbers from our own internal usage on the Amp team (avg cost $ per thread):

- Sonnet 4.5: $1.83

- Opus 4.5: $1.30 (earlier checkpoint last week was $1.55)

- Gemini 3 Pro: $1.21

Cost per token is not the right way to look at this. A bit more intelligence means mistakes (and wasted tokens) avoided.
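
The "avg cost per thread" metric is just a mean over per-thread spend; a minimal sketch, with hypothetical per-thread costs chosen to land on the quoted averages (the real figures come from Amp's internal telemetry):

```python
# Hypothetical per-thread dollar costs; the averages are what gets compared.
from statistics import mean

thread_costs = {
    "sonnet-4.5": [2.10, 1.40, 1.99],
    "opus-4.5":   [1.10, 1.60, 1.20],
}

avg = {model: round(mean(costs), 2) for model, costs in thread_costs.items()}
print(avg)
```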

localhost|3 months ago

Totally agree with this. I have seen many cases where a dumber model gets trapped in a local minimum and burns a ton of tokens trying to escape (sometimes unsuccessfully). In a toy example (a 30-minute agentic coding session: create a markdown -> html compiler, using a subset of the commonmark test suite to hill-climb on), dumber models would cost $18 (at retail token prices) to complete the task. Smarter models would see the trap and take only $3. YMMV.

Much better to look at cost per task - and good to see some benchmarks reporting this now.

leo_e|3 months ago

Hard agree. The hidden cost of 'cheap' models is the complexity of the retry logic you have to write around them.

If a cheaper model hallucinates halfway through a multi-step agent workflow, I burn more tokens on verification and error correction loops than if I just used the smart model upfront. 'Cost per successful task' is the only metric that matters in production.
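
One way to make that concrete (the failure rates here are invented, not measured): if each attempt succeeds independently with probability p, the expected number of attempts per success is 1/p, so:

```python
# Sketch with assumed numbers: expected cost per *successful* task when
# failed attempts are retried until one succeeds (independent attempts).
def cost_per_success(cost_per_attempt, p_success):
    # Geometric distribution: expected attempts per success is 1 / p_success.
    return cost_per_attempt / p_success

cheap = cost_per_success(0.60, 0.35)  # cheap model, assumed 35% task success
smart = cost_per_success(1.30, 0.95)  # smarter model, assumed 95% success
# Despite the lower sticker price, the cheap model costs more per completed task.
```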

andai|3 months ago

Yeah, that's a great point.

ArtificialAnalysis has an "intelligence per token" metric on which all of Anthropic's models are outliers.

For some reason, they need far fewer output tokens than everyone else's models to pass the benchmarks.

(There are of course many issues with benchmarks, but I thought that was really interesting.)

tmaly|3 months ago

what is the typical usage pattern that would result in these cost figures?

sharkjacobs|3 months ago

3x price drop almost certainly means Opus 4.5 is a different and smaller base model than Opus 4.1, with more fine tuning to target the benchmarks.

I'll be curious to see how performance compares to Opus 4.1 on the kind of tasks and metrics they're not explicitly targeting, e.g. eqbench.com

nostrademons|3 months ago

Why? They just closed a $13B funding round. Entirely possible that they're selling below-cost to gain marketshare; on their current usage the cloud computing costs shouldn't be too bad, while the benefits of showing continued growth on their frontier models is great. Hell, for all we know they may have priced Opus 4.1 above cost to show positive unit economics to investors, and then drop the price of Opus 4.5 to spur growth so their market position looks better at the next round of funding.

ACCount37|3 months ago

Probably more sparse (MoE) than Opus 4.1. Which isn't a performance killer by itself, but is a major concern. Easy to get it wrong.

cootsnuck|3 months ago

We already know distillation works pretty well. So it would definitely make sense for Opus 4.5 to be effectively smaller (like someone else said, it could be via MoE or some other technique too).

We know the big labs are chasing efficiency gains where they can.

adgjlsfhk1|3 months ago

It seems plausible that it's a similar size model and that the 3x drop is just additional hardware efficiency/lowered margin.

losvedir|3 months ago

I almost scrolled past the "Safety" section, because in the past it always seemed sort of silly sci-fi scaremongering (IMO) or things that I would classify as "sharp tool dangerous in the wrong hands". But I'm glad I stopped, because it actually talked about real, practical issues like the prompt injections that you mention. I wonder if the industry term "safety" is pivoting to refer to other things now.

wkat4242|3 months ago

Jailbreaking is trivial though. If anything really bad could happen it would have happened already.

And the prudishness of American models in particular is awful. They're really hard to use in Europe because they keep clamming up over things we consider normal.

NiloCK|3 months ago

Waymos, LLMs, brain computer interfaces, dictation and tts, humanoid robots that are worth a damn.

Ye best start believing in silly sci-fi stories. Yer in one.

cmrdporcupine|3 months ago

Note the comment when you start claude code:

"To give you room to try out our new model, we've updated usage limits for Claude Code users."

That really implies non-permanence.

Xlr8head|3 months ago

Still better than perma-nonce.

AtNightWeCode|3 months ago

The cost of tokens in the docs is pretty much a worthless metric for these models. The only way to go is to plug it in and test it. My experience is that Claude is an expert at wasting tokens on nonsense. Easily 5x up on output tokens compared to ChatGPT, and then consider that Claude wastes about 2-3x more tokens by default.

windexh8er|3 months ago

This is spot on. The amount of wasteful output tokens from Claude is crazy. The actual output you're looking for might be better, but you're definitely going to pay for it in the long run.

The other angle here is that it's very easy to waste a ton of time and tokens with cheap models. Or you can more slowly dig yourself a hole with the SOTA models. But either way, and even with 1M tokens of context - things spiral at some point. It's just a question of whether you can get off the tracks with a working widget. It's always frustrating to know that "resetting" the environment is just handing over some free tokens to [model-provider-here] to recontextualize itself. I feel like it's the ultimate Office Space hack, likely unintentional, but really helps drive home the point of how unreliable all these offerings are.

Scene_Cast2|3 months ago

Still way pricier (>2x) than Gemini 3 and Grok 4. I've noticed that the latter two also perform better than Opus 4, so I've stopped using Opus.

pants2|3 months ago

Don't be so sure - while I haven't tested Opus 4.5 yet, Gemini 3 tends to use way more tokens than Sonnet 4.5. Like 5-10X more. So Gemini might end up being more expensive in practice.

wolttam|3 months ago

It's 1/3 the old price ($15/$75)

brookst|3 months ago

Not sure if that’s a joke about LLM math performance, but pedantry requires me to point out 15 / 75 = 1/5

tom_m|3 months ago

It was already viable pricing before. You have to remember this is for business use. Many companies will pay 20% on top of an engineer's salary to have them be 200% as effective. Right?

I am truthfully surprised they dropped pricing. They don't really need to. The demand is quite high. This is all pretty much gatekeeping too (with the high pricing, across all providers). AI for coding can be expensive and companies want it to be because money is their edge. Funny because this is the same for the AI providers too. He who had the most GPUs, right?

resonious|3 months ago

Just on Claude Code, I didn't notice any performance difference from Sonnet 4.5 but if it's cheaper then that's pretty big! And it kinda confuses the original idea that Sonnet is the well rounded middle option and Opus is the sophisticated high end option.

jstummbillig|3 months ago

It does, but it also maps to the human world: Tokens/Time cost money. If either is well spent, then you save money. Thus, paying an expert ends up costing less than hiring a novice, who might cost less per hour, but takes more hours to complete the task, if they can do it at all.

It's both kinda neat and irritating, how many parallels there are between this AI paradigm and what we do.

burgerone|3 months ago

Using AI in production is no doubt an enormous security risk...

laterium|3 months ago

Where's the argument? Or we're just asserting things?

delaminator|3 months ago

Not all production systems process untrusted input.

irthomasthomas|3 months ago

It's about double the speed of 4.1, too: ~60 t/s vs ~30 t/s. I wish it were open-weights so we could discuss the architectural changes.

RestartKernel|3 months ago

> [...] that's legitimately significant for anyone deploying agents with tool access.

I disagree, even if only because your model shouldn't have more access than any other front-end.

antihero|3 months ago

Also it's really really good. Scarily good tbh. It's making PRs that work and aren't slop-filled and it figures out problems and traces through things in a way a competent engineer would rather than just fucking about.

zwnow|3 months ago

Why do all these comments sound like a sales pitch? Everytime some new bullshit model is released there are hundreds of comments like this one, pointing out 2 features talking about how huge all of this is. It isn't.