(no title)
llamasushi|3 months ago
Also notable: they're claiming SOTA prompt injection resistance. The industry has largely given up on solving this problem through training alone, so if the numbers in the system card hold up under adversarial testing, that's legitimately significant for anyone deploying agents with tool access.
The "most aligned model" framing is doing a lot of heavy lifting though. Would love to see third-party red team results.
tekacs|3 months ago
> For Claude and Claude Code users with access to Opus 4.5, we’ve removed Opus-specific caps. For Max and Team Premium users, we’ve increased overall usage limits, meaning you’ll have roughly the same number of Opus tokens as you previously had with Sonnet. We’re updating usage limits to make sure you’re able to use Opus 4.5 for daily work.
sqs|3 months ago
Here are some early rough numbers from our own internal usage on the Amp team (avg cost $ per thread):
- Sonnet 4.5: $1.83
- Opus 4.5: $1.30 (earlier checkpoint last week was $1.55)
- Gemini 3 Pro: $1.21
Cost per token is not the right way to look at this. A bit more intelligence means mistakes (and wasted tokens) avoided.
localhost|3 months ago
Much better to look at cost per task - and good to see some benchmarks reporting this now.
leo_e|3 months ago
If a cheaper model hallucinates halfway through a multi-step agent workflow, I burn more tokens on verification and error correction loops than if I just used the smart model upfront. 'Cost per successful task' is the only metric that matters in production.
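Back-of-envelope version of that: expected attempts until success is 1/p (geometric), so expected spend per successful task is cost-per-attempt divided by success rate. The $1.30 is the Opus figure from the Amp numbers upthread; the cheap-model cost and both success rates are made up for illustration:

    def cost_per_success(cost_per_attempt: float, p_success: float) -> float:
        # Expected number of attempts until success is 1/p (geometric).
        return cost_per_attempt / p_success

    cheap = cost_per_success(0.90, 0.50)  # $1.80 per success despite cheap attempts
    smart = cost_per_success(1.30, 0.90)  # ~$1.44 per success
    print(f"cheap: ${cheap:.2f}, smart: ${smart:.2f}")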
andai|3 months ago
ArtificialAnalysis has an "intelligence per token" metric on which all of Anthropic's models are outliers.
For some reason, they need way fewer output tokens than everyone else's models to pass the benchmarks.
(There are of course many issues with benchmarks, but I thought that was really interesting.)
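If you want to eyeball that normalization yourself, it's just benchmark score over the output tokens spent earning it. The figures below are made up, not ArtificialAnalysis data:

    # Score points per million output tokens; illustrative numbers only.
    runs = {
        "model_a": {"score": 62.0, "output_mtok": 48.0},
        "model_b": {"score": 60.0, "output_mtok": 21.0},
    }
    for name, r in runs.items():
        print(f"{name}: {r['score'] / r['output_mtok']:.2f} pts per M output tokens")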
sharkjacobs|3 months ago
I'll be curious to see how performance compares to Opus 4.1 on the kind of tasks and metrics they're not explicitly targeting, e.g. eqbench.com
cootsnuck|3 months ago
We know the big labs are chasing efficiency gains where they can.
shepherdjerred|3 months ago
I don't love the idea of knowledge being restricted... but I also think these tools could result in harm to others in the wrong hands
wkat4242|3 months ago
And the prudishness of American models in particular is awful. They're really hard to use in Europe because they keep clamming up about things we consider normal.
NiloCK|3 months ago
Ye best start believing in silly sci-fi stories. Yer in one.
narrator|3 months ago
https://x.com/elder_plinius/status/1993089311995314564
cmrdporcupine|3 months ago
"To give you room to try out our new model, we've updated usage limits for Claude Code users."
That really implies non-permanence.
windexh8er|3 months ago
The other angle here is that it's very easy to waste a ton of time and tokens with cheap models. Or you can more slowly dig yourself into a hole with the SOTA models. Either way, even with 1M tokens of context, things spiral at some point. It's just a question of whether you can get off the tracks with a working widget first. It's always frustrating to know that "resetting" the environment is just handing some free tokens to [model-provider-here] so it can recontextualize itself. I feel like it's the ultimate Office Space hack, likely unintentional, but it really drives home how unreliable all these offerings are.
tom_m|3 months ago
I am truthfully surprised they dropped pricing. They don't really need to; demand is quite high. The high pricing is pretty much gatekeeping too, across all providers. AI for coding can be expensive, and companies want it to be, because money is their edge. Funny, because the same is true for the AI providers themselves. He who has the most GPUs, right?
jstummbillig|3 months ago
It's both kinda neat and irritating, how many parallels there are between this AI paradigm and what we do.
RestartKernel|3 months ago
I disagree, even if only because your model shouldn't have more access than any other front-end.
consumer451|3 months ago
> Claude Opus 4.5 in Windsurf for 2x credits (instead of 20x for Opus 4.1)
https://old.reddit.com/r/windsurf/comments/1p5qcus/claude_op...
At the risk of sounding like a shill, in my personal experience, Windsurf is somehow still the best deal for an agentic VSCode fork.