sqs | 3 months ago

What's super interesting is that Opus is cheaper all-in than Sonnet for many usage patterns.

Here are some early rough numbers from our own internal usage on the Amp team (avg cost $ per thread):

- Sonnet 4.5: $1.83

- Opus 4.5: $1.30 (earlier checkpoint last week was $1.55)

- Gemini 3 Pro: $1.21

Cost per token is not the right way to look at this. A bit more intelligence means mistakes (and wasted tokens) avoided.

localhost | 3 months ago

Totally agree with this. I have seen many cases where a dumber model gets trapped in a local minimum and burns a ton of tokens trying to escape (sometimes unsuccessfully). In a toy example (a 30-minute agentic coding session: build a Markdown-to-HTML compiler, using a subset of the CommonMark test suite to hill-climb on), dumber models would cost $18 (at retail token prices) to complete the task, while smarter models would see the trap and complete it for only $3. YMMV.

Much better to look at cost per task - and good to see some benchmarks reporting this now.

IgorPartola | 3 months ago

For me this is sub-agent usage. If I ask Claude Code to use 1-3 subagents for a task, the 5-hour limit is gone in one or two rounds, and the weekly limit shortly after. They just keep producing more and more documentation about each intermediate step to talk to each other, no matter how I edit the sub-agent definitions.

leo_e | 3 months ago

Hard agree. The hidden cost of 'cheap' models is the complexity of the retry logic you have to write around them.

If a cheaper model hallucinates halfway through a multi-step agent workflow, I burn more tokens on verification and error correction loops than if I just used the smart model upfront. 'Cost per successful task' is the only metric that matters in production.
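The "cost per successful task" framing can be sketched in a few lines. The `cost_per_success` helper and all the dollar figures below are made up for illustration, not real model pricing:

```python
# Hypothetical sketch of "cost per successful task": failed runs still
# cost money, so a cheap-but-flaky model can be pricier per success.
# All figures are illustrative, not real pricing.

def cost_per_success(runs):
    """Average $ spent per successful task, counting failed attempts too."""
    total = sum(cost for cost, ok in runs)
    wins = sum(1 for _, ok in runs if ok)
    return total / wins if wins else float("inf")

# A model that is 3x cheaper per run but fails half the time...
cheap = [(0.40, i % 2 == 0) for i in range(10)]   # 10 runs, 5 succeed
# ...vs a pricier model that succeeds every time.
smart = [(1.20, True) for _ in range(10)]          # 10 runs, 10 succeed

print(round(cost_per_success(cheap), 2))  # 0.8
print(round(cost_per_success(smart), 2))  # 1.2
```

Once retries and verification loops are charged to the model that caused them, the per-token price advantage of the flaky model shrinks or disappears entirely.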

andai | 3 months ago

Yeah, that's a great point.

ArtificialAnalysis has an "intelligence per token" metric on which all of Anthropic's models are outliers.

For some reason, they need far fewer output tokens than everyone else's models to pass the benchmarks.

(There are of course many issues with benchmarks, but I thought that was really interesting.)

tmaly | 3 months ago

What is the typical usage pattern that would result in these cost figures?

sqs | 3 months ago

Using small threads (see https://ampcode.com/@sqs for some of my public threads).

If you use very long threads and treat it as a long-and-winding conversation, you will get worse results and pay a lot more.