top | item 45598388


rbitar | 4 months ago

Where do you get the 220 tokens/second from? Genuinely curious, as that would be very impressive for a model comparable to Sonnet 4. OpenRouter is currently publishing around 116 tps [1].

[1] https://openrouter.ai/anthropic/claude-haiku-4.5


Topfi|4 months ago

Was just about to post that Haiku 4.5 does something I have never encountered before [0]: there is a massive delta in token/sec depending on the query. Some variance, including task-specific variance, is of course nothing new, but it has never been as pronounced and reproducible as here.

A few examples, prompted at UTC 21:30-23:00 via T3 Chat [0]:

Prompt 1 — 120.65 token/sec — https://t3.chat/share/tgqp1dr0la

Prompt 2 — 118.58 token/sec — https://t3.chat/share/86d93w093a

Prompt 3 — 203.20 token/sec — https://t3.chat/share/h39nct9fp5

Prompt 4 — 91.43 token/sec — https://t3.chat/share/mqu1edzffq

Prompt 5 — 167.66 token/sec — https://t3.chat/share/gingktrf2m

Prompt 6 — 161.51 token/sec — https://t3.chat/share/qg6uxkdgy0

Prompt 7 — 168.11 token/sec — https://t3.chat/share/qiutu67ebc

Prompt 8 — 203.68 token/sec — https://t3.chat/share/zziplhpw0d

Prompt 9 — 102.86 token/sec — https://t3.chat/share/s3hldh5nxs

Prompt 10 — 174.66 token/sec — https://t3.chat/share/dyyfyc458m

Prompt 11 — 199.07 token/sec — https://t3.chat/share/7t29sx87cd

Prompt 12 — 82.13 token/sec — https://t3.chat/share/5ati3nvvdx

Prompt 13 — 94.96 token/sec — https://t3.chat/share/q3ig7k117z

Prompt 14 — 190.02 token/sec — https://t3.chat/share/hp5kjeujy7

Prompt 15 — 190.16 token/sec — https://t3.chat/share/77vs6yxcfa

Prompt 16 — 92.45 token/sec — https://t3.chat/share/i0qrsvp29i

Prompt 17 — 190.26 token/sec — https://t3.chat/share/berx0aq3qo

Prompt 18 — 187.31 token/sec — https://t3.chat/share/0wyuk0zzfc

Prompt 19 — 204.31 token/sec — https://t3.chat/share/6vuawveaqu

Prompt 20 — 135.55 token/sec — https://t3.chat/share/b0a11i4gfq

Prompt 21 — 208.97 token/sec — https://t3.chat/share/al54aha9zk

Prompt 22 — 188.07 token/sec — https://t3.chat/share/wu3k8q67qc

Prompt 23 — 198.17 token/sec — https://t3.chat/share/0bt1qrynve

Prompt 24 — 196.25 token/sec — https://t3.chat/share/nhnmp0hlc5

Prompt 25 — 185.09 token/sec — https://t3.chat/share/ifh6j4d8t5

I ran each prompt three times and got the same token/sec result for the respective prompt each time (within expected variance, meaning less than ±5%). Each run used Claude Haiku 4.5 with "High reasoning". Will continue testing, but this is beyond odd. I will add that my very early evals leaned heavily into pure code output, where 200 token/sec is consistently possible at the moment, but it is certainly not the average as I claimed before; there I was mistaken. That being said, even across a wider range of challenges, we are above 160 token/sec on average, and if you focus solely on coding, whether Rust or React, Haiku 4.5 is very swift.

[0] I normally don't use T3 Chat for evals; it is just easier to share prompts this way, though I was disappointed to find that the model information (token/sec, TTF, etc.) can't be enabled without an account. Also, these aren't the prompts I usually use for evals. Those I try to keep somewhat out of training data by only using paid-for API access for benchmarks. As anything on Hacker News is most assuredly part of model training, I decided to write some quick and dirty prompts to highlight what I have been seeing.
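To put the "above 160 token/sec" figure in context, here is a quick sketch that summarizes the 25 measurements listed above. The numbers are copied from the comment as-is; the 140 token/sec cutoff used to count "slow" prompts is an arbitrary threshold picked for illustration, not something from the thread.

```python
from statistics import mean, median

# Token/sec per prompt, copied from the 25 results listed above (prompt 1..25).
rates = [
    120.65, 118.58, 203.20, 91.43, 167.66,
    161.51, 168.11, 203.68, 102.86, 174.66,
    199.07, 82.13, 94.96, 190.02, 190.16,
    92.45, 190.26, 187.31, 204.31, 135.55,
    208.97, 188.07, 198.17, 196.25, 185.09,
]

avg = mean(rates)
med = median(rates)
spread = max(rates) / min(rates)  # fastest prompt relative to slowest

# Assumption: 140 token/sec is an arbitrary cutoff for a "slow" prompt.
SLOW_CUTOFF = 140.0
slow = [r for r in rates if r < SLOW_CUTOFF]

print(f"mean:   {avg:.2f} token/sec")
print(f"median: {med:.2f} token/sec")
print(f"min:    {min(rates):.2f}, max: {max(rates):.2f}, ratio: {spread:.2f}x")
print(f"prompts below {SLOW_CUTOFF:.0f} token/sec: {len(slow)} of {len(rates)}")
```

The mean lands a little above 160 token/sec, which matches the claim in the comment, while the median is noticeably higher because the slow prompts form their own cluster rather than a smooth tail.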

rbitar|4 months ago

Interesting, and if they are using speculative decoding, that variance would make sense. Also, your numbers line up with what OpenRouter is now publishing: 169.1 tps [1].

Anthropic mentioned this model is more than twice as fast as Claude Sonnet 4 [2], which OpenRouter averaged at 61.72 tps [3]. If these numbers hold, we're really looking at an almost 3x improvement in throughput and less than half the initial latency.

[1] https://openrouter.ai/anthropic/claude-haiku-4.5

[2] https://www.anthropic.com/news/claude-haiku-4-5

[3] https://openrouter.ai/anthropic/claude-sonnet-4
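For reference, the "almost 3x" figure can be checked directly from the two OpenRouter numbers quoted above. This is just the arithmetic, and it assumes both averages were measured in a comparable way:

```python
# Throughput figures quoted above from OpenRouter (tokens per second).
haiku_tps = 169.1    # Claude Haiku 4.5
sonnet_tps = 61.72   # Claude Sonnet 4

speedup = haiku_tps / sonnet_tps
print(f"throughput ratio: {speedup:.2f}x")
```

The ratio works out to roughly 2.7x, so "almost 3x" is a fair rounding and comfortably above Anthropic's "more than twice as fast".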

cromulen|4 months ago

That's what you get when you use speculative decoding and focus / overfit the draft model on coding. Then, when the answer is out of distribution for the draft model, the main model rejects more of the drafted tokens and throughput suffers. This probably still makes sense for them if they expect a lot of their load to come from Claude Code and they need to make it economical.
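A rough sketch of why the acceptance rate matters so much. Nothing in this thread confirms that Anthropic actually uses speculative decoding, so treat this purely as an illustration of the mechanism. It uses the standard simplified model in which each drafted token is accepted independently with probability `alpha` and the draft model proposes `gamma` tokens per pass; under that assumption the expected number of tokens emitted per main-model pass is (1 - alpha^(gamma+1)) / (1 - alpha). The `alpha` and `gamma` values below are made up for illustration, and the cost of running the draft model is ignored.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per verification pass of the main model.

    alpha: probability the main model accepts a single drafted token
           (simplifying assumption: independent and equal at every position).
    gamma: number of tokens the draft model proposes per pass.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft token is accepted, plus one token from the main model.
        return float(gamma + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^gamma
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    GAMMA = 4  # illustrative draft length, not a known production setting
    # Illustrative acceptance rates: high for in-distribution output (e.g. code),
    # lower when the draft model is out of distribution.
    for label, alpha in [("in-distribution", 0.8), ("out-of-distribution", 0.5)]:
        tokens = expected_tokens_per_step(alpha, GAMMA)
        print(f"{label:>20}: alpha={alpha:.1f} -> {tokens:.2f} tokens per pass")
```

Under these assumed numbers, the in-distribution case yields noticeably more tokens per main-model pass than the out-of-distribution case, which is the kind of prompt-dependent gap described above. The real gap would also depend on draft-model cost and batching, which this sketch leaves out.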