top | item 44801063

sadiq | 6 months ago

Looks like Groq (at 1k+ tokens/second) and Fireworks are already live on openrouter: https://openrouter.ai/openai/gpt-oss-120b

$0.15/M in / $0.60-0.75/M out

edit: Now Cerebras too, at 3,815 tps for $0.25/M in / $0.69/M out.

podnami|6 months ago

Wow, this was actually blazing fast. I prompted: "How can the 45th and 47th presidents of America share the same parents?"

On ChatGPT.com, o3 thought for 13 seconds; on OpenRouter, GPT-OSS-120B thought for 0.7 seconds, and they both had the correct answer.

swores|6 months ago

I'm not sure that's a particularly good question for concluding something positive about the "thought for 0.7 seconds" - it's such a simple answer, ChatGPT 4o (with no thinking time) immediately answered correctly. The only surprising thing in your test is that o3 wasted 13 seconds thinking about it.

nisegami|6 months ago

Interesting choice of prompt. None of the local models I have in Ollama (consumer mid-range GPU) were able to get it right.

golergka|6 months ago

When I pay attention to o3 CoT, I notice it spends a few passes thinking about my system prompt. Hard to imagine this question is hard enough to spend 13 seconds on.

Imustaskforhelp|6 months ago

Not gonna lie but I got sorta goosebumps

I am not kidding but such progress from a technological point of view is just fascinating!

xpe|6 months ago

How many people are discussing this after one person did 1 prompt with 1 data point for each model and wrote a comment?

What is being measured here? For one model, the end-to-end time is:

t_total = t_network + t_queue + t_batch_wait + t_inference + t_service_overhead
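The decomposition above can be sketched as a quick back-of-the-envelope calculation; all component timings here are hypothetical placeholders, not measurements of any real provider:

```python
# Hypothetical latency components (seconds) for a single request.
# Values are illustrative only, not measurements.
components = {
    "t_network": 0.05,           # round-trip to the API endpoint
    "t_queue": 0.10,             # waiting for a free serving slot
    "t_batch_wait": 0.02,        # waiting for the batch to fill
    "t_inference": 0.70,         # actual token generation
    "t_service_overhead": 0.03,  # auth, routing, logging, etc.
}

t_total = sum(components.values())
print(f"t_total = {t_total:.2f}s")
```

The point being that a reported "thought for 0.7 seconds" typically covers only the inference term, not the full end-to-end time a user experiences.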

tekacs|6 months ago

I apologize for linking to Twitter, but I can't post a video here, so:

https://x.com/tekacs/status/1952788922666205615

Asking it about a marginally more complex tech topic and getting an excellent answer in ~4 seconds, reasoning for 1.1 seconds...

I am _very_ curious to see what GPT-5 turns out to be, because unless they're running on custom silicon / accelerators, even if it's very smart, it seems hard to justify not using these open models on Groq/Cerebras for a _huge_ fraction of use-cases.

tekacs|6 months ago

A few days ago I posted a slowed-down version of the video demo on someone's repo because it was unreadably fast due to being sped up.

https://news.ycombinator.com/item?id=44738004

... today, this is a real-time video of the OSS thinking models by OpenAI on Groq and I'd have to slow it down to be able to read it. Wild.

sigmar|6 months ago

Non-rhetorically: why would anyone pay for the o3 API now that they can get this open model from OpenAI served for cheaper? Interesting dynamic... will they drop o3 pricing next week (it's currently 10-20x the cost[1])?

[1] currently $3/M in / $8/M out: https://platform.openai.com/docs/pricing
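As a sanity check, the 10-20x figure follows from the per-million-token prices quoted in this thread (using the Cerebras rate of $0.25/M in, $0.69/M out from the top comment; Groq's $0.15/M input rate would put the input ratio at the 20x end):

```python
# Per-million-token prices (USD) quoted in the thread.
o3 = {"in": 3.00, "out": 8.00}
gpt_oss_cerebras = {"in": 0.25, "out": 0.69}

ratio_in = o3["in"] / gpt_oss_cerebras["in"]     # 12x on input
ratio_out = o3["out"] / gpt_oss_cerebras["out"]  # ~11.6x on output
print(f"input: {ratio_in:.1f}x, output: {ratio_out:.1f}x")
```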

gnulinux|6 months ago

Not even that: even if o3 being marginally better is important for your task, why would anyone use o4-mini? It seems to be almost 10x the price for the same performance (maybe even less): https://openrouter.ai/openai/o4-mini

gnulinux|6 months ago

Wow, that's significantly cheaper than o4-mini, which seems to be on par with gpt-oss-120b ($1.10/M input tokens, $4.40/M output tokens). Almost 10x the price.

LLMs are getting cheaper much faster than I anticipated. I'm curious whether it's still the hype cycle and Groq/Fireworks/Cerebras are taking a loss here, or whether things are actually getting cheaper. At this rate we'll be able to run Qwen3-32B-level models on phones/embedded devices soon.

tempaccount420|6 months ago

It's funny, because I was thinking the opposite: the pricing seems way too high for a model with only ~5B active parameters.

mikepurvis|6 months ago

Are the prices staying aligned to the fundamentals (hardware, energy), or is this a VC-funded land grab pushing prices to the bottom?

spott|6 months ago

It is interesting that OpenAI isn't offering any inference for these models itself.

bangaladore|6 months ago

Makes sense to me. Inference on these models will be a race to the bottom, so hosting inference themselves would be a waste of compute/dollars for them.

modeless|6 months ago

I really want to try coding with this at 2,600 tokens/s (from Cerebras). Imagine generating thousands of lines of code as fast as you can prompt. If it doesn't work, who cares? Generate another thousand and try again! And at $0.69/M tokens it would only cost about $6.50 an hour.
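The hourly figure is a straightforward back-of-the-envelope calculation, assuming sustained full-speed generation (real sessions would spend much of the hour idle between prompts):

```python
# Throughput and output price quoted above (Cerebras, gpt-oss-120b).
tokens_per_second = 2600
price_per_million = 0.69  # USD per million output tokens

tokens_per_hour = tokens_per_second * 3600  # 9.36M tokens
cost_per_hour = tokens_per_hour / 1e6 * price_per_million
print(f"${cost_per_hour:.2f}/hour")  # ~$6.46, i.e. roughly the $6.50 quoted
```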

andai|6 months ago

I tried this (gpt-oss-120b on Cerebras) with Roo Code. It repeatedly failed to use the tools correctly, and then I got 429 Too Many Requests. So much for the "as fast as I can think" idea!

I'll have to try again later but it was a bit underwhelming.

The latency also seemed pretty high, not sure why. I think with that latency, the throughput ends up not making much difference.

Btw Groq has the 20b model at 4000 TPS but I haven't tried that one.