Tried this out with Cline using my own API key (Cerebras is also available as a provider for Qwen3 Coder via OpenRouter here: https://openrouter.ai/qwen/qwen3-coder) and realized that without caching, this becomes very expensive very quickly. Specifically, after each new tool call, you're sending the entire previous message history as input tokens, which are priced at $2/1M via the API, just like output tokens.
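A quick back-of-envelope shows how that resend cost compounds roughly quadratically. The per-turn token count here is an assumption; the $2/1M rate is the one quoted above:

```python
# Without caching, each agent turn resends the whole history, so input
# tokens grow roughly quadratically with the number of tool calls.
PRICE_PER_M = 2.00       # $ per 1M input tokens (rate quoted above)
TOKENS_PER_TURN = 2_000  # assumed new tokens added per tool call

def input_cost(turns: int) -> float:
    """Total $ spent on input tokens after `turns` tool calls."""
    total_input = sum(t * TOKENS_PER_TURN for t in range(1, turns + 1))
    return total_input * PRICE_PER_M / 1_000_000

for n in (10, 50, 100):
    print(f"{n:>3} turns -> ${input_cost(n):.2f} in input tokens")
```

Ten turns cost cents; a hundred turns cost tens of dollars, which matches the "expensive very quickly" experience.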
The quality is also not quite what Claude Code gave me, but the speed is definitely way faster. If Cerebras supported caching & reduced token pricing for using the cache I think I would run this more, but right now it's too expensive per agent run.
Adding entire files into the context window and letting the AI sift through it is a very wasteful approach.
It was adopted because trying to generate diffs with AI opens a whole new can of worms, but there's a very efficient approach in between: slice the files on the symbol level.
So if the AI only needs the declaration of foo() and the definition of bar(), the entire file can be collapsed like this:
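To illustrate, a hypothetical Python sketch of symbol-level collapsing, where only the requested symbols keep their bodies:

```python
# Hypothetical sketch of symbol-level slicing: symbols the model needs
# keep their full source; everything else collapses to a one-line stub.
import ast

def collapse(source: str, keep: set[str]) -> str:
    """Return source with every top-level function not in `keep`
    reduced to its signature, so the model sees a 'declaration' only."""
    out = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef) and node.name not in keep:
            args = ", ".join(a.arg for a in node.args.args)
            out.append(f"def {node.name}({args}): ...  # body collapsed")
        else:
            out.append(ast.get_source_segment(source, node))
    return "\n".join(out)

src = """\
def foo(x):
    return x * 2

def bar(y):
    return foo(y) + 1
"""
print(collapse(src, keep={"bar"}))
```

Here `foo` collapses to its signature (a "declaration") while `bar` stays intact, so the model gets the context it needs at a fraction of the tokens.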
Any AI-suggested changes are then easy to merge back (renamings are the only notable exception), so it works really fast.
I am currently working on an editor that combines this approach with the ability to step back-and-forth between the edits, and it works really well. I absolutely love the Cerebras platform (they have a free tier directly and pay-as-you-go offering via OpenRouter). It can get very annoying refactorings done in one or two seconds based on single-sentence prompts, and it usually costs about half a cent per refactoring in tokens. Also great for things like applying known algorithms to spread out data structures, where including all files would kill the context window, but pulling individual types works just fine with a fraction of tokens.
Does caching make as much sense as a cost saving measure on Cerebras hardware as it does on mainstream GPU's? Caching should be preferred if SSD->VRAM is dramatically cheaper than recalculation. If Cerebras is optimized for massively parallel compute with fixed weights, and not a lot of memory bandwidth into or out of the big wafer, it might actually make sense to price per token without a caching discount. Could someone from the company (or otherwise familiar with it) comment on the tradeoff?
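One way to frame the question as arithmetic: caching pays off when streaming stored KV state back in is cheaper than recomputing the prefill. All constants below are made-up placeholders, not real Cerebras or GPU figures:

```python
# Toy model of cache-vs-recompute. All constants are illustrative
# assumptions, NOT real Cerebras or GPU specs.
def recompute_s(prefix_tokens: int, prefill_tok_per_s: float) -> float:
    """Time to rebuild the KV state by re-running prefill."""
    return prefix_tokens / prefill_tok_per_s

def cache_load_s(prefix_tokens: int, kv_bytes_per_tok: float,
                 link_bytes_per_s: float) -> float:
    """Time to stream a stored KV cache back into the accelerator."""
    return prefix_tokens * kv_bytes_per_tok / link_bytes_per_s

prefix = 100_000  # hypothetical 100k-token conversation prefix
print(recompute_s(prefix, 50_000))          # at 50k tok/s prefill
print(cache_load_s(prefix, 300_000, 50e9))  # 300 KB/token over 50 GB/s
```

If prefill is fast enough, or the link into the wafer slow enough, recompute wins, which would be consistent with pricing tokens without a cache discount.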
1,000 messages per day should be plenty as a daily development driver. I use Claude Code with Sonnet 4 exclusively and I do not send more than 1,000 messages per day. That said, this is only my current understanding; I am certainly not pressing enter 1,000 times! Maybe there are more messages being sent under the hood than I realize?
We're just doing usage-based pricing for our ai devtools product because it's the only way to square the circle of "as much access to an expensive thing as you want, at a reasonable price".
It's harder to set up, lends itself to lower margins, and consumers generally do prefer more predictable/simpler pricing, but so many ai devtools products have pissed their users off by throttling their "unlimited"/plan-based pricing that I think it's now seen as a yellow flag
Some users who signed up for Pro ($50 p.m.) are reporting stricter limitations than those advertised.
>While they advertise a 1,000-request limit, the actual daily constraint is a 7.5 million-token limit. [1]
That assumes an average of 7.5k tokens/request, whereas their marketing videos show API requests ballooning by ~24k tokens each. Still cheaper than the API price.
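For what it's worth, the arithmetic behind that (numbers from the thread):

```python
# Sanity-checking the quoted limits.
daily_token_cap = 7_500_000
advertised_requests = 1_000

# Average tokens/request needed to actually use all 1,000 requests:
print(daily_token_cap / advertised_requests)  # 7500.0

# At the ~24k tokens/request seen in the marketing demos,
# the token cap binds long before the request cap does:
requests_at_24k = daily_token_cap // 24_000
print(requests_at_24k)  # 312
```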
Bait and switched their FAQ after the fact too. Come on Cerebras, it’s only VC money you’re burning here in the first place, let’s see some commitment to winning market share. :money: :fire:
Had a similar experience. I got rate limited as well, even when I was well below 1M tokens. When it's working, it's nice, but I can't use it as a replacement for Cursor until higher rate limits are granted.
If you would like to try this in a coding agent (we find the qwen3-coder model works really well in agents!), we have been experimenting with Cerebras Code in Sketch. We just pushed support, so you can run it with the latest version, 0.0.33:
2k tokens/second is insane. While I'm very much against vibe coding, such performance essentially means you can get near GitHub Copilot-level speed with drastically better quality.
I really wish Qwen3 folks put up an Anthropic-compatible API like the Kimi and GLM/Zai folks cleverly did — this makes their models trivially usable in Claude Code, via this dead-simple setup:
API Error: 422 {"error":{"message":"Error from provider: {\"message\":\"body.messages.0.system.content: Input should be a valid string\",\"type\":\"invalid_request_error\",\"param\":\"validation_error\",\"code\":\"wrong_api_format\"}
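That 422 looks like the provider wants `content` as a plain string while the Anthropic-style request sends a list of content blocks. A hedged sketch of a shim (the exact request shape here is an assumption):

```python
# Sketch of a shim for the 422 above: flatten Anthropic-style content
# blocks (a list of {"type": "text", "text": ...}) into the plain
# string this provider apparently expects.
def flatten_content(content):
    if isinstance(content, str):
        return content  # already a plain string, pass through
    return "\n".join(block.get("text", "")
                     for block in content if block.get("type") == "text")

req = {"messages": [
    {"role": "system",
     "content": [{"type": "text", "text": "You are a coding assistant."}]},
    {"role": "user", "content": "hi"},
]}
for m in req["messages"]:
    m["content"] = flatten_content(m["content"])

print(req["messages"][0]["content"])  # You are a coding assistant.
```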
I'm so excited to see a real competitor to Claude Code! Gemini CLI, while decent, does not have a $200/month pricing model and they charge per API access - Codex is the same. I'm trying to get into the https://cloud.cerebras.ai/ to try the $50/month plan but I can't even get in.
Unless I'm misunderstanding something, Cerebras Code is not equivalent to Claude Code or Gemini CLI. It's a strange name for a subscription to access an API endpoint.
You take your Cerebras Code endpoint and configure XYZ CLI tool or IDE plugin to point at it.
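Concretely, with any OpenAI-compatible client that's just a base-URL swap. A sketch using the official `openai` Python package; the endpoint URL and model id below are assumptions, so check your Cerebras dashboard for the real values:

```python
# Pointing the official openai client at a Cerebras endpoint. The base
# URL and model id are placeholders from this sketch, not verified values.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",  # assumed endpoint
    api_key=os.environ["CEREBRAS_API_KEY"],
)

resp = client.chat.completions.create(
    model="qwen-3-coder-480b",  # assumed model id
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
)
print(resp.choices[0].message.content)
```

IDE plugins like Cline or RooCode take the same two values (base URL and key) in their provider settings.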
This token throughput is incredible and is going to set a new bar in the industry. The main issue with the Cerebras Code plan is that the number of requests/minute is throttled, and with agentic coding systems each tool call is treated as a new "message", so you can easily hit the API limits (10 messages/minute).
One workaround we're doing now that seems to work is to use Claude for all tasks but delegate specific tools to the cerebras/qwen-3-coder-480b model for generating files and other token-heavy tasks, to avoid spiking the total number of requests. This has cost and latency consequences (and adds complexity to the code), but until those throttle limits are lifted it seems to be a good combo. I also find that Claude has better quality at tool selection when the number of tools required is > 15, which our current setup has.
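The delegation logic doesn't need to be fancy; a minimal sketch of the routing described above (tool names and model ids are illustrative, not from any real config):

```python
# Minimal sketch of the delegation pattern: send token-heavy tool calls
# to the fast/cheap model, keep everything else on Claude.
HEAVY_TOOLS = {"generate_file", "write_docs", "fix_lint"}

def pick_model(tool_name: str, est_tokens: int) -> str:
    if tool_name in HEAVY_TOOLS or est_tokens > 8_000:
        return "cerebras/qwen-3-coder-480b"  # fast, cheap per token
    return "claude-sonnet-4"                 # better at tool selection

print(pick_model("generate_file", 500))  # cerebras/qwen-3-coder-480b
print(pick_model("search_code", 1_200))  # claude-sonnet-4
```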
My understanding is that the coding agents people use can be modified to plug into any LLM provider's API?
The difference here seems to be that Cerebras does not appear to have Qwen3-Coder through their API! So now there is a crazy fast (and apparently good too?) model that they only provide if you pay the crazy monthly sub?
The way I would use this $50 Cerebras offering is as a delegate for high-token-count items like documentation, lint fixing, and other operations, as a way not only to speed up the workflow but to release some back pressure on Anthropic/Claude so you don't hit your limits as quickly, especially with the new weekly throttle coming. This $50 jump seems very reasonable; now, for the 1k completions a day, I'd really want to get a feel for how chatty it is.
I suppose that's how it starts, but if the model is competent and fast, the speed alone might push you to delegate more to it (maybe sub-agent tasks).
You can still get it pay-as-you-go on OpenRouter, afaict, and the billing section of the Cerebras Cloud account I just created has a section for Qwen3-Coder-480B as well.
Not really. With Opus 4 you will burn into the thousands a month with serious usage. I tested it yesterday and 5 hours of use was $60. Extrapolating from that, you will easily hit $1K+.
The usage limits on Cerebras Code are rather tight; the $50 plan apparently gives you 7.5 million tokens per day, which doesn't last long. This also isn't clearly advertised on the plans prior to purchasing.
Anyone get this working in Cursor? I can connect OpenRouter just fine, but Cerebras just errors out instantly. The same URL/key works via curl, so it's some sort of Cerebras/Cursor compatibility issue.
While I’m also curious, I’m fine with having a mostly inferior alternative too. This is a dynamic market with some big players already; having more options is beneficial. If only as a way to prevent others from doing a rug pull.
I've been waiting on this for a LONG time. Integration with Cursor when Cerebras released their earlier models was patchy at best, even through OpenRouter. It's nice to finally see official support, although I'm a bit worried that, long term, the time spent on bash/MCP calls will end up dominating.
Still, definitely the right direction!
EDIT: doesn't seem like anything but a first-party api with a monthly plan.
I'm finding myself switching between subscriptions to ChatGPT, T3 Chat, DeepSeek, Claude Code etc. Their subscription models aren't compatible with making it easy to take your data with you. I wish I could try this out and import all my data.
If they can maintain this pricing level, and if Qwen3‑Coder is as good as people say then they will have an enormous hit on their hands. A massive money losing hit, but a hit.
Very interesting!
PS: Did they reduce the context window? It looks like it.
For the $200 plan, there is a 40M-token cap per day, so assuming API pricing, the max usage is $12/day, or $360 per month (assuming the user maxes out usage every day and doesn't hit the 1,000-message limit first).
Relatively standard subscription pricing vs. API pricing; I believe they are making money on this and counting on people comparing it to Claude Code, which is a much more generous offer.
So for <$1.7/day I can hire a programmer at a sort-of Claude Sonnet 4 level? I know it's got its quirks, limits, and needs supervision, but it's like 20x cheaper than an average programmer.
How does context buildup work for these code-generating machines, generally?
Do the programs just use human notes + current code directly? Are there specific ranking steps that need to be done?
I had 9 seconds earlier with Cline. That said, the resulting output file I had requested was over 122 KB, generated in 58.69 seconds, so I was approaching 2 KB per second even factoring in the high TTFT.
The high TTFT (around 5-6 seconds) is what kills the excitement for me. Sure, once it starts outputting it's crazy fast, so it's good for generating single-file prototypes, but as soon as you try to use it in Cline or any other agentic loop you'll be waiting on API requests constantly, and it's a real bottleneck.
It says it works with your favorite IDE. How do you (the reader) plan to use this? I use Cursor, but I'm not sure if this replaces my need to pay for Cursor, or if I need to pay for Cursor AND this, and add in the LLM?
Or is VS code pretty good at this point? Or is there something better? These are the only two ways I'd know how to actually consume this with any success.
Any plugin that allows using an OpenAI-compatible endpoint should work fine (e.g. RooCode, Cline, etc. for VSCode).
Personally, I use code-companion on neovim.
Maybe not the best solution for vibe coders but for serious engineers using these tools for AI-assisted development, OpenAI API compatibility means total flexibility.
Flux159|7 months ago
sysmax|7 months ago
If you don't mind the shameless plug, there's a more explanation how it works here: https://sysprogs.com/CodeVROOM/documentation/concepts/symbol...
seunosewa|7 months ago
The API price is not a reason to reject the subscription price.
Havoc|7 months ago
waldrews|7 months ago
BenGosub|7 months ago
beastman82|7 months ago
In fact, it seems obvious that you should use the flat-fee model instead.
thanhhaimai|7 months ago
I was excited, then I read this:
> Send up to 1,000 messages per day—enough for 3–4 hours of uninterrupted vibe coding.
I don't mind paying for services I use. But it's hard to take this seriously when the claim in the first paragraph contradicts the fine print.
superasn|7 months ago
[1] https://www.viberank.app/
sneilan1|7 months ago
attentive|7 months ago
weitendorf|7 months ago
kristjansson|7 months ago
amirhirsch|7 months ago
Palmik|7 months ago
unraveller|7 months ago
[1] https://old.reddit.com/r/LocalLLaMA/comments/1mfeazc/cerebra...
itsafarqueue|7 months ago
nickandbro|7 months ago
apwell23|7 months ago
crawshaw|7 months ago
alfalfasprout|7 months ago
For in-editor use that's game changing.
itsafarqueue|7 months ago
exclipy|7 months ago
https://x.com/windsurf/status/1951340259192742063
bluelightning2k|7 months ago
namanyayg|7 months ago
I think a lot more companies will follow suit and the competition will make pricing much better for the end user.
congrats on the launch Cerebras team!
ktsakas|7 months ago
d4rkp4ttern|7 months ago
https://github.com/pchalasani/claude-code-tools?tab=readme-o...
amirhirsch|7 months ago
sneilan1|7 months ago
bangaladore|7 months ago
wordofx|7 months ago
lvl155|7 months ago
no_flaks_given|7 months ago
They shat the bed. They went for super crazy fast compute and not much memory, assuming that models would plateau at a few billion parameters.
Last year 70B parameters was considered huge, and a good place to standardize around.
Today we have 1T-parameter models, and we know it still scales with parameter count.
So next year we might have 10T-parameter LLMs and these guys will still be playing catch-up.
All that matters for inference right now is how many HBM chips you can stack, and that's it.
dmitrygr|7 months ago
jedisct1|7 months ago
I subscribed to the $50 plan. It's super fast, for sure, but rate limits kick in after just a couple of requests, completely defeating the point of the fast responses.
Did I miss something?
attentive|7 months ago
Any attempt to deal with "<think>" in the code gets it replaced with "<tool_call>".
Both in inference.cerebras.ai chat and API.
Same model on chat.qwen.ai doesn't do it.
segmondy|7 months ago
rbitar|7 months ago
sophia01|7 months ago
social_quotient|7 months ago
pxc|7 months ago
baq|7 months ago
it's two kilotokens per second. that's fast.
clbrmbr|7 months ago
rowanG077|7 months ago
ixel|7 months ago
d3vr|7 months ago
scosman|7 months ago
dlojudice|7 months ago
> Yeah I filed a ticket with Cursor
> They have problems with OpenAI customization
saberience|7 months ago
Who is the intended audience for Cerebras?
ritenuto|7 months ago
JackYoustra|7 months ago
deevus|7 months ago
HardCodedBias|7 months ago
kristopolous|7 months ago
The $200/month is their "poor person" product for people who can't shell out $500k on one of their rigs.
https://www.cerebras.ai/system
ahmadyan|7 months ago
UnPerson-Alpha2|7 months ago
hereme888|7 months ago
tbarbugli|7 months ago
another_twist|7 months ago
unshavedyak|7 months ago
lxe|7 months ago
d3vr|7 months ago
Roo Code support added in v3.25.5: https://github.com/RooCodeInc/Roo-Code/releases/tag/v3.25.5
Cerebras has also been added as a provider for Qwen 3 Coder in OpenRouter: https://openrouter.ai/qwen/qwen3-coder?sort=throughput
dpkirchner|7 months ago
anonym29|7 months ago
M4v3R|7 months ago
txyx303|7 months ago
atkailash|7 months ago
scosman|7 months ago
Consumer-Basics|7 months ago
cellis|7 months ago
anonym29|7 months ago
romanovcode|7 months ago
No weekly limits so far. Just you wait: if they get the same traction as Claude or more, they're going to follow the same playbook.
knicholes|7 months ago
alfalfasprout|7 months ago
esafak|7 months ago
supernova8|7 months ago
kristopolous|7 months ago
unshavedyak|7 months ago
Claude and Gemini have similar offerings for a similar/same price, i thought. Eg if Claude Code can do it for $200/m, why can't Cerebras?
(honest question, trying to understand the challenge for Cerebras that you're pointing to)
edit: Maybe it's the speed? 2k tokens/s sounds... fast, much faster than Claude. Is that what you're referring to?
meepmorp|7 months ago
dude250711|7 months ago
dang|7 months ago
"Don't be curmudgeonly."
https://news.ycombinator.com/newsguidelines.html
andrewmutz|7 months ago
reactordev|7 months ago
fishsticks89|7 months ago