top | item 44762959

Cerebras Code

449 points| d3vr | 7 months ago |cerebras.ai

172 comments

order

Flux159|7 months ago

Tried this out with Cline using my own API key (Cerebras is also available as a provider for Qwen3 Coder via via openrouter here: https://openrouter.ai/qwen/qwen3-coder) and realized that without caching, this becomes very expensive very quickly. Specifically, after each new tool call, you're sending the entire previous message history as input tokens - which are priced at $2/1M via the API just like output tokens.

The quality is also not quite what Claude Code gave me, but the speed is definitely way faster. If Cerebras supported caching & reduced token pricing for using the cache I think I would run this more, but right now it's too expensive per agent run.

sysmax|7 months ago

Adding entire files into the context window and letting the AI sift through it is a very wasteful approach.

It was adopted because trying to generate diffs with AI opens a whole new can of worms, but there's a very efficient approach in between: slice the files on the symbol level.

So if the AI only needs the declaration of foo() and the definition of bar(), the entire file can be collapsed like this:

  class MyClass {
    void foo();
    
    void bar() {
        //code
    }
  }
Any AI-suggested changes are then easy to merge back (renamings are the only notable exception), so it works really fast.

I am currently working on an editor that combines this approach with the ability to step back-and-forth between the edits, and it works really well. I absolutely love the Cerebras platform (they have a free tier directly and pay-as-you-go offering via OpenRouter). It can get very annoying refactorings done in one or two seconds based on single-sentence prompts, and it usually costs about half a cent per refactoring in tokens. Also great for things like applying known algorithms to spread out data structures, where including all files would kill the context window, but pulling individual types works just fine with a fraction of tokens.

If you don't mind the shameless plug, there's a more explanation how it works here: https://sysprogs.com/CodeVROOM/documentation/concepts/symbol...

seunosewa|7 months ago

The Cerebras.ai plan offers a flat fee of $50 or $200.

The API price is not a reason to reject the subscription price.

Havoc|7 months ago

This seems to be rate limited by message not token so the lack of cache may matter less

waldrews|7 months ago

Does caching make as much sense as a cost saving measure on Cerebras hardware as it does on mainstream GPU's? Caching should be preferred if SSD->VRAM is dramatically cheaper than recalculation. If Cerebras is optimized for massively parallel compute with fixed weights, and not a lot of memory bandwidth into or out of the big wafer, it might actually make sense to price per token without a caching discount. Could someone from the company (or otherwise familiar with it) comment on the tradeoff?

BenGosub|7 months ago

If they say it costs $50 per month, why do you need to make additional payments?

beastman82|7 months ago

the API price is not very relevant to this flat fee service announcement.

In fact it seems obvious that you should use the flat fee model instead

thanhhaimai|7 months ago

> running at speeds of up to 2,000 tokens per second, with a 131k-token context window, no proprietary IDE lock-in, and no weekly limits!

I was excited, then I read this:

> Send up to 1,000 messages per day—enough for 3–4 hours of uninterrupted vibe coding.

I don't mind paying for services I use. But it's hard to take this seriously when the first paragraph claim is contradicting the fine prints.

sneilan1|7 months ago

1,000 messages per day should be plenty as a daily development driver. I use claude code sonnet 4 exclusively and I do not send more than 1,000 messages per day. However, that is my current understanding. I am certainly not pressing enter 1,000 times! Maybe there are more messages being sent under the hood that I do not realize?

attentive|7 months ago

To put this into perspective, github copilot Business license is 300 "premium" requests a MONTH.

weitendorf|7 months ago

We're just doing usage-based pricing for our ai devtools product because it's the only way to square the circle of "as much access to an expensive thing as you want, at a reasonable price".

It's harder to set up, lends itself to lower margins, and consumers generally do prefer more predictable/simpler pricing, but so many ai devtools products have pissed their users off by throttling their "unlimited"/plan-based pricing that I think it's now seen as a yellow flag

kristjansson|7 months ago

It’s a true statement - no weekly limits, just a daily limit. Easier to work with when you can only get locked out of your tool for 23h59m

amirhirsch|7 months ago

the distinction is from weekly limits of claude code.

Palmik|7 months ago

Yes, to differentiate from Claude Code which has 5-hour-window limits as well as weekly limits on top

unraveller|7 months ago

Some users who signed up for pro ($50 p.m.) are reporting further limitations than those advertised.

>While they advertise a 1,000-request limit, the actual daily constraint is a 7.5 million-token limit. [1]

Assumes an average of 7.5k/request whereas in their marketing videos they show API requests ballooning by ~24k per request. Still lower than the API price.

[1] https://old.reddit.com/r/LocalLLaMA/comments/1mfeazc/cerebra...

itsafarqueue|7 months ago

Bait and switched their FAQ after the fact too. Come on Cerebras, it’s only VC money you’re burning here in the first place, let’s see some commitment to winning market share. :money: :fire:

nickandbro|7 months ago

Had a similar experience. I got rate limited as well even when I well below 1M tokens. When its working, it's nice, but can't use it as a replacement for Cursor until higher rate limits are granted.

crawshaw|7 months ago

If you would like to try this in a coding agent (we find the qwen3-coder model works really well in agents!), we have been experimenting with Cerebras Code in Sketch. We just pushed support, so you can run it with the latest version, 0.0.33:

  brew install boldsoftware/tap/sketch
  CEREBRAS_API_KEY=...
  sketch --model=qwen3-coder-cerebras -skaband-addr=
Our experience is it seems overloaded right now, to the point where we have better results with our usual hosted version:

  sketch --model=qwen

alfalfasprout|7 months ago

2k tokens/second is insane. While I'm very much against vibe coding, such performance essentially means you can get near-github copilot level speed with drastically better quality.

For in-editor use that's game changing.

itsafarqueue|7 months ago

At full pace that means 62 mins until you hit the daily cap.

namanyayg|7 months ago

I was waiting for more subscription base services to pop up to compete with the influence provider on a commodities level.

I think a lot more companies will follow suit and the competition will make pricing much better for the end user.

congrats on the launch Cerebras team!

ktsakas|7 months ago

Does it work with claude-code-router? I was getting API errors this week trying to use qwen3 Cerebras through OpenRouter with Claude code router.

amirhirsch|7 months ago

API Error: 422 {"error":{"message":"Error from provider: {\"message\":\"body.messages.0.system.content: Input should be a valid string\",\"type\":\"invalid_request_error\",\"param\":\"validation_error\",\"code\":\"wrong_api_format\"}

sneilan1|7 months ago

I'm so excited to see a real competitor to Claude Code! Gemini CLI, while decent, does not have a $200/month pricing model and they charge per API access - Codex is the same. I'm trying to get into the https://cloud.cerebras.ai/ to try the $50/month plan but I can't even get in.

bangaladore|7 months ago

Unless I'm misunderstanding something. Cerebras Code is not equivalent to Claude Code or Gemini CLI. It's a strange name for a subscription to access an API endpoint.

You take your Cerebras Code endpoint and configure XYZ CLI tool or IDE plugin to point at it.

wordofx|7 months ago

This doesn’t feel like a competitor. Amp does tho.

lvl155|7 months ago

Their hardware is incredible. Why aren’t more investors lining up for this in this environment?

no_flaks_given|7 months ago

This model is super quantized and the quality isn't great, but that's necessary because just like everyone else except for Nvidia and AMD

They shat the bed. They went for super crazy fast compute and not much memory, assuming that models would plateu at a fee billion parameters.

Last year 70b parameters was considered huge, and a good place to standardize around.

Today we have 1t parameter models and we know it still scales linearly with parameters.

So next year we might have 10T parameter LLMs and these guys will still be playing catch up.

All that matters for inference right now is how many HBM chips you can stack and that's it

dmitrygr|7 months ago

Contradictions do not exist. Whenever you think that you are facing a contradiction, check your premises. You will find that one of them is wrong.

jedisct1|7 months ago

I'm a little bit confused.

I subscribed to the $50 plan. It's super fast for sure, but rate limits kick in after just a couple requests. completely defeating the fact that responses are fast.

Did I miss something?

attentive|7 months ago

Attn: Cerebras

Any attempt to deal with "<think>" in the code gets it replaced with "<tool_call>".

Both in inference.cerebras.ai chat and API.

Same model on chat.qwen.ai doesn't do it.

rbitar|7 months ago

This token throughput is incredible and going to set a new bar in the industry. The main issue with the cerebras code plan is that number of requests/minute is throttled, and with agentic coding systems each tool call is treated as new "message" so you can easily hit the api limits (10 messages/minute).

One workaround we're doing now that seems to work is use claude for all tasks but delegate specific tools with cerebras/qwen-3-coder-480b model to generate files or other token heavy tasks to avoid spiking the total number of requests. This has cost and latency consequences (and adds complexity to the code), but until those throttle limits are lifted seems to be a good combo. I also find that claude has better quality with tool selection when the number of tools required is > 15 which our current setup has.

sophia01|7 months ago

My understanding is that the coding agents people use can be modified to plug into any LLM provider's API?

The difference here seems to be that Cerebras does not appear to have Qwen3-Coder through their API! So now there is a crazy fast (and apparently good too?) model that they only provide if you pay the crazy monthly sub?

social_quotient|7 months ago

Exactly! You can use tools like https://github.com/musistudio/claude-code-router which let you use other LLMs.

The way I would use this $50 Cerebras offering is as a delegate for some high token count items like documentation, lint fixing, and other operations as a way not only to speed up the workflow but to release some back pressure on Anthropic/claude so you don’t hit your limits as quickly… especially with the new weekly throttle coming. This $50 dollar jump seems very reasonable, now for the 1k completions a day, id really want to see and get a feel for how chatty it is.

I suppose thats how it starts but id the model is competent and fast, the speed alone might force you a bit to delegate more to it. (Maybe sub agent tasks)

pxc|7 months ago

You can still get it pay-as-you-go on OpenRouter, afaict, and the billing section of the Cerebras Cloud account I just created has a section for Qwen3-Coder-480B as well.

baq|7 months ago

define 'crazy'.

it's two kilotokens per second. that's fast.

clbrmbr|7 months ago

At $200/month the comparable should be Opus 4 not Sonnet 4.

rowanG077|7 months ago

Not really. With Opus 4 you will burn into the thousand a month with serious usage. I tested it yesterday and 5 hours of use was 60$. If I extrapolate that you will easily hit 1K+.

ixel|7 months ago

The usage limit on Cerebras Code is rather limited, $50 plan apparently gives you 7.5 million tokens per day which doesn't last long. This also isn't clearly advertised on the plans prior to purchasing.

d3vr|7 months ago

Yeah really disappointing, hopefully they'll reconsider this limit because it really isn't usable, especially with "agentic tools" (e.g: opencode) ..

scosman|7 months ago

Anyone get this working in Cursor? I can connect openrouter just fine, but Cerebras just errors out instantly. Same url/key works via curl, so some sort of Cerebras/Cursor compatibility issue.

dlojudice|7 months ago

Same here. Got this msg on the Celebras discord:

> Yeah I filed a ticket with Cursor

> They have problems with OpenAI customization

saberience|7 months ago

Ok it's fast, but rate limits seem to kick in extremely quickly and the results are less good than Claude Code and it ends up more expensive?

Who is the intended audience for Cerebras?

ritenuto|7 months ago

While I’m also curious, I’m fine with having a mostly inferior alternative too. This is a dynamic market with some big players already; having more options is beneficial. If only as a way to prevent others from doing a rug pull.

JackYoustra|7 months ago

I've been waiting on this for a LONG time. Integration with Cursor when Cerebras released their earlier models was patchy at best, even through openrouter. It's nice to finally see official support, although I'm a bit worried about long-term the time for bash mcp calls ending up dominating.

Still, definitely the right direction!

EDIT: doesn't seem like anything but a first-party api with a monthly plan.

deevus|7 months ago

I'm finding myself switching between subscriptions to ChatGPT, T3 Chat, DeepSeek, Claude Code etc. Their subscription models aren't compatible with making it easy to take your data with you. I wish I could try this out and import all my data.

HardCodedBias|7 months ago

This has to be a monstrous money loser.

If they can maintain this pricing level, and if Qwen3‑Coder is as good as people say then they will have an enormous hit on their hands. A massive money losing hit, but a hit.

Very interesting!

PS: Did they reduce the context window, it looks like it.

kristopolous|7 months ago

They are a hardware company. They have a custom chip they are running it on.

The $200/month is their "poor person" product for people who can't shell out $500k on one of their rigs.

https://www.cerebras.ai/system

ahmadyan|7 months ago

Why?

For $200plan, it has 40M token cap per day, so assuming the API pricing, the max usage per day is $12/day or 360 per month. (Assuming user max-out usage every day or doesn't hit the 1000message limit first)

relatively standard subscription pricing vs API pricing, i believe they are making money from this and counting on people compare this to Claude Code, which is a much more generous offer.

UnPerson-Alpha2|7 months ago

Honest ? What are you thinking in terms of cost structure that makes you sure it is a money loser? Can you break down your assumptions.

hereme888|7 months ago

So for <$1.7/day I can hire a programmer at a sort-of Claude Sonnet 4 level? I know it's got its quirks, limits, and needs supervision, but it's like 20x cheaper than an average programmer.

tbarbugli|7 months ago

ofc it depends where you would hire, for me (NL) its above 100x more efficient

another_twist|7 months ago

How does context buildup work for the code generating machines generally ? Do the programs just use human notes + current code directly ? Are there some specific ranking steps that need to be done ?

unshavedyak|7 months ago

Super curious to see some comparisons to claude code. Especially Opus, since they're primarily comparing it to Sonnet in that graph.

dpkirchner|7 months ago

For those that have tried this, what kind of time-to-first-token latency are you seeing?

anonym29|7 months ago

I had 9 seconds, earlier with Cline. That said, resulting output file I had requested generation of was over 122KB in 58.690 seconds, so I was approaching 2KB per second even factoring in high TTFT.

M4v3R|7 months ago

The high TTFT (around 5-6 seconds) is what kills the excitement for this for me. Sure, when it starts outputting its crazy fast so it’s good for generating single file prototypes, but as soon as you try to use it in Cline or any other agentic loop you’ll be waiting for API requests constantly and it’s a real bottleneck.

txyx303|7 months ago

feels very low compared to claude/gpt for me

atkailash|7 months ago

I use regular cerebras for plan stage in cline, so I’m very excited to try this out

scosman|7 months ago

Groq also probably has this in the works. Fun times.

cellis|7 months ago

What are the token prices?

anonym29|7 months ago

$2/Mtok in and out but no caching discounts

romanovcode|7 months ago

> and no weekly limits!

No weekly limits so far. Just you wait if you get same or more traction as Claude you are going to go same playbook as they did.

knicholes|7 months ago

It says it works with your favorite IDE-- How do you (the reader) plan to use this? I use Cursor, but I'm not sure if this replaces my need to pay for Cursor, or if I need to pay for Cursor AND this, and add in the LLM?

Or is VS code pretty good at this point? Or is there something better? These are the only two ways I'd know how to actually consume this with any success.

alfalfasprout|7 months ago

any plugin that allows using an OpenAI compatible endpoint should work fine (eg; RooCode, Cline, etc. for VSCode).

Personally, I use code-companion on neovim.

Maybe not the best solution for vibe coders but for serious engineers using these tools for AI-assisted development, OpenAI API compatibility means total flexibility.

esafak|7 months ago

They should just host all the latest open source models FTW.

supernova8|7 months ago

How is this even possible?

unshavedyak|7 months ago

Incase i'm missing something, why wouldn't it be possible?

Claude and Gemini have similar offerings for a similar/same price, i thought. Eg if Claude Code can do it for $200/m, why can't Cerebras?

(honest question, trying to understand the challenge for Cerebras that you're pointing to)

edit: Maybe it's the speed? 2k tokens/s sounds... fast, much faster than Claude. Is that what you're referring to?

meepmorp|7 months ago

They make frisbee-sized CPUs.

dude250711|7 months ago

[flagged]

andrewmutz|7 months ago

If you review every change as it goes, vibecoded results are often better than human-only and written much faster

reactordev|7 months ago

Nah, we’ll have a Legacy Coder agent to fix vibe coding agents so you’ll be supervising those. Yey…

fishsticks89|7 months ago

It will just be replaced by more vibe code in the future. Code is like toilet paper now.