(no title)
maz1b | 5 months ago
I also wonder why they have not been acquired yet. Or is it intentional?
I will say, their pricing and deployment strategy is a bit murky and unclear. Paying $1500-$10,000 per month plus usage costs? I'm assuming that it has to do with chasing and optimizing for higher value contracts and deeper-pocketed customers, hence the minimum monthly spend that they require.
I'm not claiming to be an expert, but as a CEO/CTO, I found other providers in the market that had relatively comparable inference speed (obviously Cerebras is #1), easier onboarding, and better responsiveness from the people who work there (all of my interactions with Cerebras have been days/weeks late or simply ignored). IMHO, if Cerebras wants to gain more mindshare, they'll have to look into this aspect.
aurareturn|5 months ago
1. To achieve high speeds, they put everything in SRAM. I estimated that they would need over $100m of chips just to run Qwen 3 at max context size. You can run the same model with max context size on $1m of Blackwell chips, but at a slower speed. Anandtech had an article saying that Cerebras was selling a single chip for around $2-3m (a rough back-of-envelope follows this comment). https://news.ycombinator.com/item?id=44658198
2. SRAM has virtually stopped scaling in new nodes. Therefore, new generations of wafer scale chips won’t gain as much as traditional GPUs.
3. Cerebras was designed in the pre-ChatGPT era, when much smaller models were being trained. It is practically useless for training in 2025 because of how big LLMs have gotten. It can only do inference, but see the two problems above.
4. To run inference on very large LLMs economically, Cerebras would need to use external HBM. If it has to reach outside the wafer for memory, the benefits of a wafer-scale chip greatly diminish. Remember that the whole idea was to put the entire AI model inside the wafer so memory bandwidth is ultra fast.
5. Chip interconnect technology might make wafer-scale chips redundant. TSMC has a roadmap for gluing more than two GPU dies together. Nvidia's Feynman GPUs might have four dies glued together. I.e., the sweet spot for large chips might not be a full wafer but perhaps 2, 4, or 8 dies glued together.
6. Nvidia seems to be moving much faster in terms of development and responding to market needs. For example, Blackwell is focused on FP4 inferencing now. I suppose designing and building a wafer-scale chip is inherently more complex than designing a GPU. Cerebras also needs to wait for new nodes to fully mature so that yields can be higher.
There exists a niche where some applications might need super fast token generation regardless of price. Hedge funds and Wall Street might be good use cases. But it won't challenge Nvidia in training or large-scale inference.
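To make point 1 concrete, here is a rough weights-only back-of-envelope (illustrative figures: WSE-3's published ~44 GB of on-chip SRAM per wafer, 16-bit weights, and the ~$2-3m per-chip figure from the Anandtech piece; KV cache at max context, which the $100m estimate also covers, would add substantially on top):

    # Weights-only sketch (illustrative figures, not an exact costing).
    params = 235e9                            # Qwen3-235B total parameters (MoE)
    weight_gb = params * 2 / 1e9              # ~470 GB at 2 bytes/param
    sram_per_wafer_gb = 44                    # published WSE-3 on-chip SRAM
    wafers = weight_gb / sram_per_wafer_gb    # ~11 wafers for weights alone
    print(f"~{wafers:.0f} wafers -> ${wafers*2:.0f}m-${wafers*3:.0f}m before KV cache")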
sailingparrot|5 months ago
I will point out (again :)) that this math is completely wrong. There is no need (nor any performance gain) to store the entire weights of the model in SRAM. You simply keep n transformer blocks on-chip and stream block l+n in from external memory while you compute block l; this completely masks the communication time behind the compute time, and specifically does not require buying $100m worth of SRAM. This is standard stuff that is done routinely in many scenarios, e.g. FSDP.
https://www.cerebras.ai/blog/cerebras-software-release-2.0-5...
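A minimal sketch of what that overlap looks like, using PyTorch CUDA streams purely as a stand-in for "copy the next block in from external memory while computing the current one" (block count, sizes, and the one-block prefetch depth are made up; this is not Cerebras' actual software stack):

    import torch

    n_blocks, hidden = 8, 4096
    # "External memory": weights live in pinned host RAM.
    blocks = [torch.randn(hidden, hidden).pin_memory() for _ in range(n_blocks)]
    x = torch.randn(1, hidden, device="cuda")

    copy_stream = torch.cuda.Stream()
    on_chip = [None] * n_blocks
    ready = [torch.cuda.Event() for _ in range(n_blocks)]

    def prefetch(i):
        # Issue the host->device copy on a side stream so it overlaps compute.
        with torch.cuda.stream(copy_stream):
            on_chip[i] = blocks[i].to("cuda", non_blocking=True)
            ready[i].record()

    prefetch(0)
    for l in range(n_blocks):
        if l + 1 < n_blocks:
            prefetch(l + 1)                               # start fetching block l+1
        torch.cuda.current_stream().wait_event(ready[l])  # block l has arrived
        x = x @ on_chip[l]                                # "compute block l"
        # A real implementation would recycle the on-chip buffer here.
    print(x.shape)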
addaon|5 months ago
But there are several 1T (single-transistor-cell) memories that are still scaling, more or less: eDRAM, MRAM, etc. Is there anything preventing their general architecture from moving to a 1T technology once the density advantages outweigh the need for pipelining to hide access time?
oceanplexian|5 months ago
Recently there was a fiasco I saw posted on r/localllama where many of the OpenRouter providers scored worse on benchmarks than the base models, implying they are serving quantized models to save costs but not telling customers about it. Unless you're actually auditing the tokens you're purchasing, you may not be getting what you're paying for, even if the T/s and $/token seem better.
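One lightweight way to audit is to run the same exact-answer prompts against whatever endpoint you're paying for and score them, then compare across providers; the base URL, model slug, and toy prompts below are placeholders and nowhere near a real benchmark:

    from openai import OpenAI

    # Point the same script at different OpenAI-compatible endpoints/providers.
    client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="sk-...")
    checks = [
        ("What is 17 * 23? Reply with only the number.", "391"),
        ("What is 12 + 30? Reply with only the number.", "42"),
    ]

    correct = 0
    for prompt, expected in checks:
        resp = client.chat.completions.create(
            model="qwen/qwen3-235b-a22b",   # placeholder model slug
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        correct += resp.choices[0].message.content.strip() == expected
    print(f"{correct}/{len(checks)} exact matches")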
teruakohatu|5 months ago
Do you have any information on this? This seems brand-destroying for both OpenRouter and the model providers.
throw123890423|5 months ago
Yeah wait, why rent chips instead of sell them? Why wouldn't customers want to invest money in competition for cheaper inference hardware? It's not like Nvidia has a blacklist of companies that have bought chips from competitors, or anything. Now that would be crazy! That sure would make this market tough to compete in, wouldn't it. I'm so glad Nvidia is definitely not pressuring companies to not buy from competitors or anything.
aurareturn|5 months ago
1. They're useless for training in 2025. They were designed for training prior to the LLM explosion. They're not practical for training anymore because they rely on SRAM, which is no longer scaling.
2. No one is going to spend the resources to optimize models to run on their SDK and hardware. Open source inference engines don’t optimize for Cerebras hardware.
Given the above two reasons, it makes a lot of sense that no one is investing in their hardware and they have switched to a cloud model selling speed as the differentiator.
It’s not always “Nvidia bad”.
nsteel|5 months ago
I thought it was the SRAM scaling that was impressive, no?