allisdust | 5 months ago

Nothing (except maybe Groq?) comes even close to Cerebras in inference speed. I seriously don't get why these guys aren't more popular. The difference between using them as an inference provider and using anything else, for any use case, is night and day. I hope more inference providers focus on speed. And this is where AMZN will benefit a lot, since their entire cloud model is to take something people would want anyway and mark it up 3x. God forbid AVGO acquires this.

xadhominemx | 5 months ago

Cerebras hasn't made any technical breakthroughs; they are just putting everything in SRAM. It's a brute-force approach that achieves very high inference throughput, but it comes at an extremely high cost per token per second and is not useful for batched inference. Groq uses the same approach.
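
Rough napkin math on the cost side (Python sketch; the SRAM capacity, model size, and GPU memory figures here are my own illustrative assumptions, not vendor-published numbers):

    # Back-of-envelope: chips needed to hold model weights entirely in on-chip SRAM.
    # All figures below are illustrative assumptions, not vendor specs.

    SRAM_PER_CHIP_GB = 44      # assumed on-chip SRAM per wafer-scale part
    PARAMS_BILLIONS = 70       # assumed model size (a 70B-parameter LLM)
    BYTES_PER_PARAM = 2        # fp16/bf16 weights

    weights_gb = PARAMS_BILLIONS * BYTES_PER_PARAM        # 140 GB of weights
    chips_needed = -(-weights_gb // SRAM_PER_CHIP_GB)     # ceiling division -> 4

    print(f"{weights_gb} GB of weights needs at least {chips_needed} chips held in SRAM,")
    print("vs. two hypothetical 80 GB HBM GPUs: far cheaper, far less bandwidth.")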

Memory hierarchy management across HBM/DDR/Flash is much more difficult, but it is necessary to achieve practical inference economics.
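
To illustrate what that hierarchy management might look like, a minimal placement sketch (the tier capacities, bandwidths, and greedy spill policy are assumptions for illustration, not anyone's actual scheduler):

    # Minimal sketch of tiered weight placement across HBM/DDR/Flash.
    # Capacities, bandwidths, and the greedy policy are illustrative assumptions.

    TIERS = [                # (name, capacity GB, read bandwidth GB/s)
        ("HBM",    80.0, 3000.0),
        ("DDR",   512.0,  100.0),
        ("Flash", 4096.0,    5.0),
    ]

    def place_layers(layer_sizes_gb):
        """Pin the earliest (assumed hottest) layers in the fastest tier,
        spilling later layers down the hierarchy as each tier fills."""
        placement, tier, used = [], 0, 0.0
        for i, size in enumerate(layer_sizes_gb):
            while tier < len(TIERS) and used + size > TIERS[tier][1]:
                tier, used = tier + 1, 0.0       # spill to the next tier down
            if tier == len(TIERS):
                raise MemoryError("model does not fit anywhere in the hierarchy")
            placement.append((i, TIERS[tier][0]))
            used += size
        return placement

    # 80 layers of 2 GB each: layers 0-39 land in HBM, layers 40-79 spill to DDR.
    print(place_layers([2.0] * 80))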

twothreeone | 5 months ago

I don't think you appreciate the history of wafer-scale integration and what it means for the chip industry [1]. The approach was famously attempted by Gene Amdahl's Trilogy Systems in the '80s, but it failed dramatically, and the industry instead moved toward (among other things) add-in "accelerator cards", culminating in the NVIDIA GeForce 256, the first GPU, in 1999. It's not that NVIDIA hasn't been trying to integrate multiple dies in the same package; doing that successfully has been a huge technological hurdle so far.

[1] https://ieeexplore.ieee.org/abstract/document/9623424

reliabilityguy | 5 months ago

Optimizing for a single metric, e.g. speed, leads to suboptimal outcomes on other metrics, e.g. cost and scalability.

I think that while Cerebras is fast, it's probably not very economical in fleets at scale.
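
A toy two-metric comparison makes the trade-off concrete (every number here is invented to illustrate the point, not a benchmark of any real provider):

    # Toy two-metric comparison: raw speed vs tokens per dollar.
    # All numbers are made up for illustration, not measurements.

    providers = {
        # name: (tokens/sec per replica, $/hour per replica)
        "sram_speedster": (1800, 60.0),   # very fast, very expensive hardware
        "hbm_batcher":    (300,   4.0),   # slower, cheaper, batches well
    }

    for name, (tps, dollars_per_hr) in providers.items():
        tokens_per_dollar = tps * 3600 / dollars_per_hr
        print(f"{name}: {tps} tok/s, {tokens_per_dollar:,.0f} tokens/$")

    # The fast provider wins on latency; the cheap one wins ~2.5x on tokens/$,
    # which is the metric that dominates when operating a large fleet.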