According to the company, the new chip will enable training of AI models with up to 24 trillion parameters. Let me repeat that, in case you're as excited as I am: 24. Trillion. Parameters. For comparison, the largest AI models currently in use have around 0.5 trillion parameters, roughly 48x fewer.
Each parameter is a connection between artificial neurons. For example, inside an AI model, a linear layer that transforms an input vector with 1024 elements to an output vector with 2048 elements has 1024×2048 = ~2M parameters in a weight matrix. Each parameter specifies how much one element of the input vector adds to or subtracts from one element of the output vector. Each output vector element is a weighted sum (AKA a linear combination) of all the input vector elements.
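To make the linear-layer description concrete, here is a minimal pure-Python sketch. The `linear` helper and the tiny 3-input, 2-output weight matrix are made up for illustration and are not from any particular framework; the layer is assumed to have no bias term.

```python
def linear(weights, x):
    """Bias-free linear layer: each output is a weighted sum of all inputs.

    `weights` has one row per output element and one column per input element.
    """
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


# Parameter count for the layer described above (1024 inputs, 2048 outputs).
in_features, out_features = 1024, 2048
param_count = in_features * out_features  # 2,097,152, i.e. ~2M

# A tiny hypothetical layer so the arithmetic can be checked by hand.
tiny_weights = [
    [1.0, 0.0, -1.0],  # output 0 = 1*x0 + 0*x1 - 1*x2
    [0.5, 0.5, 0.5],   # output 1 = half the sum of the inputs
]
tiny_output = linear(tiny_weights, [2.0, 4.0, 6.0])
print(param_count, tiny_output)
```

A positive weight makes an input add to an output, a negative weight makes it subtract, and a zero weight means no connection at all.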
A human brain has an estimated 100-500 trillion synapses connecting biological neurons. Each synapse is quite a complicated biological structure[a], but if we oversimplify things and assume that every synapse can be modeled as a single parameter in a weight matrix, then the largest AI models in use today have approximately (100T to 500T) ÷ 0.5T = 200x to 1000x fewer connections between neurons than the human brain. If the company's claims prove true, this new chip will enable training of AI models that have only about 4x to 20x fewer connections than the human brain.
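The ratios above are easy to reproduce. This sketch uses only the figures quoted in the comment (0.5T and 24T parameters, 100T to 500T synapses) and keeps the same one-synapse-equals-one-parameter oversimplification; the variable and function names are made up for illustration.

```python
TRILLION = 10**12

current_largest = 0.5 * TRILLION   # largest models in use, per the comment
new_chip_max = 24 * TRILLION       # the company's claimed maximum
synapses_low = 100 * TRILLION      # low estimate for a human brain
synapses_high = 500 * TRILLION     # high estimate for a human brain


def times_fewer(model_params, synapses):
    """How many times fewer connections a model has than the brain."""
    return synapses / model_params


today = (times_fewer(current_largest, synapses_low),
         times_fewer(current_largest, synapses_high))
claimed = (times_fewer(new_chip_max, synapses_low),
           times_fewer(new_chip_max, synapses_high))
scale_up = new_chip_max / current_largest

print(today, claimed, scale_up)
```

With these inputs the gap shrinks from 200x-1000x to roughly 4.2x-20.8x, which the comment rounds to 4x-20x.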
> but if we oversimplify things and assume that every synapse can be modeled as a single parameter in a weight matrix
Which, it probably can't... but offsetting those simplifications and 4-20x difference is the massive difference in how quickly those synapses can be activated.
If you were to add up all transistors fabricated worldwide up until <year>, such that the total roughly matches the number on this beast, what year would you arrive at? Hell, throw in discrete transistors if you want.
How many early supercomputers / workstations etc would that include? How much progress did humanity make using all those early machines (or any transistorized device!) combined?
My guess is the titles get auto-adjusted by Hacker News, but the script that does it doesn’t have logic for a trillion and only goes up to a billion, hence the weirdness of a string match and replace.
To quote their official response: "If the WSE weren't rectangular, the complexity of power delivery, I/O, mechanical integrity and cooling become much more difficult, to the point of impracticality."
As I understand it, WSE-2 was kind of handicapped because its performance could only really be harnessed if the neural net fit in the on-chip SRAM. Bandwidth to off-chip memory (normalized to FLOPS) was not as high as Nvidia's. Is that improved with WSE-3? It seems the SRAM is only 10% bigger, so that's not helping.
In the days before LLMs 44 GB of SRAM sounded like a lot, but these days it's practically nothing. It's possible that novel architectures could be built for Cerebras that leverage the unique capabilities, but the inaccessibility of the hardware is a problem. So few people will ever get to play with one that it's unlikely new architectures will be developed for it.
That was more of a WSE-1 problem, maybe? They switched to a new compute paradigm (details on their site if you look up "weight streaming") where they basically store the activations on the wafer instead of the whole model. For something very large (say, 32K context and 16K hidden dimension), this would make one layer's activations only 1-2GB (16-bit or 32-bit). As I understand it, this was one of the key changes needed to go from single-system boxes to the supercomputing clusters they have been able to deploy.
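The 1-2GB figure can be checked with a quick back-of-the-envelope calculation. This sketch assumes a single sequence (batch size 1), reads "32K" and "16k" as 32×1024 and 16×1024, and counts one value per (token, hidden unit) pair; the function name is made up for illustration.

```python
GIB = 2**30


def activation_bytes(context_len, hidden_dim, bytes_per_value):
    """Size of one layer's activations for a single sequence."""
    return context_len * hidden_dim * bytes_per_value


context_len = 32 * 1024  # "32K context"
hidden_dim = 16 * 1024   # "16k hidden dimension"

fp16_gib = activation_bytes(context_len, hidden_dim, 2) / GIB  # 16-bit values
fp32_gib = activation_bytes(context_len, hidden_dim, 4) / GIB  # 32-bit values

print(fp16_gib, fp32_gib)
```

Under those assumptions the result is 1 GiB at 16 bits and 2 GiB at 32 bits, small next to the 44 GB of on-chip SRAM mentioned elsewhere in the thread.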
The Nvidia bandwidth to compute ratio is more necessary because they are moving things around all the time. By keeping all the outputs on the wafer and only streaming the weights, you have a much more favorable requirement for BW to compute. And the number of layers becomes less impactful because they are storing transient outputs.
This is probably one of the primary reasons they didn't need to increase SRAM for WSE-3. WSE-2 was developed based on the old "fit the whole model on the chip" paradigm but models eclipsed 1TB so the new solution is more scalable.
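As a toy illustration of the idea described above (activations stay resident, weights arrive one layer at a time), here is a pure-Python sketch. It is not Cerebras' implementation: `weight_source`, `run_streamed`, and the tiny 2×2 layers are hypothetical stand-ins, and a generator plays the role of off-chip weight storage.

```python
def matvec(weights, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def weight_source(layers):
    """Stand-in for external memory: yields one layer's weights at a time."""
    for weights in layers:
        yield weights


def run_streamed(layers, activation):
    """Keep the activation resident; hold only one layer's weights at once."""
    peak_resident_weights = 0
    for weights in weight_source(layers):
        resident = sum(len(row) for row in weights)
        peak_resident_weights = max(peak_resident_weights, resident)
        activation = matvec(weights, activation)
    return activation, peak_resident_weights


layers = [
    [[1.0, 1.0], [1.0, -1.0]],
    [[2.0, 0.0], [0.0, 2.0]],
    [[0.5, 0.5], [0.0, 1.0]],
]
output, peak = run_streamed(layers, [3.0, 1.0])
total_weights = sum(len(row) for weights in layers for row in weights)
print(output, peak, total_weights)
```

The peak number of resident weights is one layer's worth regardless of how many layers the model has, which is the sense in which depth stops driving the on-chip memory requirement.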
As I understand it, the WSE-2's interconnect is actually quite good, and models are split across chips kinda like GPUs.
And keep in mind that these nodes are hilariously "fat" compared to a GPU node (or even an 8x GPU node), meaning less congestion and overhead from the topology.
One thing I don't understand about their architecture is that they have spent so much effort building this monster of a chip, but if you are going to do something crazy, why not work on processing in memory instead? At least for transformers you will primarily be bottlenecked on matrix multiplication and almost nothing else, so you only need to add a simple matrix-vector unit behind your address decoder and then almost every AI accelerator will become obsolete overnight. I wouldn't suggest this to a random startup though.
Hm, let's wait and see what the GEMM/W perf is, and how many programmer hours it takes to implement, say, an MLP. Wafer-scale data flow may not be a solved problem?
Interesting. I know there are a lot of attempts to hobble China by limiting their access to cutting-edge chips and semiconductor manufacturing technology, but could something like this be a workaround for them, at least for datacenter-type jobs?
Maybe it wouldn't be as powerful as one of these, due to their less capable fabs, but something that's good enough to get the job done in spite of the embargoes.
Is that 150GB/s between elements that expect to run tightly coupled processes together? Maybe the bandwidth between chips is less important.
I mean, in a cluster you might have a bunch of nodes with 8x GPUs hanging off each. If this thing replaces a whole node rather than a single GPU, which I assume is the case, it is not really a useful comparison, right?
Ex Cerebras engineer. In my opinion, this is not going to be the case. The WSE-2 was a b** to program and debug. Their compilation strategy is a dead end, and they invest very little into developer ease. My two cents.
I would be more worried about the fact that next year every CPU is going to ship with some kind of AI accelerator already integrated into the die, which means the only competitive differentiation boils down to how much SRAM and memory bandwidth your AI accelerator is going to have. TOPS or FLOPS will become an irrelevant differentiator.
[+] [-] cs702|2 years ago|reply
We sure live in interesting times!
---
[a] https://en.wikipedia.org/wiki/Synapse
[+] [-] mlyle|2 years ago|reply
[+] [-] topspin|2 years ago|reply
So only 4-20 of these systems are necessary to match the human brain. No?
[+] [-] ipsum2|2 years ago|reply
...
It's meaningless to say something can train a model that has 24 trillion parameters without specifying the dataset size and time it takes to train.
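To see why the parameter count alone says little, here is a rough sketch using the common rule of thumb that training a dense model takes about 6 × parameters × tokens floating-point operations. The token counts and the sustained throughput below are purely hypothetical inputs chosen for illustration, not figures from the announcement.

```python
SECONDS_PER_DAY = 86_400


def training_days(params, tokens, sustained_flops_per_s):
    """Estimate training time with the ~6 * N * D FLOPs approximation."""
    total_flops = 6 * params * tokens
    return total_flops / sustained_flops_per_s / SECONDS_PER_DAY


params = 24e12  # the 24 trillion parameters from the headline claim

# Hypothetical dataset sizes and a hypothetical sustained 1e18 FLOP/s.
days_small = training_days(params, tokens=1e12, sustained_flops_per_s=1e18)
days_large = training_days(params, tokens=1e14, sustained_flops_per_s=1e18)

print(round(days_small), round(days_large))
```

Under this approximation the same 24T-parameter model takes 100x longer on the larger dataset, so "can train 24T parameters" only becomes meaningful once the dataset size and time budget are stated.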
[+] [-] brucethemoose2|2 years ago|reply
https://vimeo.com/853557623
https://web.archive.org/web/20230812020202/https://www.youtu...
(Vimeo/Archive because the original video was taken down from YouTube)
[+] [-] bsder|2 years ago|reply
[+] [-] bitwrangler|2 years ago|reply
200,000 electrical contacts
850,000 cores
and that's the "old" one. wow.
[+] [-] Rexxar|2 years ago|reply
[+] [-] dougmwne|2 years ago|reply
[+] [-] fxj|2 years ago|reply
https://www.cerebras.net/blog/whats-new-in-r0.6-of-the-cereb...
"CSL allows for compile time execution of code blocks that take compile-time constant objects as input, a powerful feature it inherits from Zig, on which CSL is based. CSL will be largely familiar to anyone who is comfortable with C/C++, but there are some new capabilities on top of the C-derived basics."
https://github.com/Cerebras/csl-examples
[+] [-] rbanffy|2 years ago|reply
And far fewer blinking lights.
[+] [-] GaryNumanVevo|2 years ago|reply
[+] [-] RetroTechie|2 years ago|reply
[+] [-] ortusdux|2 years ago|reply
[+] [-] wincy|2 years ago|reply
[+] [-] yalok|2 years ago|reply
Quote:
Billion is a word for a large number, and it has two distinct definitions:
1,000,000,000, i.e. one thousand million, or 10^9 (ten to the ninth power), as defined on the short scale. This is now the most common sense of the word in all varieties of English; it has long been established in American English and has since become common in Britain and other English-speaking countries as well.
1,000,000,000,000, i.e. one million million, or 10^12 (ten to the twelfth power), as defined on the long scale. This number is the historical sense of the word and remains the established sense of the word in other European languages. Though displaced by the short scale definition relatively early in US English, it remained the most common sense of the word in Britain until the 1950s and still remains in occasional use there.
https://en.wikipedia.org/wiki/Billion
[+] [-] bee_rider|2 years ago|reply
[+] [-] imbusy111|2 years ago|reply
[+] [-] crotchfire|2 years ago|reply
I'm sure those TSVs connect to a huge array of switching power supplies, so the 24kW doesn't travel very far at such low voltages.
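A quick calculation shows why the distance matters. Current is power divided by voltage, so at logic-level voltages 24kW means an enormous current. The 24kW figure is from the comment; the supply voltages below, including the 0.8 V core voltage, are assumptions for illustration only.

```python
def current_amps(power_watts, voltage_volts):
    """Current drawn at a given voltage: I = P / V."""
    return power_watts / voltage_volts


POWER_W = 24_000  # the 24kW from the comment

# Hypothetical voltages: a low core voltage versus typical distribution rails.
for volts in (0.8, 12.0, 48.0):
    print(f"{volts:5.1f} V -> {current_amps(POWER_W, volts):8.0f} A")
```

At an assumed 0.8 V this works out to roughly 30,000 A, versus 500 A at 48 V, which is why converting down to the core voltage as close to the silicon as possible keeps the high-current path short.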
[+] [-] geph2021|2 years ago|reply
Imagine the heat sink on that thing. Would look like a cast-iron Dutch oven :)
[+] [-] asdfasdf1|2 years ago|reply
https://www.cerebras.net/product-chip/
[+] [-] Rexxar|2 years ago|reply
[+] [-] terafo|2 years ago|reply
[+] [-] londons_explore|2 years ago|reply
[+] [-] modeless|2 years ago|reply
[+] [-] txyx303|2 years ago|reply
[+] [-] brucethemoose2|2 years ago|reply
[+] [-] imtringued|2 years ago|reply
[+] [-] TheDudeMan|2 years ago|reply
[+] [-] marmaduke|2 years ago|reply
[+] [-] tivert|2 years ago|reply
[+] [-] eternauta3k|2 years ago|reply
[+] [-] asdfasdf1|2 years ago|reply
https://f.hubspotusercontent30.net/hubfs/8968533/Virtual%20B...
[+] [-] asdfasdf1|2 years ago|reply
- non-sparse fp16 in WSE-2 was 7.5 pflops (about 8 H100s, 10x worse performance per dollar)
Does anyone know the WSE-3 numbers? Datasheet seems lacking loads of details
Also, 2.5 million USD for 1 x WSE-3, why just 44GB tho???
[+] [-] xcv123|2 years ago|reply
You can order one with 1.2 Petabytes of external memory. Is that enough?
"External memory: 1.5TB, 12TB, or 1.2PB"
https://www.cerebras.net/press-release/cerebras-announces-th...
"214Pb/s Interconnect Bandwidth"
https://www.cerebras.net/product-system/
[+] [-] Tuna-Fish|2 years ago|reply
[+] [-] terafo|2 years ago|reply
[+] [-] bee_rider|2 years ago|reply
[+] [-] holoduke|2 years ago|reply
[+] [-] anon291|2 years ago|reply
[+] [-] incrudible|2 years ago|reply
I trust that gamers will outlast every hype, be it crypto or AI.
[+] [-] imtringued|2 years ago|reply
[+] [-] TradingPlaces|2 years ago|reply
[+] [-] api|2 years ago|reply
[+] [-] GaryNumanVevo|2 years ago|reply
[+] [-] beautifulfreak|2 years ago|reply