wmf|2 years ago

Limited memory, poor performance, etc. If you flip these performance-improvement announcements they become damning; e.g. a 6x improvement means they were previously running at only 1/6th of the performance they now claim.
sillywalk|2 years ago

Not that I know anything about these things, but FTA:

"...To briefly recap: the Cerebras Wafer Scale Cluster extends the 40GB of on-chip memory of each CS-2 with over 12 Terabytes of external memory. Unlike GPUs which only have 80GB of memory for all weights and activations, we store weights in external memory and activations in on-chip memory. The CS-2 works on one layer of the model at a time and weights are streamed in as needed, hence the name – Weight Streaming. This model provides us with over 100x larger aggregate memory capacity than GPUs, allowing us to natively support trillion parameter models without resorting to complex model partitioning schemes such as tensor and pipeline parallelism."
"..To briefly recap: the Cerebras Wafer Scale Cluster extends the 40GB of on-chip memory of each CS-2 with over 12 Terabytes of external memory. Unlike GPUs which only have 80GB of memory for all weights and activations, we store weights in external memory and activations in on-chip memory. The CS-2 works on one layer of the model at a time and weights are streamed in as needed, hence the name – Weight Streaming. This model provides us with over 100x larger aggregate memory capacity than GPUs, allowing us to natively support trillion parameter models without resorting to complex model partitioning schemes such as tensor and pipeline parallelism.