johndough | 15 days ago
If a layer fits completely in SRAM (as is probably the case for Cerebras), you only have to communicate the hidden states between chips for each token. The hidden states are very small (7168 floats per token for DeepSeek-V3.2 https://huggingface.co/deepseek-ai/DeepSeek-V3.2/blob/main/c... ), so transferring them won't be a bottleneck.
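To put a rough number on "very small", here is a back-of-the-envelope sketch. The hidden size of 7168 is the figure from above; the bytes per value (2 for bf16, 4 for fp32) and the 1000 tokens/s rate are my own assumptions for illustration, not measured numbers, and the function names are made up.

```python
# Rough estimate of inter-chip traffic when only hidden states cross chip boundaries.
# Assumptions (not from any vendor documentation): one hidden-state vector per token
# per chip boundary, stored as bf16 (2 bytes) or fp32 (4 bytes).

HIDDEN_SIZE = 7168  # values per token, the figure quoted for DeepSeek-V3.2


def hidden_state_bytes(hidden_size: int, bytes_per_value: int) -> int:
    """Bytes needed to send one token's hidden state across one chip boundary."""
    return hidden_size * bytes_per_value


def link_bandwidth_bytes_per_s(hidden_size: int, bytes_per_value: int,
                               tokens_per_s: int) -> int:
    """Sustained bytes per second over one chip boundary at a given token rate."""
    return hidden_state_bytes(hidden_size, bytes_per_value) * tokens_per_s


if __name__ == "__main__":
    for name, width in (("bf16", 2), ("fp32", 4)):
        per_token = hidden_state_bytes(HIDDEN_SIZE, width)
        # 1000 tokens/s is a hypothetical rate chosen only to get a feel for scale.
        per_second = link_bandwidth_bytes_per_s(HIDDEN_SIZE, width, 1000)
        print(f"{name}: {per_token:,} bytes/token, {per_second:,} bytes/s at 1000 tok/s")
```

Even with fp32 that is under 30 KB per token per boundary, i.e. tens of MB/s at the assumed rate.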
Things get more complicated if a layer does not fit in SRAM, but it still works out fine in the end.