top | item 42846786

(no title)

raphaelj | 1 year ago

Do we know which changes made DeepSeek V3 so much faster and better at training than other models? DeepSeek R1's performances seem to be highly related to V3 being a very good model to start with.

I went through the paper and I understood they made these improvements compared to "regular" MoE models:

1. Latent Multi-head Attention. If I understand correctly, they were able to do some caching on the attention computation. This one is still a little bit confusing to me;

2. New MoE architecture with one shared expert and a large number of small routed experts (256 total, but 8 in use in the same token inference). This was already used in DeepSeek v2;

3. Better load balancing of the training of experts. During training, they add bias or "bonus" value to experts that are less used, to make them more likely to be selected in the future training steps;

4. They added a few smaller transformer layers to predict not only the first next token, but a few additional tokens. Their training error/loss function then uses all these predicted tokens as input, not only the first one. This is supposed to improve the transformer capabilities in predicting sequences of tokens;

5. They are using FP8 instead of FP16 when it does not impact accuracy.

It's not clear to me which changes are the most important, but my guess would be that 4) is a critical improvement.

1), 2), 3) and 5) could explain why their model trains faster by some small factor (+/- 2x), but neither the 10x advertised boost nor how is performs greatly better than models with way more activated parameters (e.g. llama 3).

discuss

whoisburbansky|1 year ago

The key idea of Latent MHA is that "regular" multi-headed attention needs you to keep a bunch of giant key-value (KV) matrices around in memory to do inference. The "Latent" part just means that DeepSeek takes the `n` KV matrices in a given n-headed attention block and replaces them with a lower-rank approximation (think of this as compressing the matrices), so that they take up less VRAM in a GPU at the cost of a little extra compute and a little lost accuracy. So not caching, strictly speaking, but weight compression to trade compute off for better memory usage, which is good because the KV matrices are one of the more expensive part of this transformer architecture. MoE addresses the other expensive part (the fully-connected layers) by making it so only a subset of the fully-connected layers are active at any given forward pass.

whoisburbansky|1 year ago

https://planetbanatt.net/articles/mla.html this is a great overview of how MLA works.

alecco|1 year ago

They also did bandwidth scaling to handle work around the nerfed H800 interconnects.

> efficient cross-node all-to-all communication kernels to fully utilize IB and NVLink bandwidths

> The key idea of DualPipe is to overlap the computation and communication within a pair of individual forward and backward chunks. To be specific, we divide each chunk into four components: attention, all-to-all dispatch, MLP, and all-to-all combine. Specially, for a backward chunk, both attention and MLP are further split into two parts, backward for input and backward for weights, like in ZeroBubble (Qi et al., 2023b). In addition, we have a PP communication component.

(I know some of those words)

https://arxiv.org/html/2412.19437v1

sinuhe69|1 year ago

I think the fact that they used synthetic/distilled high-quality data from GPT4-o output to train in the style of Phi models are of significance as well.