raphaelj | 1 year ago
I went through the paper and I understood they made these improvements compared to "regular" MoE models:
1. Multi-head Latent Attention (MLA). If I understand correctly, it compresses the keys and values into a small latent vector per token, which greatly shrinks the KV cache during inference. This one is still a little bit confusing to me;
2. A new MoE architecture with one shared expert and a large number of small routed experts (256 routed experts in total, but only 8 active per token). This was already introduced with DeepSeek v2;
3. Better load balancing of the experts during training. Instead of relying on an auxiliary balancing loss, they add a bias or "bonus" value to the routing score of under-used experts, making them more likely to be selected in future training steps;
4. They added a few small extra transformer layers that predict not only the next token but a few tokens further ahead. Their training loss then covers all these predicted tokens, not just the first one, which is supposed to improve the model's ability to plan over sequences of tokens. Note that they don't use this at inference time, except as a latency optimisation: speculative decoding of the 2nd token.
5. They train in FP8 instead of BF16/FP16 wherever it does not hurt accuracy (mixed-precision training).
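For point 1, the cache-size intuition can be reduced to simple arithmetic. The shapes below are loosely based on DeepSeek-V3's reported configuration (128 heads of dimension 128, a KV latent of dimension 512); I'm ignoring the small decoupled RoPE dimension, so treat the exact 64x figure as illustrative:

```python
# Back-of-the-envelope for Multi-head Latent Attention (MLA): instead of
# caching full per-head K and V vectors, the model caches one small latent
# vector per token and re-expands it into K/V with learned up-projections.
# Shapes are illustrative, loosely based on DeepSeek-V3 (RoPE part omitted).

n_heads, head_dim = 128, 128
latent_dim = 512                               # compressed KV latent per token

kv_cache_per_token = 2 * n_heads * head_dim    # standard MHA: K and V per head
mla_cache_per_token = latent_dim               # MLA: one shared latent

print(kv_cache_per_token // mla_cache_per_token)  # prints 64
```

So the per-token cache is roughly 64x smaller, which is what makes long-context inference cheap.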
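Point 3 (the "auxiliary-loss-free" balancing) is easy to sketch: a per-expert bias is added to the router scores only when picking the top-k experts, and the bias is nudged up for under-used experts and down for over-used ones after each step. Everything below (the update rule, constants, function names) is a toy illustration, not the paper's actual code:

```python
# Toy sketch of bias-based MoE load balancing: under-used experts get a
# growing "bonus" on their routing score, over-used experts get penalised.
# Hypothetical names and update rule; small sizes for illustration
# (DeepSeek-V3 uses 256 routed experts with top-8 routing).

import random

NUM_EXPERTS = 8
TOP_K = 2
BIAS_LR = 0.01       # how fast the balancing bias moves

bias = [0.0] * NUM_EXPERTS

def route(scores):
    """Pick TOP_K experts by score + bias; the bias only affects selection."""
    biased = [s + b for s, b in zip(scores, bias)]
    return sorted(range(NUM_EXPERTS), key=lambda i: biased[i], reverse=True)[:TOP_K]

def update_bias(counts, total_tokens):
    """Raise the bias of under-loaded experts, lower it for over-loaded ones."""
    target = total_tokens * TOP_K / NUM_EXPERTS   # ideal load per expert
    for i in range(NUM_EXPERTS):
        if counts[i] > target:
            bias[i] -= BIAS_LR
        elif counts[i] < target:
            bias[i] += BIAS_LR

# Simulate a skewed router that initially prefers expert 0.
random.seed(0)
for step in range(200):
    counts = [0] * NUM_EXPERTS
    for _ in range(64):  # one "batch" of tokens
        scores = [random.random() for _ in range(NUM_EXPERTS)]
        scores[0] += 0.5  # built-in imbalance
        for e in route(scores):
            counts[e] += 1
    update_bias(counts, 64)

print(bias[0] < 0)  # prints True: the over-used expert ends up penalised
```

The nice property is that this balances the load without adding a balancing term to the loss, so the gradient signal stays purely about prediction quality.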
My guess would be that 4) is the most impactful improvement. 1), 2), 3) and 5) could explain why their model trains faster, but not why it performs considerably better than models with far more activated parameters (e.g. Llama 3).
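To make point 4 concrete, the multi-token-prediction loss can be sketched as an average next-token loss over several future offsets. Here the "model" is just a table of predicted distributions and the depth is 2; the real thing uses small sequential transformer modules, and all names below are mine, not the paper's:

```python
# Toy multi-token-prediction (MTP) loss: cross-entropy summed over the
# predictions for the next token AND the token after it, then averaged.
# predictions[d][t] is the distribution predicted at position t for the
# token at position t + 1 + d. All of this is illustrative.

import math

def cross_entropy(probs, target):
    return -math.log(probs[target])

def mtp_loss(predictions, tokens, depth=2):
    total, count = 0.0, 0
    for d in range(depth):
        for t in range(len(tokens) - 1 - d):
            total += cross_entropy(predictions[d][t], tokens[t + 1 + d])
            count += 1
    return total / count

# Tiny example: vocabulary of size 3, sequence of 4 tokens, uniform guesses.
tokens = [0, 2, 1, 2]
uniform = [1 / 3] * 3
predictions = [
    [uniform] * 3,  # depth 0: predicts tokens[1..3]
    [uniform] * 2,  # depth 1: predicts tokens[2..3]
]
print(round(mtp_loss(predictions, tokens), 4))  # prints 1.0986, i.e. ln(3)
```

The extra depth-1 terms are what give the model a training signal about tokens further ahead; at inference you can drop them (or reuse the depth-1 head for speculative decoding, as the comment above notes).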