Ahh got it, thanks for the pointer. I am surprised there is enough correlation there to allow an entire GPU to be specialized. I'll have to dig into the paper again.
It does. They have 256 routed experts per MLP layer, plus a shared one. The minimal deployment for decoding (i.e. token generation) they recommend is 320 GPUs (H800). It is all in the DeepSeek v3 paper, which everyone should read rather than speculating.
Got it. I'll review the paper again for that portion. However, it still sounds like the end result is not VRAM savings but efficiency and speed improvements.
I don't think an entire GPU is specialised, nor that a single token will always use the same expert. I think of it as a gather-scatter operation at each layer.
Let's say you have an inference batch of 128 chats. At layer `i` you take the hidden states, compute their routing, and scatter them (along with the KV cache for those layers) among the GPUs, each one handling a different set of experts; the attention and FF happen on those GPUs (since the model params live there), and the results get gathered again.
You might be able to avoid the gather by performing the routing on each of the GPUs, but I'm generally guessing here.
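Very roughly, the per-layer step I have in mind looks something like this single-GPU PyTorch sketch. The names (`NUM_EXPERTS`, `moe_layer`, etc.) are made up and the numbers are toy-sized; this is just to illustrate the data movement, not DeepSeek's actual implementation. In an expert-parallel deployment the slicing and `index_add_` below would be all-to-all communication between GPUs.

```python
import torch
import torch.nn as nn

NUM_EXPERTS = 8   # DeepSeek v3 uses 256 routed experts; 8 keeps the toy small
TOP_K = 2         # experts selected per token
D_MODEL = 64

router = nn.Linear(D_MODEL, NUM_EXPERTS)
experts = nn.ModuleList([
    nn.Sequential(nn.Linear(D_MODEL, 4 * D_MODEL), nn.GELU(),
                  nn.Linear(4 * D_MODEL, D_MODEL))
    for _ in range(NUM_EXPERTS)
])

def moe_layer(hidden: torch.Tensor) -> torch.Tensor:
    # hidden: (num_tokens, d_model), e.g. one decoding step for 128 chats
    weights, expert_ids = router(hidden).softmax(-1).topk(TOP_K, dim=-1)
    out = torch.zeros_like(hidden)
    for e in range(NUM_EXPERTS):
        token_idx, slot = (expert_ids == e).nonzero(as_tuple=True)
        if token_idx.numel() == 0:
            continue
        # "scatter": with expert parallelism these tokens would be shipped to
        # the GPU that owns expert e; here we just slice them out locally
        expert_out = experts[e](hidden[token_idx])
        # "gather": weighted expert outputs are added back into each token's row
        out.index_add_(0, token_idx,
                       weights[token_idx, slot].unsqueeze(-1) * expert_out)
    return out

print(moe_layer(torch.randn(128, D_MODEL)).shape)  # torch.Size([128, 64])
```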