If you are serving many requests in a batch this works fine, because you can shuffle the next layer's weights in while the current layer is busy with its matrix multiplies. That turns a memory-bound problem into a flops-bound one. It only helps if you care about throughput rather than latency, though.
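A minimal sketch of that overlap, assuming a big enough batch that each layer's matmuls take longer than the weight transfer. All the names here (`load_weights`, `matmul_layer`, `run_pipelined`) are made up for illustration; in a real stack the background "load" would be an async disk/host-to-device copy rather than a Python thread.

```python
import threading

def load_weights(layer_idx):
    # stand-in for a slow disk->RAM or host->device weight transfer
    return f"weights[{layer_idx}]"

def matmul_layer(batch, weights):
    # stand-in for the layer's matrix multiplies over the whole batch
    return batch + [weights]

def run_pipelined(batch, num_layers):
    # double-buffer: while layer k computes, layer k+1's weights load
    next_w = load_weights(0)
    for k in range(num_layers):
        w = next_w
        loader, result = None, {}
        if k + 1 < num_layers:
            loader = threading.Thread(
                target=lambda idx=k + 1: result.update(w=load_weights(idx)))
            loader.start()                    # load overlaps the compute below
        batch = matmul_layer(batch, w)        # memory traffic hidden behind flops
        if loader:
            loader.join()
            next_w = result["w"]
    return batch
```

The latency caveat from above shows up here too: a single request (batch of one) finishes its matmuls long before the next load does, so the transfer is no longer hidden.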
gpderetta|2 years ago
Kubuxu|2 years ago
However, this raises a question: could a slightly more complex router use the output of layer n-1 to choose the experts for layer n+1 (instead of layer n's output choosing for layer n+1, as today)? That way there is a whole extra layer of compute time in which to load the needed experts for layer n+1.
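A toy sketch of that idea, under the assumption that a router reading the hidden state entering layer n (i.e. the output of layer n-1) can predict layer n+1's experts well enough. Every name here (`router`, `forward`, the toy scoring) is hypothetical; a real router would be a small learned gating network, and the "prefetch" would be an async copy of expert weights:

```python
def router(hidden, layer, num_experts=8, top_k=2):
    # stand-in for a learned gating network: deterministic toy scoring
    return sorted(range(num_experts),
                  key=lambda e: (len(hidden) * 31 + layer * 7 + e) % num_experts)[:top_k]

def forward(hidden, num_layers):
    # bootstrap: layer 1's experts must be chosen before any overlap exists
    prefetched = {1: router(hidden, 1)}
    for n in range(1, num_layers):
        experts = prefetched.pop(n)   # ideally already resident by now
        # while layer n computes, pick layer n+1's experts from layer n-1's
        # output (the `hidden` we hold *before* applying layer n)
        if n + 1 < num_layers:
            prefetched[n + 1] = router(hidden, n + 1)
        hidden = f"layer{n}({hidden} via {experts})"
    return hidden
```

The trade-off is accuracy: the router is now predicting two layers ahead from a staler hidden state, so mispredictions would force a stall to fetch the experts the freshest router would have picked.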