
joshhart | 2 years ago

If you are making many requests in a batch this works OK, because you can shuffle the next layer in while the current one is processing a set of matrix multiplies. This takes it from being a memory-bound problem to a FLOPS-bound problem. It really only works if you care about throughput and not latency.
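The overlap described above can be sketched as a simple double-buffering loop: a background thread prefetches the next layer's weights while the main loop runs the current layer's batched matrix multiplies. This is a minimal illustration, not a real serving stack; `load_layer` and `run_layer` are hypothetical stand-ins for the actual memory copy and compute.

```python
import threading
from queue import Queue

def load_layer(i):
    # Stand-in for copying layer i's weights into device memory.
    return {"layer": i}

def run_layer(weights, batch):
    # Stand-in for the batched matrix multiplies of one layer.
    return [x + weights["layer"] for x in batch]

def forward(batch, num_layers):
    prefetched = Queue(maxsize=1)  # one layer "in flight" at a time

    def prefetcher():
        for i in range(num_layers):
            prefetched.put(load_layer(i))  # overlaps with compute below

    threading.Thread(target=prefetcher, daemon=True).start()

    acts = batch
    for _ in range(num_layers):
        weights = prefetched.get()  # ready (or nearly ready) by now
        acts = run_layer(weights, acts)
    return acts

print(forward([0, 0], 3))  # each element accumulates 0 + 1 + 2 = 3
```

With a large enough batch, the compute per layer takes long enough to hide the load of the next layer, which is exactly the throughput-over-latency trade-off the comment describes.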

discuss

order

gpderetta | 2 years ago

I understand that for each token Mixtral will only need two (of eight) submodels. I wonder if there is temporal locality and an LRU caching scheme could be used.
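The LRU idea can be sketched with an `OrderedDict`: keep at most `capacity` expert weight blocks resident and evict the least-recently-used one on a miss. `load_expert` is a hypothetical stand-in for fetching weights from disk or host memory; whether this helps in practice depends on how much temporal locality the routing actually has.

```python
from collections import OrderedDict

def load_expert(layer, expert):
    return f"weights[{layer}][{expert}]"  # placeholder payload

class ExpertCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()  # insertion order doubles as recency order

    def get(self, layer, expert):
        key = (layer, expert)
        if key in self.cache:
            self.cache.move_to_end(key)       # hit: mark as most recently used
            return self.cache[key]
        weights = load_expert(layer, expert)  # miss: load the weights
        self.cache[key] = weights
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict least-recently-used entry
        return weights

cache = ExpertCache(capacity=2)
cache.get(0, 3)
cache.get(0, 5)
cache.get(0, 3)               # hit: (0, 3) becomes most recent
cache.get(0, 7)               # evicts (0, 5), the least recently used
print((0, 5) in cache.cache)  # → False
```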

Kubuxu | 2 years ago

It is two out of eight experts at each layer, with 32 layers routed independently of each other. There are no eight "sub-models".

However, this raises a question: could a slightly more complex router use the output of layer n-1 to choose the experts for layer n+1 (rather than choosing the experts for layer n, as is done today)? That way there is more time to load the needed experts for layer n+1.
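The lookahead routing proposed above might look like the following sketch: routing decisions are made one layer early, so each layer's chosen experts could already be loading while the previous layer computes. Everything here is illustrative, not Mixtral's actual router; `router_logits` and the additive "expert" update are hypothetical stand-ins, and the top-2 selection mimics Mixtral's per-token top-2 gating.

```python
import heapq

NUM_EXPERTS = 8
TOP_K = 2

def router_logits(hidden, layer):
    # Stand-in for a learned gating projection of `layer`.
    return [(hidden * (layer + 1) + e) % 7 for e in range(NUM_EXPERTS)]

def pick_experts(hidden, layer):
    logits = router_logits(hidden, layer)
    return heapq.nlargest(TOP_K, range(NUM_EXPERTS), key=lambda e: logits[e])

def forward_with_lookahead(hidden, num_layers):
    schedule = {}
    # Bootstrap: the first two layers must be routed from the initial state.
    schedule[0] = pick_experts(hidden, 0)
    schedule[1] = pick_experts(hidden, 1)
    for n in range(num_layers):
        experts = schedule[n]           # their weights could already be loading
        hidden = hidden + sum(experts)  # stand-in for the expert FFNs
        if n + 2 < num_layers:
            # Route layer n+2 from layer n's output: one full layer of
            # slack to fetch the chosen experts before they are needed.
            schedule[n + 2] = pick_experts(hidden, n + 2)
    return hidden

print(forward_with_lookahead(3, 4))
```

The open question, of course, is whether routing from a stale hidden state (layer n-1's output instead of layer n's input for layer n+1) degrades expert selection quality enough to offset the loading-time win.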