andrewgross | 1 year ago
I was under the impression that this was not how MoE models work. They are not a collection of independent models, but rather a way of routing to a subset of active parameters at each layer. There is no "expert" that is loaded or unloaded per question. All of the weights are loaded in VRAM; it's just a matter of which are actually loaded to the registers for calculation. As far as I could tell from the DeepSeek v3/v2 papers, their MoE approach follows this instead of being an explicit collection of experts. If that's the case, there's no VRAM saving to be had using an MoE, nor an ability to extract the weights of an expert to run locally (aside from distillation or similar).
If there is someone more versed on the construction of MoE architectures I would love some help understanding what I missed here.
Kubuxu | 1 year ago
It doesn't reduce memory usage, as each subsequent token might require a different expert, but it reduces per-token compute/bandwidth usage. If you place experts on different GPUs and run batched inference, you would see these benefits.
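A rough numpy sketch of the dispatch step this describes (shapes are hypothetical; real systems use all-to-all communication between devices rather than index masks): tokens from a batch get grouped by their routed expert, so each "GPU" runs one dense batched matmul over just its share of the tokens, and per-token compute stays constant no matter how many experts exist.

```python
import numpy as np

rng = np.random.default_rng(1)
N_TOKENS, D, N_EXPERTS = 32, 16, 4

x = rng.standard_normal((N_TOKENS, D))
assignment = rng.integers(0, N_EXPERTS, size=N_TOKENS)  # from the router
expert_w = rng.standard_normal((N_EXPERTS, D, D))       # one expert per GPU, say

out = np.empty_like(x)
for e in range(N_EXPERTS):
    mine = np.where(assignment == e)[0]  # tokens routed to "GPU e"
    # One batched matmul per expert instead of per token: each token
    # costs one D x D matmul regardless of how many experts exist.
    out[mine] = x[mine] @ expert_w[e]
```

All N_EXPERTS weight matrices stay resident the whole time; the saving is that each token only touches one of them.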
andrewgross | 1 year ago
I could be very wrong on how experts work across layers, though; I have only done a naive reading of it so far.
rahimnathwani | 1 year ago