andrewgross | 1 year ago

> The beauty of the MOE model approach is that you can decompose the big model into a collection of smaller models that each know different, non-overlapping (at least fully) pieces of knowledge.

I was under the impression that this is not how MoE models work. They are not a collection of independent models, but rather a way of routing to a subset of active parameters at each layer. There is no "expert" that is loaded or unloaded per question. All of the weights are loaded in VRAM; it's just a matter of which ones are actually loaded into registers for calculation. As far as I can tell from the DeepSeek V3/V2 papers, their MoE approach follows this pattern rather than being an explicit collection of experts. If that's the case, there's no VRAM saving to be had from using an MoE, nor any ability to extract an expert's weights to run locally (aside from distillation or similar).

If there is someone more versed on the construction of MoE architectures I would love some help understanding what I missed here.
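The per-layer routing described above can be sketched in a few lines of NumPy. This is purely illustrative (shapes, names, and the top-k-then-softmax gating are assumptions, not taken from any particular model's paper): every expert's weights stay resident at once, but only the top-k experts actually do work for a given token.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_ff = 8, 16
n_experts, top_k = 4, 2

# All expert weights are resident at once, as described above --
# nothing is "loaded or unloaded per question".
experts = [
    (rng.standard_normal((d_model, d_ff)), rng.standard_normal((d_ff, d_model)))
    for _ in range(n_experts)
]
router = rng.standard_normal((d_model, n_experts))

def moe_layer(x):
    """Route one token vector to its top-k experts and mix their outputs."""
    logits = x @ router
    top = np.argsort(logits)[-top_k:]                        # the k active experts
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over the chosen k
    out = np.zeros_like(x)
    for gate, idx in zip(gates, top):
        w_in, w_out = experts[idx]
        out += gate * np.maximum(x @ w_in, 0.0) @ w_out      # gated FFN output
    return out

token = rng.standard_normal(d_model)
y = moe_layer(token)
print(y.shape)  # (8,)
```

The point is that the router picks different experts per token and per layer, so you can't carve out a standalone "expert model" after the fact.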

discuss

order

Kubuxu|1 year ago

Not sure about DeepSeek R1, but you are right in regards to previous MoE architectures.

It doesn’t reduce memory usage, since each subsequent token might require a different expert, but it does reduce per-token compute/bandwidth usage. If you place experts on different GPUs and run batched inference, you would see these benefits.
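One way to picture the expert-parallel setup mentioned here (the expert-to-GPU assignment and batch are made up for illustration): tokens in a batch get routed, grouped by the device that owns their expert, and each device then processes its share of the batch in one go.

```python
from collections import defaultdict

# Illustrative placement: 8 experts spread across 4 GPUs, two per device.
expert_to_gpu = {e: e // 2 for e in range(8)}

# Pretend router output for one batch: (token_id, chosen_expert) pairs.
routed = [(0, 3), (1, 3), (2, 0), (3, 7), (4, 3), (5, 0)]

# Group tokens by the GPU holding their expert, so each device runs
# its experts over its slice of the batch in a single pass.
per_gpu = defaultdict(list)
for token_id, expert in routed:
    per_gpu[expert_to_gpu[expert]].append((token_id, expert))

for gpu, work in sorted(per_gpu.items()):
    print(f"GPU {gpu}: {work}")
```

With enough tokens in flight, every device stays busy even though each individual token only touches a few experts.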

andrewgross|1 year ago

Is there a concept of an expert that persists across layers? I thought each layer was essentially independent in terms of its "experts". I suppose you could look at which parts of each layer tend to activate together and segregate those by GPU, though.

I could be very wrong on how experts work across layers though, I have only done a naive reading on it so far.
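For what it's worth, the usual picture is that each layer has its own router, so the expert index chosen at one layer says nothing about the next. A toy sketch (seed and shapes are arbitrary; the token vector is held fixed rather than passed through each layer's FFN, just to isolate the routing):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n_experts, top_k, n_layers = 8, 4, 2, 3

# One independent router per layer -- there is no cross-layer "expert".
routers = [rng.standard_normal((d_model, n_experts)) for _ in range(n_layers)]

x = rng.standard_normal(d_model)
choices = []
for layer, router in enumerate(routers):
    top = sorted(np.argsort(x @ router)[-top_k:].tolist())
    choices.append(top)
    print(f"layer {layer}: experts {top}")
```

Nothing ties the indices together across layers, which is why segregating by GPU has to be done per-layer (or by measuring which experts happen to co-activate, as you suggest).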

rahimnathwani|1 year ago

  If you place experts in different GPUs
Right, this is described in the Deepseek V3 paper (section 3.4 on pages 18-20).