(no title)
InvOfSmallC | 11 months ago
> Both Llama 4 Scout and Llama 4 Maverick use a Mixture-of-Experts (MoE) design with 17B active parameters each
Are those experts LLMs trained on specific tasks, or what?
vessenes|11 months ago
This generally works well, although there are lots and lots of caveats. But it is (mostly) a free lunch, or at least a discounted lunch. I haven’t seen a ton of analysis on what different experts end up doing, but I believe it’s widely agreed that they tend to specialize. Those specializations (especially if you have a small number of experts) may be pretty esoteric / dense in their own right.
Anthropic’s interpretability team would be the ones to give a really high quality look, but I don’t think any of Anthropic’s current models are MoE.
Anecdotally, I feel MoE models sometimes exhibit slightly less "deep" thinking, but I might just be biased towards more weights. And they are undeniably faster, and better per unit of wall-clock time, GPU time, memory, and bandwidth usage, than dense models with similar training regimes.
zamadatix|11 months ago
So the net result is the same: sets of parameters in the model are specialized and selected for certain inputs. It's just done a bit deeper in the model than one might assume.
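For concreteness, here's a minimal sketch of such a layer (PyTorch, with made-up sizes; not a claim about Llama 4's actual architecture): a small learned router scores each token, and only the top-k experts' weights actually run for it.

    # Minimal mixture-of-experts feed-forward layer: a learned router picks
    # the top-k expert MLPs per token; only those experts' weights are used.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MoEFeedForward(nn.Module):
        def __init__(self, d_model=512, d_hidden=2048, n_experts=16, top_k=2):
            super().__init__()
            self.top_k = top_k
            self.router = nn.Linear(d_model, n_experts)    # per-layer gating network
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                              nn.Linear(d_hidden, d_model))
                for _ in range(n_experts)
            ])

        def forward(self, x):                               # x: (tokens, d_model)
            scores = F.softmax(self.router(x), dim=-1)      # (tokens, n_experts)
            weights, idx = scores.topk(self.top_k, dim=-1)  # top-k experts per token
            out = torch.zeros_like(x)
            for k in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = idx[:, k] == e                   # tokens routed to expert e
                    if mask.any():
                        out[mask] += weights[mask, k, None] * expert(x[mask])
            return out

    # Usage: y = MoEFeedForward()(torch.randn(8, 512))  ->  shape (8, 512)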
philsnow|11 months ago
Meta calls these individually smaller/weaker models "experts" but I've also heard them referred to as "bozos", because each is not particularly good at anything and it's only together that they are useful. Also bozos has better alliteration with boosting and bagging, two terms that are commonly used in ensemble learning.
MoonGhost|11 months ago
Makes sense to compare apples with apples, i.e. the same amount of compute, right? Otherwise you're giving the MoE model less time and then it feels like it underperforms. That shouldn't be surprising...
> These experts are say 1/10 to 1/100 of your model size if it were a dense model
Just to be precise, each layer (attention + fully connected) has its own router and experts. There are usually 30+ layers. It can't be 1/10 per expert, as there are literally hundreds of them across the model.
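A rough back-of-the-envelope sketch of that point (every number below is an illustrative assumption, not Llama 4's real config): with a router and experts in every layer, one expert is a tiny fraction of the whole model, even though the active-parameter count stays small.

    # Illustrative arithmetic only; all numbers are assumptions.
    n_layers      = 32        # transformer blocks, each with its own router + experts
    n_experts     = 128       # experts per layer
    expert_params = 100e6     # parameters in one expert MLP
    shared_params = 10e9      # attention, embeddings, etc. (always active)
    top_k         = 1         # experts activated per token per layer

    total  = shared_params + n_layers * n_experts * expert_params
    active = shared_params + n_layers * top_k * expert_params

    print(f"total params:  {total / 1e9:.0f}B")                       # ~420B
    print(f"active params: {active / 1e9:.1f}B")                      # ~13.2B
    print(f"one expert is {expert_params / total:.4%} of the model")  # ~0.02%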
randomcatuser|11 months ago
So if the model has 16 transformer layers to go through on a forward pass, and at each layer it gets to pick between 16 different choices, that's 16^16 possible expert combinations!
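A quick sanity check of that arithmetic, assuming top-1 routing with the choice made independently at every layer:

    # Number of distinct routing paths a single token could take.
    n_layers, n_experts = 16, 16
    paths = n_experts ** n_layers
    print(paths)   # 18446744073709551616, i.e. roughly 1.8e19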
chaorace|11 months ago
The models get trained largely the same way as non-MoE models, except with specific parts of the model siloed apart past a certain layer. The shared part of the model, prior to the split, is the "router". The router learns how to route in the same way the rest of the network learns anything else, so it's basically a black box in terms of whatever internal structure emerges from this.
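To make that concrete, here's a hedged sketch of the common recipe (the Switch Transformer / GShard-style auxiliary loss; not a claim about what Meta actually used for Llama 4): the router is just more weights trained by the same backprop as the rest of the network, usually with a small extra loss that nudges tokens to spread evenly across experts so none of them sit idle.

    import torch
    import torch.nn.functional as F

    def load_balancing_loss(router_logits, top1_idx, n_experts):
        # fraction of tokens actually dispatched to each expert
        load = F.one_hot(top1_idx, n_experts).float().mean(dim=0)
        # mean routing probability the router assigns to each expert
        prob = F.softmax(router_logits, dim=-1).mean(dim=0)
        # minimized when both distributions are uniform across experts
        return n_experts * torch.sum(load * prob)

    # total_loss = lm_loss + aux_weight * load_balancing_loss(logits, idx, n_experts)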