InvOfSmallC | 11 months ago

For a super ignorant person:

> Both Llama 4 Scout and Llama 4 Maverick use a Mixture-of-Experts (MoE) design with 17B active parameters each

Are those experts LLMs trained on specific tasks, or what?

vessenes|11 months ago

This was an idea that sounded somewhat silly until it was shown to work. The idea is that, through training, you encourage a bunch of “experts” to diversify and “get good” at different things. Each expert is, say, 1/10 to 1/100 the size your model would be if it were dense. You pack them all into one model and add a layer (or a few layers) whose job is to pick which small expert is best for a given input token and route the token to it. Voila: you’ve turned a full run through the dense parameters into a quick run through a router followed by a run through a little model that takes roughly 1/10 as long. How do you get a “picker” that’s good? Well, it’s differentiable, and all we have in ML is a hammer, so just do gradient descent on the decider while training the experts!
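To make the routing step concrete, here is a minimal forward-pass sketch of one MoE layer in plain Python. It is an illustration under stated assumptions, not how Llama 4 or any particular model implements it: the sizes, the random weights, the `moe_layer` helper, and the use of a single linear map per expert (real experts are typically small feed-forward networks) are all made up for the example. Training the router with gradient descent is not shown.

```python
import math
import random

def softmax(scores):
    """Turn raw router scores into probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def make_matrix(rng, rows, cols):
    """Random weights; a trained model would have learned values here."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def moe_layer(token, router_weights, experts, top_k=1):
    """Route one token vector to its top_k experts and mix their outputs."""
    probs = softmax(matvec(router_weights, token))
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    output = [0.0] * len(token)
    for i in chosen:
        expert_out = matvec(experts[i], token)  # only the chosen experts run
        weight = probs[i] / norm                # renormalised gate weight
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output, chosen

# ASSUMPTION: toy sizes, chosen only to keep the example readable.
rng = random.Random(0)
DIM, NUM_EXPERTS = 4, 8
router_weights = make_matrix(rng, NUM_EXPERTS, DIM)
experts = [make_matrix(rng, DIM, DIM) for _ in range(NUM_EXPERTS)]

token = [0.5, -1.0, 0.25, 2.0]
output, chosen = moe_layer(token, router_weights, experts, top_k=2)
print("experts used for this token:", chosen)
```

With `top_k=2` only two of the eight expert matrices are multiplied for this token, which is where the compute saving comes from. The router itself is just one small matrix, so running it is cheap.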

This generally works well, although there are lots and lots of caveats. But it is (mostly) a free lunch, or at least a discounted lunch. I haven’t seen a ton of analysis on what different experts end up doing, but I believe it’s widely agreed that they tend to specialize. Those specializations (especially if you have a small number of experts) may be pretty esoteric / dense in their own right.

Anthropic’s interpretability team would be the ones to give a really high quality look, but I don’t think any of Anthropic’s current models are MoE.

Anecdotally, I feel MoE models sometimes exhibit slightly less “deep” thinking, but I might just be biased towards more weights. And they are undeniably faster and better per second of clock time and per unit of GPU time, memory, or bandwidth (on all of these) than dense models with similar training regimes.

zamadatix|11 months ago

The only thing about this which may be unintuitive from the name is that an "Expert" is not something like a sub-LLM that's good at math and gets called when you ask a math question. Models like this run tokens through layers of networks, and each layer is composed of many sub-networks (256 in some models), any of which can be selected (or several selected and merged in some way) for each layer independently.

So the net result is the same: sets of parameters in the model are specialized and selected for certain inputs. It's just done a bit deeper in the model than one might assume.

philsnow|11 months ago

The idea has also been around for at least 15 years; "ensemble learning" was a topic in my "Data Mining" textbook from around then.

Meta calls these individually smaller/weaker models "experts" but I've also heard them referred to as "bozos", because each is not particularly good at anything and it's only together that they are useful. Also bozos has better alliteration with boosting and bagging, two terms that are commonly used in ensemble learning.

Buttons840|11 months ago

If I have 5000 documents about A, and 5000 documents about B, do we know whether it's better to train one large model on all 10,000 documents, or to train 2 different specialist models and then combine them as you describe?

MoonGhost|11 months ago

> Anecdotally, I feel MoE models sometimes exhibit slightly less “deep” thinking

It makes sense to compare apples with apples. Same amount of compute, right? Or are you giving the MoE model less time and then feeling that it underperforms? That shouldn't be surprising...

> These experts are say 1/10 to 1/100 of your model size if it were a dense model

Just to be correct, each layer (attention + fully connected) has its own router and experts. There are usually 30+ layers. An expert can't be 1/10 of the model, as there are literally hundreds of them.
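The arithmetic behind that point can be sketched as follows. Every number here is hypothetical and picked for round results, not taken from Llama 4; the `moe_param_counts` helper is made up for the example, and the router's own (small) parameter count is ignored.

```python
def moe_param_counts(num_layers, num_experts, top_k, expert_params, shared_params):
    """Return (total, active-per-token) parameter counts for a toy MoE stack.

    shared_params: per-layer parameters every token uses (e.g. attention).
    expert_params: parameters in one expert block.
    """
    total = num_layers * (shared_params + num_experts * expert_params)
    active = num_layers * (shared_params + top_k * expert_params)
    return total, active

# ASSUMPTION: illustrative sizes only.
NUM_LAYERS, NUM_EXPERTS, TOP_K = 32, 16, 1
EXPERT_PARAMS, SHARED_PARAMS = 100_000_000, 50_000_000

total, active = moe_param_counts(NUM_LAYERS, NUM_EXPERTS, TOP_K, EXPERT_PARAMS, SHARED_PARAMS)
expert_blocks = NUM_LAYERS * NUM_EXPERTS

print(f"expert blocks in the model: {expert_blocks}")
print(f"total parameters:           {total:,}")
print(f"active per token:           {active:,}")
print(f"one expert block is {EXPERT_PARAMS / total:.2%} of the model")
```

In this toy configuration there are 512 separate expert blocks, so a single block is a small fraction of a percent of the weights, while the parameters a token actually passes through are about a tenth of the total. That gap between "total" and "active" is what a figure such as "17B active parameters" refers to.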

tomjen3|11 months ago

Cool. Does that mean I could just run the query through the router and then load only the required expert? That is, could I feasibly run this on my MacBook?
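One way to think about this question: as other comments in the thread note, routing happens per token and per layer, and the router at a given layer needs the output of the layer before it, so there is no single up-front routing decision for a whole query. The toy simulation below counts how many distinct expert blocks a prompt touches. It assumes top-1 routing and, loudly, uses a uniform random choice as a stand-in for a learned router, so the exact numbers say nothing about any real model.

```python
import random

def experts_touched(num_tokens, num_layers, num_experts, rng):
    """Count distinct (layer, expert) pairs used by a prompt under top-1 routing."""
    touched = set()
    for _ in range(num_tokens):
        for layer in range(num_layers):
            # ASSUMPTION: uniform random choice stands in for a learned router.
            expert = rng.randrange(num_experts)
            touched.add((layer, expert))
    return len(touched)

NUM_LAYERS, NUM_EXPERTS = 16, 16  # hypothetical sizes
total_blocks = NUM_LAYERS * NUM_EXPERTS

for num_tokens in (1, 10, 100):
    used = experts_touched(num_tokens, NUM_LAYERS, NUM_EXPERTS, random.Random(0))
    print(f"{num_tokens:>3} tokens -> {used}/{total_blocks} expert blocks touched")
```

A single token touches exactly one block per layer, but the set of touched blocks grows as more tokens are processed. Under these assumptions the saving from MoE is mainly in compute per token, not in how many weights need to be reachable during a run.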

faraaz98|11 months ago

I've been calling for this approach for a while. It's kinda similar to how the human brain has areas that are good at specific tasks

randomcatuser|11 months ago

yes, and it's on a per-layer basis, I think!

So if the model has 16 transformer layers to go through on a forward pass, and at each layer it gets to pick between 16 different choices, that's like 16^16 possible expert combinations!
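For a sense of scale, the count can be worked out directly. The 16 layers and 16 experts are the numbers from the comment above; the top-2 variant is an added assumption to show how the count changes when two experts are picked per layer.

```python
import math

NUM_LAYERS = 16
EXPERTS_PER_LAYER = 16

# One expert chosen independently at each layer.
paths_top1 = EXPERTS_PER_LAYER ** NUM_LAYERS
print(paths_top1)            # 18446744073709551616
print(paths_top1 == 2 ** 64) # True

# ASSUMPTION: top-2 routing, i.e. an unordered pair of experts per layer.
pairs_per_layer = math.comb(EXPERTS_PER_LAYER, 2)  # 120
paths_top2 = pairs_per_layer ** NUM_LAYERS
print(paths_top2 > paths_top1)  # True
```

This is the number of distinct routes a single token could take, not a count of separately trained models.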

mrbonner|11 months ago

So this is kind of an ensemble sort of thing in ML like random forest and GBT?
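A rough way to see the difference between the two ideas: in a bagging-style ensemble every member runs on every input and the results are combined, whereas in a sparse MoE a gate selects which member runs. The sketch below uses three trivial functions as stand-in "models" and a hand-written gate; both are hypothetical, and in a real MoE the gate is learned and the selection happens inside each layer for each token, not once for the whole model.

```python
def ensemble_predict(models, x):
    """Bagging-style ensemble: every model runs, outputs are averaged."""
    outputs = [m(x) for m in models]
    return sum(outputs) / len(outputs), len(models)

def moe_predict(models, gate, x):
    """Sparse MoE-style: the gate picks one model and only that one runs."""
    chosen = gate(x)
    return models[chosen](x), 1

# ASSUMPTION: toy stand-ins for models and a hand-written (not learned) gate.
models = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: x * x]
gate = lambda x: 0 if x < 0 else (1 if x < 10 else 2)

ens_value, ens_runs = ensemble_predict(models, 3.0)
moe_value, moe_runs = moe_predict(models, gate, 3.0)
print(f"ensemble: value={ens_value:.3f}, models run={ens_runs}")
print(f"moe:      value={moe_value:.3f}, models run={moe_runs}")
```

The second return value is the number of models evaluated: all three for the ensemble, one for the gated version.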

chaorace|11 months ago

The "Experts" in MoE are less like a panel of doctors and more like different brain regions with interlinked yet specialized functions.

The models get trained largely the same way as non-MoE models, except with specific parts of the model siloed apart past a certain layer. The shared part of the model, prior to the splitting, is the "router". The router learns how to route the way the rest of the network learns, so whatever internal structure emerges from this is basically a black box.

pornel|11 months ago

No, it's more like sharding of parameters. There's no understandable distinction between the experts.

vintermann|11 months ago

I understand they're only optimizing for load distribution, but have people been trying to disentangle what the various experts learn?
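For readers wondering what "optimizing for load distribution" can look like, one common form is an auxiliary loss of the kind used in Switch-Transformer-style models: the number of experts times the sum, over experts, of (fraction of tokens routed to that expert) times (mean router probability for that expert). The sketch below assumes top-1 routing and made-up router outputs; whether Llama 4 uses this exact loss is not stated in the thread.

```python
def load_balancing_loss(router_probs, assignments, num_experts):
    """Auxiliary balance loss: num_experts * sum_i(f_i * P_i).

    router_probs: one probability list per token (each sums to 1).
    assignments:  the expert index each token was actually sent to.
    """
    num_tokens = len(assignments)
    fraction_routed = [assignments.count(e) / num_tokens for e in range(num_experts)]
    mean_prob = [sum(p[e] for p in router_probs) / num_tokens for e in range(num_experts)]
    return num_experts * sum(f * p for f, p in zip(fraction_routed, mean_prob))

NUM_EXPERTS = 4

# ASSUMPTION: hand-made router outputs for four tokens.
balanced = load_balancing_loss(
    router_probs=[[0.25, 0.25, 0.25, 0.25]] * 4,
    assignments=[0, 1, 2, 3],
    num_experts=NUM_EXPERTS,
)
collapsed = load_balancing_loss(
    router_probs=[[1.0, 0.0, 0.0, 0.0]] * 4,
    assignments=[0, 0, 0, 0],
    num_experts=NUM_EXPERTS,
)
print(balanced)   # 1.0
print(collapsed)  # 4.0
```

The loss is lowest when tokens are spread evenly and grows when the router sends everything to one expert. Note that it says nothing about what each expert should learn, only how evenly they are used, which is why the specializations that emerge are hard to name.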

brycethornton|11 months ago

I believe Mixture-of-Experts is a way for a neural network to group certain knowledge into smaller subsets. AFAIK there isn't a specific grouping goal; the network just figures out what goes where on its own, and then, when an inference request is made, it determines which "expert" would have that knowledge and routes the request there. This makes the inference process much more efficient.