top | item 42837045

andrewgross | 1 year ago

Is there a concept of an expert that persists across layers? I thought each layer was essentially independent in terms of the "experts". I suppose you could look at what part of each layer was most likely to trigger together and segregate those by GPU though.

I could be very wrong on how experts work across layers though, I have only done a naive reading on it so far.

rahimnathwani | 1 year ago

  I suppose you could look at what part of each layer was most likely to trigger together and segregate those by GPU though
Yes, I think that's what they describe in section 3.4 of the V3 paper. Section 2.1.2 talks about "token-to-expert affinity". I think there's a layer which calculates these affinities (between a token and an expert) and then sends the computation to the GPUs with the right experts.
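To make the "token-to-expert affinity" idea concrete, here's a toy sketch of that kind of routing: score a token's hidden state against a learned centroid per expert, softmax over experts, and keep the top-k. The expert count, hidden size, and centroid-dot-product scoring here are simplified placeholders, not the actual V3 architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

num_experts = 8   # experts in one MoE layer (toy number, not V3's real count)
top_k = 2         # experts activated per token
d_model = 16      # hidden size (toy value)

# Hypothetical learned centroid per expert; affinity = dot product with the token.
centroids = rng.normal(size=(num_experts, d_model))

def route(token_hidden):
    """Return the top_k expert indices for one token, plus their routing weights."""
    affinities = token_hidden @ centroids.T           # shape (num_experts,)
    scores = np.exp(affinities - affinities.max())
    scores /= scores.sum()                            # softmax over experts
    top = np.argsort(scores)[-top_k:][::-1]           # highest-affinity experts first
    return top, scores[top]

token = rng.normal(size=d_model)
experts, weights = route(token)
print(experts, weights)
```

In an expert-parallel setup, those indices would then decide which GPU the token's computation gets sent to.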

This doesn't sound like it would work if you're running just one chat, since you'd need all the experts loaded at once to avoid spending lots of time loading and unloading models. But at scale, with batches of requests, it should work. There's some discussion of this in 2.1.2, but it's beyond my current ability to comprehend!
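A toy illustration of why batching helps: with many tokens in flight at once, every expert tends to receive some work each step, so a GPU hosting a given expert isn't sitting idle. Top-1 routing and the counts below are simplifications for the sketch, not how V3 actually dispatches:

```python
from collections import defaultdict

import numpy as np

rng = np.random.default_rng(1)
num_experts = 4
batch = 32  # tokens pooled from many concurrent requests

# Hypothetical routing result: one chosen expert index per token.
assignments = rng.integers(0, num_experts, size=batch)

# Group token indices by expert, as a stand-in for sending each group
# to the GPU that hosts that expert's weights.
groups = defaultdict(list)
for tok, exp in enumerate(assignments):
    groups[int(exp)].append(tok)

for exp in sorted(groups):
    print(f"expert {exp}: {len(groups[exp])} tokens")
```

With batch = 1 most of those groups would be empty, which is the single-chat problem described above.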

andrewgross | 1 year ago

Ahh got it, thanks for the pointer. I am surprised there is enough correlation there to allow an entire GPU to be specialized. I'll have to dig into the paper again.