guntars | 1 year ago

Since it's an MoE model with 37B active params, I'd imagined you wouldn't need enough RAM to hold the whole model in memory, just the active bits.

rahimnathwani|1 year ago

The active bits can change with every token, because the router picks experts per token. So you need the whole model in memory, even though the computation for any single token only touches a subset of it. The memory efficiency shows up when you run multiple sessions in parallel, since they all share the same resident weights.
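To make that concrete, here is a toy sketch of per-token routing. Everything in it is an assumption for illustration: the names `route_token` and `experts_touched` are made up, the 8-expert / top-2 setup is arbitrary, and the router scores are random rather than learned (a real model's router is a trained layer, and routing happens at every MoE layer, not just one). It only shows that the set of experts touched grows quickly as more tokens are processed.

```python
import random


def route_token(rng, num_experts, top_k):
    """Toy router: give each expert a random score and keep the top_k highest.

    A real MoE router computes these scores from the token's hidden state;
    random scores are a stand-in used here purely for illustration.
    """
    scores = [rng.gauss(0.0, 1.0) for _ in range(num_experts)]
    ranked = sorted(range(num_experts), key=lambda e: scores[e], reverse=True)
    return set(ranked[:top_k])


def experts_touched(num_tokens, num_experts=8, top_k=2, seed=0):
    """Return the set of experts used at least once across num_tokens tokens."""
    rng = random.Random(seed)
    touched = set()
    for _ in range(num_tokens):
        # Each token activates only top_k experts, but a different token
        # may activate a different subset.
        touched |= route_token(rng, num_experts, top_k)
    return touched


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, "tokens ->", len(experts_touched(n)), "of 8 experts touched")
```

A single token only ever uses `top_k` experts, but you can't know in advance which ones the next token will need, so dropping the "inactive" experts from memory would mean reloading weights constantly.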