top | item 45843360

(no title)

zackangelo | 3 months ago

What 1T parameter base model have you seen from any of those labs?

riku_iki|3 months ago

It's MoE; each expert tower can be branched from some smaller model.

jychang|3 months ago

That's not how MoE works. You need to train the FFN experts directly, together with the gate, or else the gate would have no clue which expert to activate.
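
For anyone following along, here is a rough sketch of the routing step being described. It is a toy top-k MoE layer in pure Python, not any lab's actual implementation; the names (`moe_forward`, `gate_W`, `make_ffn`), the sizes, and the random weights are all made up for illustration. The point it tries to show is that the gate is just another weight matrix that scores experts per token, so it only routes sensibly if it was trained alongside the experts it selects.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def make_ffn(d_model, d_hidden, rng):
    """Build one hypothetical expert: a two-layer FFN with ReLU."""
    W1 = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    W2 = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]

    def ffn(x):
        h = [max(0.0, v) for v in matvec(W1, x)]  # ReLU activation
        return matvec(W2, h)

    return ffn


def moe_forward(x, gate_W, experts, top_k=2):
    """Route one token through the top_k experts chosen by the gate."""
    logits = matvec(gate_W, x)  # one score per expert
    ranked = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([logits[i] for i in chosen])  # renormalize over chosen
    out = [0.0] * len(x)
    for w, i in zip(weights, chosen):
        y = experts[i](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, chosen, weights


# Assumed toy sizes; real models are many orders of magnitude larger.
rng = random.Random(0)
d_model, d_hidden, n_experts = 4, 8, 4
experts = [make_ffn(d_model, d_hidden, rng) for _ in range(n_experts)]
gate_W = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(n_experts)]

x = [1.0, -0.5, 0.25, 2.0]
out, chosen, weights = moe_forward(x, gate_W, experts)
print("experts used:", chosen)
print("mixing weights:", weights)
```

With an untrained `gate_W` like the one above, the choice of experts is essentially arbitrary, which is roughly the situation you would be in if the experts were bolted on from separately trained models without training the gate on them.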