zackangelo | 3 months ago
What 1T parameter base model have you seen from any of those labs?
riku_iki | 3 months ago
It's MoE; each expert tower can be branched from some smaller model.
jychang | 3 months ago
That's not how MoE works. You need to train the FFN directly, or else the FFN gate would have no clue how to activate the expert.
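For anyone following along, here is a minimal sketch of the kind of gate being discussed, assuming a plain top-k softmax router. The names `route` and `gate_weights` and the sizes are made up for illustration and are not taken from any particular model. The gate weights below are random, so the routing is arbitrary; in a trained MoE they are typically learned together with the expert FFNs, which is why the gate knows which expert to pick.

```python
import math
import random

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(token, gate_weights, k=2):
    """Return the k (expert_index, weight) pairs the gate picks for one token."""
    # One logit per expert: dot product of the token with that expert's gate row.
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    probs = softmax(logits)
    # Keep the k most probable experts and renormalize their weights to sum to 1.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# Hypothetical sizes: 4-dim token, 8 experts, random (untrained) gate.
random.seed(0)
d_model, n_experts = 4, 8
gate_weights = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_experts)]
token = [0.5, -1.0, 0.25, 2.0]
chosen = route(token, gate_weights, k=2)
print(chosen)
```

The token's output would then be the weighted sum of the chosen experts' FFN outputs, using the weights returned by `route`.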