(no title)
randomcatuser | 11 months ago
So if the model has 16 transformer layers to go through on a forward pass, and each layer, it gets to pick between 16 different choices, that's like 16^16 possible expert combinations!
randomcatuser | 11 months ago
So if the model has 16 transformer layers to go through on a forward pass, and each layer, it gets to pick between 16 different choices, that's like 16^16 possible expert combinations!
No comments yet.