uhhh i doubt that multi-language support affects latency. model size, maybe, but what is the mechanism for making latency worse? i think of model latency as O(log(model size))… but i am open to being wrong; that's an educated guess and maybe not a good mental model.
kergonath|25 days ago
janalsncm|25 days ago
The bigger impact comes before that: you need to project the hidden state onto the vocabulary, which takes a matrix of something like 4096x250000. Bigger vocab = more FLOPs.
If you’re on a GPU, things are parallelized, so maybe the cost isn’t quite linear in vocab size if everything fits nicely. But on a CPU you’re going to struggle more.
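Rough back-of-the-envelope for the projection cost, as a sketch rather than a benchmark. The 4096 and 250000 come from the numbers above; the 50000 vocab is a made-up smaller vocabulary for comparison, and `lm_head_flops` is just an illustrative helper name. It counts one multiply and one add per weight and ignores bias, softmax, and memory bandwidth.

```python
def lm_head_flops(hidden_size: int, vocab_size: int) -> int:
    """Approximate FLOPs to project ONE hidden state onto the vocabulary.

    A matrix-vector product with a hidden_size x vocab_size matrix costs
    roughly one multiply and one add per weight.
    """
    return 2 * hidden_size * vocab_size


# 50_000 is a hypothetical smaller vocab, not a number from the thread.
small = lm_head_flops(4096, 50_000)
large = lm_head_flops(4096, 250_000)

print(f"small vocab: {small:,} FLOPs per token")
print(f"large vocab: {large:,} FLOPs per token")
print(f"ratio: {large / small:.1f}x")
```

So under this simple count the output projection scales linearly with vocab size. Whether that shows up as wall-clock latency depends on the hardware, as noted above.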
This is why the juiciest target when shrinking models is the token embedding table. For example, ALBERT factorized the whole embedding table into two low-rank matrices.
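To put numbers on the factorization idea, here is a small parameter count, assuming the same 250000 x 4096 sizes as above (not ALBERT's actual configuration) and an illustrative bottleneck width of 128. The function names are hypothetical helpers for this sketch.

```python
def full_embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Parameters in a single vocab_size x hidden_size embedding table."""
    return vocab_size * hidden_size


def factorized_embedding_params(vocab_size: int, hidden_size: int, bottleneck: int) -> int:
    """Parameters when the table is split into two low-rank matrices:
    (vocab_size x bottleneck) followed by (bottleneck x hidden_size)."""
    return vocab_size * bottleneck + bottleneck * hidden_size


# Sizes taken from the thread; the bottleneck of 128 is an assumption.
vocab_size, hidden_size, bottleneck = 250_000, 4096, 128

full = full_embedding_params(vocab_size, hidden_size)
factorized = factorized_embedding_params(vocab_size, hidden_size, bottleneck)

print(f"full table:  {full:,} params")
print(f"factorized:  {factorized:,} params")
print(f"reduction:   {full / factorized:.1f}x")
```

With these assumed sizes the table shrinks by roughly 31x, since almost all of the cost moves into the much narrower vocab_size x bottleneck matrix.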
ethmarks|25 days ago
make3|25 days ago