top | item 41928909

(no title)

quirkot | 1 year ago

Is this true?

>> Do not panic! A lot of the large LLM vocabularies are pretty huge (30k-300k tokens large)

Seems small by an order of magnitude (at least). English alone is 1+ millions words

discuss

order

macleginn|1 year ago

Most of these 1+ million words are almost never used, so 200k is plenty for English. Optimistically, we hope that rarer words would be longer and to some degree compositional (optim-ism, optim-istic, etc.), but unfortunately this is not what tokenisers arrive at (and you are more likely to get "opt-i-mis-m" or something like that). People have tried to optimise tokenisation and the main part of LLM training jointly, which leads to more sensible results, but this is unworkable for larger models, so we are stuck with inflated basic vocabularies.

It is also probably possible now to go even for larger vocabularies, in the 1-2 million range (by factorising the embedding matrix, for example), but this does not lead to noticeable improvements in performance, AFAIK.

Der_Einzige|1 year ago

Performance would be massively improved on constrained text tasks. That alone makes it worth it to expand the vocabulary size.

mmoskal|1 year ago

Tokens are often sub-word, all the way down to bytes (which are implicitly understood as UTF8 but models will sometimes generate invalid UTF8...).

spott|1 year ago

BPE is complete. Every valid Unicode string can be encoded with any BPE tokenizer.

BPE basically starts with a token for every valid value for a Unicode byte and then creates new tokens by looking at common pairs of bytes (‘t’ followed by ‘h’ becomes a new token ’th’)