Exactly. But in practice you have to trade one thing for another.
Before BPE, I bailed on a project because the sponsor insisted on using word vectors, and I thought: "Look, the most important words in our documents will be out-of-dictionary, and that's like playing chess down a queen, a rook, and two pawns."
Once BPE and similar tokenizers came out, you could say the model at least has a chance when it confronts out-of-dictionary words, which will always be important. This was critical to the success of transformers for text.
On the other hand, there is plenty wrong with tokenization for particular applications. If you want to handle Japanese text, you'd think a word like 日本語 ("Japanese") should be tokenized as a single word, or as 日本 + 語 ("Japan" + "language").
A multilingual model, however, is likely to tokenize it below even the Unicode character level, so you don't get 日 + 本 + 語 ("sun" + "origin" + "language") but might get the underlying UTF-8 bytes, e6 + 97 + a5 + e6 + 9c + ac + e8 + aa + 9e, which is just awful.
The trouble is that an English language model doesn't want to waste a limited supply of tokens on other languages, even though it should be able to handle a few foreign characters. A Japanese language model would clearly make different decisions, and a model that supports a large number of languages is going to struggle to allocate tokens between them.
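The byte-level fallback described above can be checked directly; a minimal sketch in Python:

```python
# UTF-8 encoding of 日本語 ("Japanese"): each character becomes 3 bytes,
# so a byte-level fallback turns one word into nine tokens.
word = "日本語"
raw = word.encode("utf-8")
print(raw.hex(" "))  # e6 97 a5 e6 9c ac e8 aa 9e

# Per-character breakdown: 日 ("sun"), 本 ("origin"), 語 ("language")
for ch in word:
    print(ch, ch.encode("utf-8").hex(" "))
```

Each character costs three tokens under a pure byte fallback, which is why CJK text is so much more expensive than English in byte-heavy vocabularies.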
Why is the supply of tokens limited? Are they currently represented as 16-bit unsigned ints (I hear the vocab size is about 50k for GPT-3)? If so, is there a performance penalty for going to u32 beyond the extra memory?
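The integer width of the token IDs is mostly beside the point: the real cost lives in the embedding table and the output projection, each of which holds one d_model-sized row per vocabulary entry, and in the final softmax, which must score every vocabulary entry for every generated token. A rough sketch with illustrative numbers (not GPT-3's actual configuration):

```python
def token_layer_params(vocab_size: int, d_model: int) -> int:
    # Input embedding and output (unembedding) projection each hold
    # one d_model-sized row per vocabulary entry.
    return 2 * vocab_size * d_model

# Doubling the vocab doubles these parameter counts, independent of
# whether IDs are stored as u16 or u32.
small = token_layer_params(50_000, 12_288)   # ~1.2B params
large = token_layer_params(200_000, 12_288)  # ~4.9B params
print(small, large, large / small)
```

So the budget being allocated between languages is rows in these matrices, not bits in the ID type.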
I would actually be less worried about a sequence of raw bytes than about the tokens generated by BPE. If "Japan" is 01 and "language" is 02, then "Japanese" will probably be 03, which has no connection at all to 01 or 02. But raw, verbose encodings slow down convergence early in training. (Well, at least in English they do.)
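That point can be made concrete with a toy vocabulary (the IDs here are hypothetical, for illustration): merged BPE tokens get fresh, opaque IDs, so any relationship between "Japan", "language", and "Japanese" has to be learned in the embedding space, while a byte-level encoding is verbose but keeps shared substrings visible as shared byte subsequences.

```python
# Toy BPE-style vocabulary: IDs are assigned in merge order and carry
# no compositional structure (hypothetical IDs for illustration).
vocab = {"Japan": 1, "language": 2, "Japanese": 3}

ids = [vocab[w] for w in ["Japan", "language", "Japanese"]]
print(ids)  # [1, 2, 3] -- nothing about 3 says it relates to 1 or 2

# By contrast, a byte-level encoding is verbose but transparent:
print(list("Japanese".encode("utf-8"))[:5])  # first five bytes
print(list("Japan".encode("utf-8")))         # identical prefix
```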
PaulHoule|2 years ago
akiselev|2 years ago
sp332|2 years ago