top | item 35730689

Silverback_VII | 2 years ago

And what about phonetics? Wouldn't it be easier for the system if it didn't have to figure that out by itself?


PaulHoule | 2 years ago

Exactly. But in practice you have to trade one thing off against another.

Before BPE, I bailed on a project because the sponsor insisted on using word vectors and I thought "Look, the most important words in our documents will be out-of-dictionary, and that's like playing chess down a queen, a rook and two pawns."

Once BPE and similar tokenizers came out, you could say the model has a chance when it confronts out-of-dictionary situations, which will always be important. This was critical to the success of transformers for text.
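The contrast can be sketched with toy vocabularies (these word lists and the greedy matcher are illustrative stand-ins, not a real BPE implementation): a word-level lookup loses an out-of-dictionary word entirely, while a subword tokenizer can still decompose it.

```python
# Toy vocabularies for illustration only -- not drawn from any real model.
word_vocab = {"the", "model", "chess"}
subword_vocab = {"trans", "form", "er", "s", "the", "model", "chess"}

def word_tokenize(w):
    # Whole-word lookup: an out-of-dictionary word collapses to <UNK>.
    return w if w in word_vocab else "<UNK>"

def subword_tokenize(w):
    # Greedy longest-prefix matching over subword pieces (a crude
    # stand-in for BPE's learned merges).
    pieces, i = [], 0
    while i < len(w):
        for j in range(len(w), i, -1):
            if w[i:j] in subword_vocab:
                pieces.append(w[i:j])
                i = j
                break
        else:
            return ["<UNK>"]
    return pieces

print(word_tokenize("transformers"))     # <UNK>
print(subword_tokenize("transformers"))  # ['trans', 'form', 'er', 's']
```

The subword model never sees an unrepresentable word as long as its pieces are in the vocabulary, which is the property that made BPE-style tokenizers workable for open-vocabulary text.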

On the other hand, there are many things wrong with tokenization for particular applications. If you want to handle Japanese text, you'd think a word like 日本語 ("Japanese") should be tokenized as a single word, or as 日本 + 語 ("Japan" + "language").

A multilingual model, however, may not even keep those characters as whole tokens: instead of 日 + 本 + 語 ("sun" + "origin" + "language") you might get the underlying UTF-8 bytes, e6 + 97 + a5 + e6 + 9c + ac + e8 + aa + 9e, which is just awful.
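Those bytes are easy to check: a byte-level fallback literally emits the UTF-8 encoding of each character, so a three-character word becomes nine tokens.

```python
# The UTF-8 bytes a byte-level fallback would produce for 日本語.
# Each character encodes to three bytes, so three characters -> nine tokens.
word = "日本語"
raw = word.encode("utf-8")
print(raw.hex(" "))  # e6 97 a5 e6 9c ac e8 aa 9e
print(len(raw))      # 9
```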

The trouble is that an English language model doesn't want to waste its limited supply of tokens on other languages, even though it should be able to handle a few foreign characters. A Japanese language model would clearly make different decisions, and a model that supports a large number of languages is going to struggle to allocate tokens between them.

akiselev | 2 years ago

Why is the supply of tokens limited? Are they currently represented as 16-bit unsigned ints (I hear the vocab size is about 50k for GPT-3)? If so, is there a performance penalty for going to u32 beyond the extra memory?
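A back-of-envelope sketch of why vocabulary size is the real cost, not the integer width of the ids: the input embedding and output softmax matrices each scale as vocab_size × d_model. (The 50,257 vocab and 12,288 hidden size are GPT-3's published figures; the other vocab sizes are hypothetical points for comparison.)

```python
# Each of the input-embedding and output-projection matrices holds
# vocab_size * d_model parameters; widening the token id type to u32
# changes none of this, only the per-id storage.
def embedding_params(vocab_size: int, d_model: int) -> int:
    return vocab_size * d_model

d_model = 12288  # GPT-3 175B hidden size
for vocab in (50_257, 100_000, 250_000):  # GPT-3's vocab, then hypothetical larger ones
    n = embedding_params(vocab, d_model)
    print(f"vocab {vocab:>7}: {n / 1e6:.0f}M params per matrix")
```

So the "supply" is limited by parameter budget and by how thinly training data is spread across tokens, not by the id datatype.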

sp332 | 2 years ago

I would actually be less worried about a sequence of raw bytes than the tokens generated by BPE. If "Japan" is 01 and "language" is 02, then "Japanese" will probably be 03, which has no connection at all to 01 or 02. But raw, verbose encoding slows down convergence at the beginning. (Well, at least in English it does.)
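The point about merged tokens can be made concrete with a toy vocabulary (the ids 01/02/03 come from the comment above; the greedy matcher is an illustrative stand-in for real BPE merges): the id for the merged word carries no structural trace of the ids for its parts.

```python
# Toy BPE-style vocabulary from the comment above: the merged token
# "Japanese" gets id 0x03, which encodes no relation to 0x01 ("Japan")
# or 0x02 ("language") -- any connection must be learned in the embeddings.
vocab = {"Japan": 0x01, "language": 0x02, "Japanese": 0x03}

def encode(text, vocab):
    # Greedy longest-match segmentation, a crude stand-in for BPE merges.
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(vocab[text[i:j]])
                i = j
                break
        else:
            i += 1  # skip characters with no matching token
    return tokens

print(encode("Japanese", vocab))      # [3]
print(encode("Japanlanguage", vocab))  # [1, 2]
```

With raw bytes the relation between "Japan" and "Japanese" is visible in the sequence itself (shared prefix bytes), which is the trade-off being weighed against slower convergence.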