pimanrules | 2 years ago
That's not how the tokenizer works. A novel word like "platoggle" is decomposed into three separate tokens, "pl", "at", and "oggle". You can see for yourself how prompts are tokenized: https://platform.openai.com/tokenizer
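You can also reproduce this programmatically with OpenAI's tiktoken library instead of the web page. A minimal sketch; the encoding name below is an assumption, and the exact split varies by encoding:

    # Decompose a novel word into its sub-word tokens with tiktoken.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding; others split differently
    token_ids = enc.encode("platoggle")
    pieces = [enc.decode([t]) for t in token_ids]
    print(pieces)  # something like ['pl', 'at', 'oggle'], depending on the encoding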
potatoinexhaust | 2 years ago
So notice it doesn't contain any vowels, since almost all consonant-vowel pairs are sufficiently frequent in the training text as to be tokenised at least as a pair. E.g. "guq" is tokenised as "gu" + "q", since "gu" is common enough.
(Compare "gun", which is tokenised as the single token "gun": it's common enough in the training set as a word on its own, so the tokeniser doesn't need to split it into "gu"+"n".)
The only exceptions I found, where a consonant-vowel pair is tokenised as two separate tokens, were ones like "qe", tokenised as "q" + "e", or "qo" as "q" + "o". Which I guess makes sense, given these will be low-frequency pairings in the training text -- compare "qu", which is tokenised as the single token "qu".
(Though I didn't test all consonant-vowel pairs, so there may be more).
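If you want to check the remaining pairs exhaustively rather than by hand, a short tiktoken loop works. Again, the encoding choice is an assumption (the web tokenizer has pointed at different encodings over time):

    # List the consonant-vowel pairs that do NOT get a single token.
    import tiktoken

    enc = tiktoken.get_encoding("r50k_base")  # assumed GPT-3-era encoding
    consonants = "bcdfghjklmnpqrstvwxyz"
    vowels = "aeiou"
    for c in consonants:
        for v in vowels:
            ids = enc.encode(c + v)
            if len(ids) > 1:  # split into more than one token, like "qe" -> "q"+"e"
                print(c + v, "->", [enc.decode([t]) for t in ids])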
ben_w | 2 years ago
Whether that's an inference-time performance issue, a training-time performance issue, a model-size issue, or just total nonsense, I wouldn't know.