top | item 35242184

pimanrules | 2 years ago

>it couldn't possibly understand how to spell "platoggle" if it's treating it just as a single, never-before-seen, opaque token

That's not how the tokenizer works. A novel word like "platoggle" is decomposed into three separate tokens, "pl", "at", and "oggle". You can see for yourself how prompts are tokenized: https://platform.openai.com/tokenizer
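To illustrate the fallback behaviour, here's a toy sketch of subword splitting in Python. The vocabulary and the greedy longest-match strategy are both stand-ins (real BPE tokenizers apply learned merges in rank order over a vocabulary of tens of thousands of pieces), but it shows how a never-before-seen word still decomposes into familiar fragments rather than one opaque token:

```python
# Minimal sketch of subword fallback for a novel word.
# The vocab below is hypothetical; OpenAI's real tokenizer learns its
# merges from a large training corpus and applies them by merge rank,
# not by greedy longest-match as done here.
def subword_split(word, vocab):
    """Greedily split `word` into the longest pieces present in `vocab`,
    falling back to single characters (always available)."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate piece first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens

# Hypothetical subword vocabulary of common English fragments.
vocab = {"pl", "at", "oggle", "gu", "gun", "qu"}

print(subword_split("platoggle", vocab))  # -> ['pl', 'at', 'oggle']
```

The linked https://platform.openai.com/tokenizer page shows the real splits, which depend on the specific model's vocabulary.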

hn_throwaway_99|2 years ago

Ahh, thank you very much, definitely was missing that piece!

kgc|2 years ago

Why don’t they also have single letters as tokens?

potatoinexhaust|2 years ago

They do, e.g. "gvqbkpwz" is tokenised into individual characters. It was actually a bit tricky to construct that example, since I needed to find letter combinations that are very low probability in the tokeniser's training text (e.g. "gv").

Notice it doesn't contain any vowels, since almost all consonant-vowel pairs are sufficiently frequent in the training text to be tokenised at least as a pair. E.g. "guq" is tokenised as "gu" + "q", since "gu" is common enough.

(Compare "gun" which is just tokenised as a single token "gun", as it's common enough in the training set as a word on its own, so it doesn't need to tokenise it as "gu"+"n".)

The only consonant-vowel pairs I found that were not tokenised as a single token were ones like "qe", tokenised as "q" + "e", or "qo", tokenised as "q" + "o". Which I guess makes sense, given these will be low-frequency pairings in the training text; compare "qu", which is tokenised as the single token "qu".

(Though I didn't test all consonant-vowel pairs, so there may be more).
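The frequency intuition above can be sketched in a few lines of Python. The tiny corpus and the count threshold here are made up purely for illustration (the real tokenizer trains on vastly more text and uses ranked BPE merges, not a raw count cutoff), but it shows why a frequent pair like "gu" earns its own token while a rare one like "gv" falls back to single characters:

```python
from collections import Counter

# Toy corpus; "gu" and "qu" appear often, "gv" never does.
corpus = "the gun guard argued gusto quest quick aqua"

# Count how often each adjacent character pair occurs.
pair_counts = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))

def tokenise(word, min_count=2):
    """Keep an adjacent character pair as one token only when it is
    frequent enough in the corpus; otherwise emit single characters."""
    tokens, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair_counts[pair] >= min_count:
            tokens.append(pair)
            i += 2
        else:
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenise("guq"))  # "gu" is frequent  -> ['gu', 'q']
print(tokenise("gvq"))  # "gv" never occurs -> ['g', 'v', 'q']
```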

ben_w|2 years ago

My wild guess is that if it could get things done by tokenising like that all the time, they wouldn't need to also have word-like tokens.

Whether that's an inference-time performance issue, a training-time performance issue, a model-size issue, or just total nonsense, I wouldn't know.