(no title)
shawnz | 1 month ago
Regarding the second part: you'd effectively just be limiting yourself to single character tokens in that case which would drastically impact the LLM's output quality
akoboldfrying | 1 month ago
The second approach works with any subset of tokens that form a prefix code -- you effectively set the probability of all tokens outside this subset to zero (and rescale the remaining probabilities if necessary). In practice you would want to choose a large subset, which means you almost certainly want to avoid choosing any single-character tokens, since they can't coexist with tokens beginning with that character. (Choosing a largest-possible such subset sounds like an interesting subproblem to me.)
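To make that concrete, here's a rough sketch of what I mean -- a toy vocabulary and a greedy longest-first heuristic for picking the prefix-free subset, plus masking and renormalising the logits. The token names and the heuristic are just for illustration, not any particular tokenizer or library, and the greedy pick is not guaranteed to be the largest possible subset:

    import math

    def prefix_free_subset(vocab):
        """Greedily pick tokens so no chosen token is a prefix of another.

        Preferring longer tokens first is only a heuristic; finding a truly
        maximal subset is the harder subproblem mentioned above.
        """
        chosen = []
        for tok in sorted(vocab, key=len, reverse=True):
            # Reject tok if it is a prefix of, or extends, any already-chosen token.
            if any(t.startswith(tok) or tok.startswith(t) for t in chosen):
                continue
            chosen.append(tok)
        return chosen

    def renormalize(logits, vocab, allowed):
        """Mask tokens outside the allowed subset to -inf, then softmax the rest."""
        allowed = set(allowed)
        masked = [x if tok in allowed else -math.inf
                  for tok, x in zip(vocab, logits)]
        m = max(masked)
        exps = [math.exp(x - m) for x in masked]
        z = sum(exps)
        return [e / z for e in exps]

    # Toy example: note how "a", "an", "th", "re", "st" all get dropped because
    # they are prefixes of longer chosen tokens.
    vocab  = ["a", "an", "and", "th", "the", "re", "return", "st", "start"]
    logits = [2.0, 1.5, 3.0, 0.5, 4.0, 1.0, 2.5, 0.2, 1.8]

    subset = prefix_free_subset(vocab)   # ["return", "start", "and", "the"]
    probs  = renormalize(logits, vocab, subset)
    for tok, p in zip(vocab, probs):
        print(f"{tok!r:10} {p:.3f}")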
shawnz | 1 month ago
Are you saying you'd intentionally make some output sequences impossible on the basis that they're not likely enough to be worth violating the prefix code for? Surely there are enough common short words like "a", "the", etc. that this would be impractical?
And even excluding the cases that are trivially impossible due to having short words as a prefix, surely even the longer words share prefixes commonly enough that you'd never get tokens longer than, say, two characters in the best case? Like, so many words start with "st" or "wh" or "re" or whatever -- how could you possibly have a prefix code that captures all of them, or even the most common ones, without it being uselessly short?