top | item 46597604

(no title)

shawnz | 1 month ago

Regarding the first part: that's an interesting idea, although I worry it would bias the outputs in an unrealistic way. Then again, maybe it would only impact scenarios that would have otherwise been unparsable anyway?

Regarding the second part: you'd effectively just be limiting yourself to single character tokens in that case which would drastically impact the LLM's output quality

discuss

akoboldfrying|1 month ago

The first approach would only affect outputs that would have been otherwise unparseable.

The second approach works with any subset of tokens that form a prefix code -- you effectively set the probability of all tokens outside this subset to zero (and rescale the remaining probabilities if necessary). In practice you would want to choose a large subset, which means you almost certainly want to avoid choosing any single-character tokens, since they can't coexist with tokens beginning with that character. (Choosing a largest-possible such subset sounds like an interesting subproblem to me.)

shawnz|1 month ago

I don't think I see the vision here. If you want to maximize the number of tokens representable as a prefix code while still being able to output any sequence of characters, how could you possibly pick anything other than the one-character-long tokens?

Are you saying you'd intentionally make some output sequences impossible on the basis they're not likely enough to be worth violating the prefix code for? Surely there's enough common short words like "a", "the", etc that that would be impractical?

And even excluding the cases that are trivially impossible due to having short words as a prefix, surely even the longer words share prefixes commonly enough that you'd never get tokens longer than, say, two characters in the best case? Like, so many words start with "st" or "wh" or "re" or whatever, how could you possibly have a prefix code that captures all of them, or even the most common ones, without it being uselessly short?