top | item 41850010

goodside | 1 year ago

(I’m the person interviewed in the article.) The trick is that Unicode code points are only assigned individual tokens if they’re used nontrivially outside of some other already-tokenized sequence, and Unicode tag-block code points only ever appear inside flag emoji. Unused or rarely used code points instead get a fallback encoding that just spells out the numerical code point value in two special tokens. Because the tag block is, by design, a copy of the first 128 ASCII characters, the second token of the tokenized output corresponds directly to the ASCII value of the character.
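A minimal sketch of the mapping being described, using only the standard Unicode layout (not any particular tokenizer): the tag block at U+E0000–U+E007F mirrors ASCII, so a fixed offset converts between visible text and the invisible tag characters.

```python
# The Unicode tag block (U+E0000-U+E007F) mirrors the 128 ASCII
# characters at a fixed offset, so the conversion is pure arithmetic.
TAG_OFFSET = 0xE0000

def to_tags(text: str) -> str:
    """Map ASCII characters to the corresponding invisible tag characters."""
    return "".join(chr(TAG_OFFSET + ord(c)) for c in text if ord(c) < 0x80)

def from_tags(tagged: str) -> str:
    """Recover ASCII from tag-block characters; pass anything else through."""
    return "".join(
        chr(ord(c) - TAG_OFFSET) if TAG_OFFSET <= ord(c) <= TAG_OFFSET + 0x7F else c
        for c in tagged
    )

hidden = to_tags("hello")        # renders as nothing in most UIs
print(from_tags(hidden))          # round-trips back to "hello"
```

The `to_tags`/`from_tags` names are mine for illustration; the point is just that the character-level mapping is a constant shift, which is what makes the second fallback token line up with the ASCII value.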


xg15 | 1 year ago

Ah, so the model "sees" the tags as literal ASCII characters interspersed with special tokens? That would make more sense.

goodside | 1 year ago

More or less; they’re not literally the same tokens as “a”, “b”, “c”, but I’d speculate the mapping is learned from other examples of ASCII (or just Roman letters) being repeated in obscure parts of Unicode — Gothic glyphs, bubble letters, etc. Once the model has seen enough ASCII represented as Unicode code points whose tokenizations alternate between meaningless and meaningful (e.g. “~l~i~k~e~ ~t~h~i~s”), it learns how to read it regardless of what the “~” is.
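The alternating structure can be seen directly in the UTF-8 bytes, which is what a byte-level fallback tokenizer would consume. This is illustrative only — the exact token boundaries depend on the tokenizer — but every tag character shares the same leading bytes, while the low bits of the final byte track the ASCII value, i.e. the constant “~” part and the varying letter part:

```python
# UTF-8 encoding of tag-block characters: the prefix bytes are constant
# across the block, while the trailing byte varies with the ASCII value.
for c in "abc":
    tag = chr(0xE0000 + ord(c))
    raw = tag.encode("utf-8")                 # always 4 bytes, f3 a0 ...
    print(c, raw.hex(" "), hex(raw[-1] & 0x3F))  # low 6 bits match ASCII
```

So from the model’s side this looks exactly like the “~l~i~k~e~” pattern: a repeated meaningless prefix interleaved with a byte that carries the letter.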