(I’m the person interviewed in the article.) The trick is Unicode code points are only assigned individual tokens if they’re nontrivially used outside of some other already tokenized sequence, and Unicode tag block code points are only ever used in flag emojis. Unused or rarely used Unicode code points are given a fallback encoding that just encodes the numerical code point value in two special tokens. Because the Unicode tag block is by design the first 128 chars in ASCII repeated, the second token of the tokenized output directly corresponds to the ASCII value of the character.
xg15|1 year ago
goodside|1 year ago