top | item 39005208

goodside | 2 years ago

I can’t imagine it was intentionally added as a feature. It doesn’t work in GPT-3.5 — it seems GPT-4 is unexpectedly smart enough to parse the invisible portion (and confuse it for user instruction) whereas in any other context it’s just steganography that would need to be decoded explicitly.
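A minimal sketch of the encoding being discussed, assuming the hidden text uses Unicode tag-block characters (U+E0000–U+E007F), which mirror printable ASCII but render as invisible in most contexts:

```python
# Sketch: map printable ASCII to Unicode tag-block code points
# (U+E0000 + ord(c)). These render as invisible in most UIs, so the
# "hidden" instruction rides along inside otherwise normal-looking text.
def to_tags(text: str) -> str:
    return "".join(chr(0xE0000 + ord(c)) for c in text)

def from_tags(s: str) -> str:
    # Recover only characters that fall inside the tag block.
    return "".join(
        chr(ord(c) - 0xE0000) for c in s if 0xE0000 < ord(c) <= 0xE007F
    )

hidden = to_tags("Hello")
assert hidden != "Hello"          # visually empty in most renderers
assert from_tags(hidden) == "Hello"
```

The point of the parent comment is that GPT-4 appears to decode this mapping on its own, without the explicit `from_tags` step any other consumer would need.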


dietr1ch | 2 years ago

I'd guess that the tokenizer is just different and handles this in a "better" way.

goodside | 2 years ago

No, in both tokenizers Unicode tag-block code points like these are converted into bytes (two tokens per character), which is a fallback for code points uncommon enough to not warrant a dedicated token.
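The byte fallback can be illustrated with plain Python, assuming a byte-level BPE tokenizer of the kind described above: a tag-block code point has no dedicated token, so the tokenizer falls back to its raw UTF-8 bytes.

```python
# A tag-block code point such as U+E0048 (the tag counterpart of "H") is
# rare enough that a byte-fallback BPE tokenizer represents it via its
# raw UTF-8 bytes rather than a dedicated token.
ch = chr(0xE0048)
raw = ch.encode("utf-8")
print(len(raw), list(raw))  # 4 bytes: [243, 160, 129, 136]
```

So one invisible character costs four bytes of input, which the tokenizer then groups into a small number of byte-level tokens (two per character, per the comment above).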

rahimnathwani | 2 years ago

How and why would the tokenizer learn that a particular Unicode tag was equivalent to a particular letter? I can't imagine there's a lot of text on the internet encoded this way.

kevingadd | 2 years ago

Maybe it saw them used in their intended way (for flags, etc.) and was able to make the association between the flags and their country codes, and that led to it being able to interpret them as individual letters?

Could also be from having been trained on Unicode character tables, which contain English descriptions of each code point.