It’s not just a matter of the tokenization being the same; it’s whether the model can understand text written in a very rarely seen encoding. Tokens normally represent whole words or word fragments, but here the text isn’t just broken into letters, it’s broken into bytes, with two full tokens dedicated to every character. Text encoded this way does appear in the wild (in flag emojis), but with very little diversity, since there it only spells out country codes. It’s unclear whether GPT-4 learned to read it by generalizing from country codes or through exposure to steganographic Unicode text on the web. Probably a combination of the two.
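
For anyone curious, here's a rough sketch of the encoding being discussed, assuming it's the invisible Unicode "tag characters" (the U+E0000 block) that flag emoji sequences use to spell out region codes. Each tag character needs 4 bytes in UTF-8, which is why a byte-level BPE fallback spends several tokens on every letter:

```python
# Sketch: hide ASCII text as invisible Unicode tag characters
# (U+E0000 block), the same mechanism flag emoji sequences use
# to spell out codes like "gbsct".
TAG_OFFSET = 0xE0000

def to_tags(text: str) -> str:
    # Shift each printable ASCII char into the tag-character block.
    return "".join(chr(TAG_OFFSET + ord(c)) for c in text)

def from_tags(tagged: str) -> str:
    # Shift back down to recover the original ASCII.
    return "".join(chr(ord(c) - TAG_OFFSET) for c in tagged)

secret = to_tags("hello")
# Each tag character is outside the BMP, so it takes 4 bytes in
# UTF-8 -- a byte-level tokenizer has to split every character.
assert len(secret.encode("utf-8")) == 4 * len("hello")
assert from_tags(secret) == "hello"
```

(Names like `to_tags` are mine, not from any library; exactly how many tokens each character costs depends on the tokenizer's byte-merge rules.)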