maxgraey | 4 years ago

So basically, even in UTF-8 you can create a malformed string. For example, ðŸŒ (the three bytes F0 9F 8C, a four-byte sequence cut short). It is missing one byte, which may cause problems in editors and text viewers that don't handle or pre-verify such cases. Valid UTF-8 has a specific binary format. A single-byte UTF-8 character always has the form '0xxxxxxx', where 'x' is any binary digit. A two-byte UTF-8 character always has the form '110xxxxx 10xxxxxx'. Similarly, three- and four-byte UTF-8 characters start with '1110xxxx' and '11110xxx' respectively, followed by continuation bytes of the form '10xxxxxx', one fewer than the total number of bytes.
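To make the bit patterns concrete, here is a minimal Python sketch of a checker for them. The names `expected_length` and `follows_utf8_bit_pattern` are made up for illustration, and the completed byte string `F0 9F 8C 8D` is just one possible way to finish the truncated sequence. It only checks the lead and continuation byte shapes described above; a full validator would also have to reject overlong encodings, surrogate code points, and values above U+10FFFF.

```python
def expected_length(lead: int) -> int:
    """Return the sequence length implied by a lead byte, or 0 if it cannot start a sequence."""
    if lead >> 7 == 0b0:        # 0xxxxxxx
        return 1
    if lead >> 5 == 0b110:      # 110xxxxx
        return 2
    if lead >> 4 == 0b1110:     # 1110xxxx
        return 3
    if lead >> 3 == 0b11110:    # 11110xxx
        return 4
    return 0                    # 10xxxxxx (stray continuation) or 11111xxx


def follows_utf8_bit_pattern(data: bytes) -> bool:
    """Check lead/continuation byte shapes only (not a complete UTF-8 validator)."""
    i = 0
    while i < len(data):
        n = expected_length(data[i])
        if n == 0:
            return False
        tail = data[i + 1:i + n]
        # A truncated sequence has fewer continuation bytes than the lead byte promised.
        if len(tail) != n - 1:
            return False
        # Every continuation byte must look like 10xxxxxx.
        if any(b >> 6 != 0b10 for b in tail):
            return False
        i += n
    return True


truncated = b"\xf0\x9f\x8c"       # lead byte promises 4 bytes, only 3 present
complete = b"\xf0\x9f\x8c\x8d"    # hypothetical completion of the same sequence
```

With these definitions, `follows_utf8_bit_pattern(truncated)` is false and `follows_utf8_bit_pattern(complete)` is true. Python's own strict decoder agrees: `truncated.decode("utf-8")` raises `UnicodeDecodeError`.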

So it's not just UTF-16 that has these issues and can cause security problems. I just wanted to emphasize that.

lokedhs|4 years ago

The point is that UTF-16 is the worst of both worlds. It's not ASCII-compatible like UTF-8, but it still has the disadvantage of being a variable-length encoding.
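A quick Python sketch of both halves of that claim, using only the standard codecs. The helper name `utf16_units` and the sample characters are chosen for illustration only.

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit code units needed to encode text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


ascii_a = "A"
globe = "\U0001F30D"  # a code point outside the Basic Multilingual Plane

# Not ASCII-compatible: UTF-8 keeps the ASCII byte, UTF-16 pads it with a zero byte.
a_utf8 = ascii_a.encode("utf-8")        # b"A"
a_utf16 = ascii_a.encode("utf-16-le")   # b"A\x00"

# Still variable length: one code unit for "A", two (a surrogate pair) for the globe.
a_units = utf16_units(ascii_a)          # 1
globe_units = utf16_units(globe)        # 2
globe_utf16 = globe.encode("utf-16-le") # D83C DF0D, stored little-endian
```

So code that assumes one 16-bit unit per character breaks on the second string, and code that expects ASCII bytes breaks on the first.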

Every problem that UTF-8 has, UTF-16 shares. It also shares every problem that UTF-32 has.