top | item 42016981


rmrfchik | 1 year ago

These are Unicode characters, not UTF-8 characters.


If you look at the list, it's primarily (but not completely) about oddities in the characters' UTF-8 encodings. Most of them appear to sit on a boundary where changing the case adds bytes. That's not really Unicode's concern.

There are also some that appear to change from single characters to grapheme clusters, which would be a Unicode quirk.
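
As a rough illustration of that second kind of change, here is a minimal Python sketch. It assumes Python's built-in `str.upper` and `str.lower`, which apply Unicode's full case mappings; the helper name `code_points` is made up for this example and the two characters are ones I picked, not necessarily ones from the list.

```python
def code_points(s: str) -> list[str]:
    """Return each code point in s as a 'U+XXXX' label (hypothetical helper)."""
    return [f"U+{ord(c):04X}" for c in s]


# Each input is a single code point; the case-changed result has two.
sharp_s = "\u00df"   # LATIN SMALL LETTER SHARP S
dotted_i = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE

upper_sharp_s = sharp_s.upper()    # two separate letters: "SS"
lower_dotted_i = dotted_i.lower()  # "i" followed by COMBINING DOT ABOVE

print(code_points(sharp_s), "->", code_points(upper_sharp_s))
print(code_points(dotted_i), "->", code_points(lower_dotted_i))
```

The second case is the one closest to "single character to grapheme cluster": the result is a base letter plus a combining mark, which still displays as one user-perceived character.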

Rendello|1 year ago

In another comment I said that a more accurate title would have been "Unicode codepoints that expand or contract when case is changed in UTF-8", which I think covers it well.
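
To make "expand or contract" concrete, here is a small Python sketch that compares UTF-8 byte lengths before and after a case change. The function names `utf8_len` and `case_change_sizes` are hypothetical, and the sample characters are my own picks rather than entries confirmed to be on the list.

```python
def utf8_len(s: str) -> int:
    """Number of bytes s occupies when encoded as UTF-8."""
    return len(s.encode("utf-8"))


def case_change_sizes(s: str, change) -> tuple[int, int]:
    """Return (bytes before, bytes after) for a case-changing function."""
    return utf8_len(s), utf8_len(change(s))


samples = [
    ("\u212a", str.lower),  # KELVIN SIGN lowercases to ASCII "k"
    ("\u017f", str.upper),  # LATIN SMALL LETTER LONG S uppercases to ASCII "S"
    ("\u023a", str.lower),  # LATIN CAPITAL LETTER A WITH STROKE lowercases to U+2C65
]

for ch, change in samples:
    before, after = case_change_sizes(ch, change)
    print(f"U+{ord(ch):04X}: {before} -> {after} bytes")
```

The first two contract and the third expands, because its lowercase form lives above U+07FF, where UTF-8 needs three bytes instead of two.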

Aardwolf|1 year ago

The byte changes listed are for the UTF-8 encoding, though, so it is about UTF-8 in that sense.

Retr0id|1 year ago

It's both.

zahlman|1 year ago

UTF-8 is simply an encoding, so "UTF-8 characters" is not correct usage. It is like saying "binary number": a number has the same value regardless of the base you write it in, and the base is a scheme for representing the number, not a system for defining what a number is. This is a common imprecision in language, and I have seen it cause serious difficulties in learning the concepts properly.
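
A quick Python sketch of the distinction, under the assumption that the standard codec names `utf-8`, `utf-16-le`, and `utf-32-le` are available (they are in CPython). The sample character is arbitrary.

```python
ch = "\u00e9"  # one character, code point U+00E9

# The same character under three encodings: the bytes differ,
# but the character they represent does not.
encodings = ["utf-8", "utf-16-le", "utf-32-le"]
encoded = {name: ch.encode(name) for name in encodings}

for name, raw in encoded.items():
    round_trips = raw.decode(name) == ch
    print(f"{name:10} {raw.hex(' '):12} decodes back to same character: {round_trips}")

# The number analogy: two notations, one value.
same_value = int("101010", 2) == int("2a", 16)
print("binary and hex spellings agree:", same_value)
```

The code point is the identity; each encoding is just one way of writing it down, the same way base 2 and base 16 are two ways of writing down one number.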