panpog | 7 months ago
It’s also not clear to me that the code point is a good abstraction in the design of UTF-8. Usually, what you want is either the byte or the grapheme cluster.
karteum|7 months ago
Exactly! That's what I understood after reading this great post: https://tonsky.me/blog/unicode/
"Even in the widest encoding, UTF-32, [some grapheme] will still take three 4-byte units to encode. And it still needs to be treated as a single character. If the analogy helps, we can think of the Unicode itself (without any encodings) as being variable-length."
I tend to think this variable-length nature is the biggest design decision in Unicode (but maybe I just don't fully see the need and the use cases beyond emoji. Of course, I read the section saying it's used in actual languages, but the few examples described could have been handled with a dedicated 32-bit code point...)
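To make the quoted point concrete, here is a small Python sketch that counts code points and code units for one user-perceived character. The sequence is my own illustrative pick (MAN + ZERO WIDTH JOINER + FACTORY), not necessarily the one the post uses, and `unit_counts` is a hypothetical helper name. It only counts units; it does not do grapheme segmentation, which needs the UAX #29 rules.

```python
# One user-perceived character built from three code points (illustrative choice):
# MAN (U+1F468) + ZERO WIDTH JOINER (U+200D) + FACTORY (U+1F3ED)
worker = "\U0001F468\u200D\U0001F3ED"


def unit_counts(s: str) -> dict:
    """Count code points and per-encoding code units for s (hypothetical helper)."""
    return {
        "code_points": len(s),                            # Python str is indexed by code point
        "utf8_bytes": len(s.encode("utf-8")),             # 1-byte units
        "utf16_units": len(s.encode("utf-16-le")) // 2,   # 2-byte units, no BOM
        "utf32_units": len(s.encode("utf-32-le")) // 4,   # 4-byte units, no BOM
    }


counts = unit_counts(worker)
print(counts)
# {'code_points': 3, 'utf8_bytes': 11, 'utf16_units': 5, 'utf32_units': 3}
```

Even in UTF-32 the count is 3, not 1, so no fixed-width encoding of code points gives you "one unit per character" for a sequence like this.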
duskwuff|7 months ago
For example: in English, U+0049 and U+0069 ("I" and "i") are considered an uppercase/lowercase pair. In the Turkish locale, they belong to two separate letters, each with its own uppercase and lowercase form: U+0049/U+0131 ("I" / "ı") and U+0130/U+0069 ("İ" / "i").
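A rough sketch of what this means in code, using Python, whose built-in `str.upper()` and `str.lower()` apply the default (locale-independent) Unicode mappings. The `turkish_upper` and `turkish_lower` helpers are hypothetical names of mine and handle only these four letters; a real application would more likely rely on a locale-aware library such as ICU.

```python
def turkish_upper(s: str) -> str:
    """Uppercase with Turkish dotted/dotless i rules (hypothetical helper)."""
    # i -> İ (U+0130) and ı (U+0131) -> I, then default rules for everything else
    return s.replace("i", "\u0130").replace("\u0131", "I").upper()


def turkish_lower(s: str) -> str:
    """Lowercase with Turkish dotted/dotless i rules (hypothetical helper)."""
    # I -> ı (U+0131) and İ (U+0130) -> i, then default rules for everything else
    return s.replace("I", "\u0131").replace("\u0130", "i").lower()


# Default mappings pair I with i, as in English.
default_upper = "i".upper()       # "I" (U+0049)
default_lower = "I".lower()       # "i" (U+0069)

# Turkish-style mappings keep the dotted and dotless letters apart.
tr_upper = turkish_upper("i")     # "İ" (U+0130)
tr_lower = turkish_lower("I")     # "ı" (U+0131)
```

So the same two code points need different case mappings depending on the locale, which is information the code point alone does not carry.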