

karteum | 7 months ago

> Usually, what you want is either the byte or the grapheme cluster.

Exactly! That's what I understood after reading this great post: https://tonsky.me/blog/unicode/

"Even in the widest encoding, UTF-32, [some grapheme] will still take three 4-byte units to encode. And it still needs to be treated as a single character. If the analogy helps, we can think of the Unicode itself (without any encodings) as being variable-length."

I tend to think it's the biggest design decision in Unicode (but maybe I just don't fully see the need and the use cases beyond emojis. Of course, I read the section saying it's used in actual languages, but the few examples described could each have been given a dedicated 32-bit code point...)
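
To make the quoted claim concrete, here is a minimal Python sketch. The emoji is my own pick for illustration (a zero-width-joiner sequence of three code points), not necessarily the grapheme the post used, and the variable names are made up:

```python
# Illustrative grapheme (an assumption, chosen for this sketch):
# MAN (U+1F468) + ZERO WIDTH JOINER (U+200D) + FACTORY (U+1F3ED).
# Renderers that support the sequence show it as one "factory worker" glyph.
worker = "\U0001F468\u200D\U0001F3ED"

# Python's len() on a str counts code points, not grapheme clusters.
code_points = len(worker)

# UTF-32 spends exactly 4 bytes per code point ("-le" avoids a BOM prefix).
utf32_bytes = len(worker.encode("utf-32-le"))

# UTF-8 is variable-length per code point: 4 + 3 + 4 bytes here.
utf8_bytes = len(worker.encode("utf-8"))

print(code_points, utf32_bytes, utf8_bytes)  # 3 12 11
```

So even with fixed-width code units, one user-perceived character still spans three units, which is the sense in which Unicode itself is variable-length.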


panpog | 7 months ago

Can you fit everything into 32 bits? I have no idea, but Hangul and Indic scripts seem like they might have a combinatorial explosion of infrequently used characters.
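
A rough back-of-the-envelope check for the modern Hangul part only, as a sketch (the constant names are mine; it says nothing about archaic Hangul or Indic conjuncts, where the count is much less clear-cut):

```python
# Modern Hangul syllables are built from 19 leading consonants,
# 21 vowels, and 27 optional trailing consonants (28 counting "none").
LEADS, VOWELS, TAILS = 19, 21, 28
modern_syllables = LEADS * VOWELS * TAILS

# Unicode's precomposed Hangul Syllables block, U+AC00..U+D7A3,
# has exactly that many code points.
block_size = 0xD7A3 - 0xAC00 + 1

# Unicode's whole code space is U+0000..U+10FFFF (17 planes of 65,536).
unicode_code_space = 0x110000

# What a raw 32-bit integer could address.
values_in_32_bits = 2 ** 32

print(modern_syllables, block_size, unicode_code_space, values_in_32_bits)
```

Modern Hangul alone is small (11,172 syllables against 1,114,112 code points in total), so the open question is really about the open-ended combinations, not this set.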

eviks | 7 months ago

But they don't have that explosion if you only encode the combinatoric primitives those characters are made of and then use composing rules?
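
Hangul is a handy illustration of "primitives plus composing rules", since Unicode encodes both the conjoining jamo and the precomposed syllables. A small Python sketch (the `compose` helper is hypothetical and only covers modern syllables; `unicodedata.normalize` is the standard-library call):

```python
import unicodedata

syllable = "\uD55C"  # HANGUL SYLLABLE HAN, one precomposed code point

# NFD splits the syllable into its conjoining jamo (the primitives).
jamo = unicodedata.normalize("NFD", syllable)
jamo_code_points = [f"U+{ord(ch):04X}" for ch in jamo]

# NFC applies the composing rules and gives the single code point back.
recomposed = unicodedata.normalize("NFC", jamo)


def compose(lead: int, vowel: int, tail: int = 0) -> str:
    """Hypothetical helper: arithmetic composition of a modern syllable.

    lead is 0..18, vowel is 0..20, tail is 0..27 (0 means no trailing
    consonant). Precomposed syllables start at U+AC00 and are laid out
    in (lead, vowel, tail) order.
    """
    return chr(0xAC00 + (lead * 21 + vowel) * 28 + tail)


print(jamo_code_points)             # ['U+1112', 'U+1161', 'U+11AB']
print(recomposed == syllable)       # True
print(compose(18, 0, 4) == syllable)  # True
```

The flip side is that the decomposed form is three code points for one perceived character, so software still has to work at the grapheme-cluster level.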