top | item 44484253

panpog | 7 months ago

It seems plausible that this could be made efficiently doable byte-wise. For example, C3 xx could be made to uppercase to C4 xx. Unicode actually does structure its codespace to make certain properties easier to compute, but those properties are mostly related to legacy encodings, and things are designed with UCS-2 or UTF-32 in mind, not UTF-8.

It’s also not clear to me that the code point is a good abstraction in the design of UTF-8. Usually, what you want is either the byte or the grapheme cluster.
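To make the byte-wise idea concrete, here is a rough sketch that compares the UTF-8 bytes of a few letters with the bytes of their uppercase forms, using Python's built-in (untailored) `str.upper()`. The helper name `utf8_case_bytes` is made up for illustration. In the Latin-1 Supplement range the lead byte stays C3 and only the continuation byte changes, but that regularity does not hold in general.

```python
def utf8_case_bytes(ch: str) -> tuple[str, str]:
    """Return hex dumps of ch and ch.upper(), both encoded as UTF-8.

    Hypothetical helper, only meant to make the byte patterns visible.
    """
    return ch.encode("utf-8").hex(" "), ch.upper().encode("utf-8").hex(" ")


for ch in ["a", "\u00e9", "\u00ff", "\u00df"]:  # a, e-acute, y-diaeresis, sharp s
    lower_hex, upper_hex = utf8_case_bytes(ch)
    print(f"U+{ord(ch):04X}: {lower_hex} -> {upper_hex}")

# U+0061: 61 -> 41        (ASCII: one byte, differs by 0x20)
# U+00E9: c3 a9 -> c3 89  (same lead byte, continuation differs by 0x20)
# U+00FF: c3 bf -> c5 b8  (uppercase is U+0178, so the lead byte changes)
# U+00DF: c3 9f -> 53 53  (uppercases to "SS", two ASCII bytes)
```

So some of the structure is already there for the oldest blocks, but a general byte-wise rule would have needed the codespace to be laid out differently from the start.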

karteum|7 months ago

> Usually, what you want is either the byte or the grapheme cluster.

Exactly! That's what I understood after reading this great post: https://tonsky.me/blog/unicode/

"Even in the widest encoding, UTF-32, [some grapheme] will still take three 4-byte units to encode. And it still needs to be treated as a single character. If the analogy helps, we can think of the Unicode itself (without any encodings) as being variable-length."

I tend to think it's the biggest design decision in Unicode (but maybe I just don't fully see the need and use cases beyond emojis. Of course I read the section saying it's used in actual languages, but the few examples described could have been made with a dedicated 32-bit code point...)
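The "variable-length even in UTF-32" point from the quote is easy to check. Below is a minimal sketch that counts code points and encoded bytes for two strings that each render as a single user-perceived character. The helper `sizes` and the sample strings are chosen for illustration only, and counting grapheme clusters themselves is left out because it needs segmentation rules that the Python standard library does not provide.

```python
import unicodedata


def sizes(text: str) -> dict[str, int]:
    """Count code points and encoded bytes for text (illustrative helper)."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf32_bytes": len(text.encode("utf-32-le")),  # -le avoids a BOM
    }


# "e" followed by COMBINING ACUTE ACCENT: one visible character, two code points.
e_acute = "e\u0301"
print(sizes(e_acute))
# {'code_points': 2, 'utf8_bytes': 3, 'utf32_bytes': 8}

# NFC normalization folds this particular pair into the precomposed U+00E9.
print(sizes(unicodedata.normalize("NFC", e_acute)))
# {'code_points': 1, 'utf8_bytes': 2, 'utf32_bytes': 4}

# An emoji ZWJ sequence (MAN + ZERO WIDTH JOINER + FACTORY) has no precomposed form.
worker = "\U0001F468\u200D\U0001F3ED"
print(sizes(worker))
# {'code_points': 3, 'utf8_bytes': 11, 'utf32_bytes': 12}
```

The last case is the interesting one: normalization cannot shrink it, so even a fixed-width encoding of code points is variable-length at the level of what the user sees.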

panpog|7 months ago

Can you fit everything into 32 bits? I have no idea, but Hangul and Indic scripts seem like they might have a combinatoric explosion of infrequently used characters.
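For modern Hangul, at least, the count is bounded and already encoded: 19 leading consonants, 21 vowels, and 28 trailing options (including "none") give 11,172 precomposed syllables, laid out arithmetically from U+AC00. The sketch below follows the composition formula from the Unicode standard; the function name `compose_hangul` is made up here, and it deliberately ignores archaic jamo, which is where the combinations would really multiply. It says nothing about Indic conjuncts.

```python
import unicodedata

S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28  # T_COUNT includes "no trailing consonant"


def compose_hangul(lead: str, vowel: str, trail: str = "") -> str:
    """Compose modern conjoining jamo into one precomposed syllable."""
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    t_index = ord(trail) - T_BASE if trail else 0
    if not (0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT):
        raise ValueError("lead or vowel is not a modern conjoining jamo")
    if trail and not (1 <= t_index < T_COUNT):
        raise ValueError("trail is not a modern conjoining jamo")
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


print(L_COUNT * V_COUNT * T_COUNT)  # 11172
han = compose_hangul("\u1112", "\u1161", "\u11ab")
print(f"U+{ord(han):04X}")  # U+D55C
# The arithmetic agrees with the standard library's NFC normalization.
print(han == unicodedata.normalize("NFC", "\u1112\u1161\u11ab"))  # True
```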

duskwuff|7 months ago

Character case is a locale-dependent mess; trying to represent it in the values of code points (which need to be universal) is a terrible idea.

For example: in English, U+0049 and U+0069 ("I" and "i") are considered an uppercase/lowercase pair. In the Turkish locale, these are considered two separate characters with their own uppercase and lowercase versions: U+0049/U+0131 ("I" / "ı") and U+0130/U+0069 ("İ" / "i").
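As a rough illustration of the Turkish case, Python's `str.upper()` and `str.lower()` apply only the untailored mappings, so the dotted/dotless distinction has to be handled separately. The `turkish_upper` and `turkish_lower` helpers below are hypothetical and simplified: they remap the affected letters first and then fall back to the default mapping, and they do not handle a capital I followed by a combining dot above.

```python
# Untailored behaviour of Python's built-in case mappings:
print("i".upper() == "I")                  # True: dotted i loses its dot
print("\u0131".upper() == "I")             # True: dotless i also maps to I
print("\u0130".lower() == "i\u0307")       # True: I-with-dot becomes i + combining dot

# Hypothetical Turkish tailoring layered on top of the default mappings.
_TR_UPPER = str.maketrans({"i": "\u0130"})              # i -> I-with-dot
_TR_LOWER = str.maketrans({"I": "\u0131", "\u0130": "i"})  # I -> dotless i, I-with-dot -> i


def turkish_upper(text: str) -> str:
    """Uppercase text with Turkish dotted/dotless i handling (simplified)."""
    return text.translate(_TR_UPPER).upper()


def turkish_lower(text: str) -> str:
    """Lowercase text with Turkish dotted/dotless i handling (simplified)."""
    return text.translate(_TR_LOWER).lower()


print(turkish_upper("istanbul") == "\u0130STANBUL")  # True
print(turkish_lower("ISPARTA") == "\u0131sparta")    # True
```

Whichever side of the argument you take, the tailoring here is a small override table applied before the default mapping, and the same locale-independent `upper()`/`lower()` does the rest.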

panpog|7 months ago

Of course you sometimes need tailoring to a particular language. On the other hand, I don't see how encoding untailored casing would make tailored casing harder.