top | item 33818916


zasdffaa | 3 years ago

Do you (or anyone) have any idea why anyone could possibly have thought 16 bits would be enough? Many decisions are bad in hindsight, but surely no hindsight was needed for that.


mananaysiempre | 3 years ago

Nope. But on reflection, I can’t really tell if it was really that dumb of an idea.

If you look at the text following the “Yes” quote, you’ll find that “all characters” is carefully defined to mean “all characters in current use from commercially non-negligible scripts”. Compared to the current definition of “all characters we have reasonable evidence have ever been used for natural-language interchange”, it doesn’t sound as noble, but it would also exclude a number of large-repertoire sets (Tangut and pre-modern Han ideograms, Yi syllables, hieroglyphs, cuneiform). Remove the requirement for 1:1 code-point mapping with legacy sets, and you could conceivably throw out precomposed Hangul as well. (Precomposed European scripts too, if you want, but that wouldn’t net you eleven thousand code points.)
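The precomposed-Hangul point rests on those syllables being fully algorithmic: each one is derived from its leading, vowel, and trailing jamo indices by the composition formula in the Unicode standard. A minimal sketch in Python (the function name is mine):

```python
# Hangul syllable composition per the Unicode standard: every one of the
# 19 * 21 * 28 = 11,172 precomposed syllables is computed from jamo indices.
S_BASE = 0xAC00                       # first precomposed syllable, '가'
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28

def compose(l: int, v: int, t: int = 0) -> str:
    """Map leading/vowel/trailing jamo indices to one precomposed syllable."""
    return chr(S_BASE + (l * V_COUNT + v) * T_COUNT + t)

print(compose(18, 0, 4))              # '한' (ㅎ=18, ㅏ=0, ㄴ=4)
print(L_COUNT * V_COUNT * T_COUNT)    # 11172
```

Since the mapping is closed-form, a repertoire without the legacy round-trip constraint could have encoded only the jamo and left composition to the renderer.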

At that point the question seems to come down to Han characters: the union of all government-mandated education standards (unified) would land well below ten thousand characters, but how well does that number correspond to the number of characters people actually need? One potential source of trouble is uncommon characters people really, really want (proper names), but overall, I don’t know; you’d probably need a CJKV expert to tell. To me, neither answer seems completely implausible.

On the other hand, it’s also unclear that a constant-width encoding would really be all that valuable. Most of the time, you are either traversing all code points in sequence or working with larger units such as combining-character sequences or graphemes, so, aside from buffer-truncation issues, constant width does not really help all that much. But that’s an observation that took more than a decade of Unicode implementations to crystallize.
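The traversal point can be made concrete: even in a constant-width code-point array, a user-perceived character may span several code points, so you end up walking sequences anyway. A small Python illustration (the crude segmenter below is mine; real grapheme segmentation is defined by UAX #29):

```python
import unicodedata

s = "e\u0301"  # 'e' + COMBINING ACUTE ACCENT: one grapheme, two code points
assert len(s) == 2    # constant-width indexing counts code points, not characters
assert s[:1] == "e"   # slicing by code point splits the combining sequence

# A crude combining-sequence walk (not full UAX #29 grapheme clustering):
def combining_sequences(text: str):
    seq = ""
    for ch in text:
        if seq and unicodedata.combining(ch) == 0:
            yield seq
            seq = ""
        seq += ch
    if seq:
        yield seq

print(list(combining_sequences("e\u0301x")))  # two units: 'e'+accent, then 'x'
```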

It is certainly annoying how large and sparse the lookup tables needed to implement a current version of Unicode are—enough that you need three levels in your radix tree and not two—but if you aren’t doing locales it’s still a few dozen kilobytes at most, not really a deal breaker these days. Perhaps that’s not too much of a cost for not marginalizing users of obscure languages and keeping digitized historical text representable in the common format.
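For scale on the table point, here is a toy three-level radix-tree lookup over the 21-bit code-point space, with identical pages shared; the 5/8/8 split and the names (`build_trie`, `lookup`) are mine, not any shipping implementation’s:

```python
def build_trie(prop: dict, default: int = 0):
    """Pack a sparse code-point -> value map into a 3-level shared-page trie."""
    pages = {}   # deduplicated 256-entry leaf pages
    mids = {}    # deduplicated 256-entry middle tables
    top = []
    for hi in range(0x11):                    # 17 planes: cp >> 16
        mid = []
        for m in range(0x100):                # (cp >> 8) & 0xFF
            leaf = tuple(prop.get((hi << 16) | (m << 8) | lo, default)
                         for lo in range(0x100))
            mid.append(pages.setdefault(leaf, leaf))
        mid = tuple(mid)
        top.append(mids.setdefault(mid, mid))
    return top

def lookup(trie, cp: int) -> int:
    # Three indexed loads, one per radix-tree level.
    return trie[cp >> 16][(cp >> 8) & 0xFF][cp & 0xFF]

trie = build_trie({0x0301: 1, 0x20AC: 2})     # toy "property": two marked points
print(lookup(trie, 0x0301), lookup(trie, 0x41))  # 1 0
```

With a mostly empty property, nearly every leaf collapses to the shared default page, which is why real tables stay in the tens of kilobytes despite covering over a million code points.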