
I � Unicode [pdf]

98 points | beefburger | 11 years ago | seriot.ch | reply

43 comments

[+] jrochkind1|11 years ago|reply
Unicode is just about the most technically successful standard I have ever seen, it's pretty amazing.

The weird and complicated parts are all a result of the weirdness and complexity of the domain -- the universe of human written language. All the solutions are amazingly elegant for the domain they are in -- including solutions to legacy backwards compatibility where possible, which have made unicode as successful at catching on as it has been. The decisions on compromises between practical legacy compatibility and pure elegance were _just right_.

The only major mis-step was the "UCS-2" mistake, before they realized more bytes really were going to be needed, sadly now stuck in Java and making proper unicode support in Java way harder than it should be.
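As a quick illustration (a minimal sketch in Python, which exposes both views): any code point outside the Basic Multilingual Plane takes two UTF-16 code units, which is exactly what a UCS-2-era "length in chars" gets wrong.

    # U+1D11E (musical G clef) lies outside the BMP
    s = "\U0001D11E"
    print(len(s))                           # 1 code point
    print(len(s.encode("utf-16-le")) // 2)  # 2 UTF-16 code units (a surrogate pair)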

But in general, if only all our standards dealing with very complex problems could be as elegantly designed and executed as unicode.

[+] thristian|11 years ago|reply
Something that often gets missed out of the Unicode story was that originally there were two groups. The first was the Unicode consortium, who wanted to combine all the world's existing character encodings, and had picked 16-bit units as a comfortable representation, which would have been more than enough for their stated goal.

When Unicode 1.0 came out, there were a bunch of people forming the ISO 10646 committee to produce a character encoding that would cover every human-written character ever, even the ones that weren't already part of an existing encoding, but 16 bits would definitely not be sufficient for that. On the other hand, creating two entirely separate standards wouldn't be a great idea either, so they joined forces and created Unicode 2.0 with astral planes and surrogate pairs and all that expansion business.

The point is, we shouldn't blame the Unicode consortium for short-sightedness, we should blame scope-creep.

[+] masklinn|11 years ago|reply
> sadly now stuck in Java

If only it were only java…

It's also encoded in the JS spec, and throughout the Windows ecosystem… and anything spawned from the latter, which is why bloody UEFI uses UCS-2.

[+] guard-of-terra|11 years ago|reply
It's a shame they use ISO-8859-5 as an example because it was never used by anyone in practice. It's a stillborn standard.

First, we had IBM-866 and КОИ-8 aka KOI8-R, then painfully switched to WINDOWS-1251, and then Unicode. ISO-8859-5 was never adopted by anyone.

[+] TheLoneWolfling|11 years ago|reply
Here are my thoughts on unicode:

Options:

1) Use UTF-32 everywhere. When space is an issue, just compress it - especially on disk. If you need random access to a string, use a seekable compression algorithm on it on-the-fly. Alternatively, use a compression algorithm with checkpoints and maintain a sorted list of where checkpoints start and how far along in the associated decompressed text you are. (Effectively rolling your own.) Note that this method doesn't work well with writes. (A rough sketch of the checkpoint idea follows after this list.)

2) Use an interesting variant of a rope. Use a rope, but a) keep track of "logical characters" instead of code points - what unicode calls graphemes, IIRC, and b) have each node have an encoding - and restrict that all characters within a specific node have the same width. This allows for pretty much everything being sublinear. If you allow a bit in a node for "special" nodes (e.g. reversed, lazy-loaded, slice of another node, that sort of thing), reversing, among other things, is actually truly O(1). Bunches of optimizations here - you want to fall back to a "node" that's a flat array for small strings, you want to potentially use overlong encodings internally where appropriate (e.g. if you have 1 1-byte character in a bunch of 2-byte characters, that sort of thing), you want to have some encodings that aren't fixed-width (for things like reading a bunch of bytes from a file), you want to have an encoding that's "unknown" / binary data.
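A minimal sketch of the checkpoint idea from option 1, in Python (the chunk size, the zlib choice and the class name are all made up for illustration): the text is stored as independently compressed fixed-size chunks of UTF-32, the chunk boundaries double as the sorted checkpoint list, and a random read only decompresses one chunk.

    import zlib

    class CheckpointedText:
        CHUNK = 4096  # characters per checkpoint (arbitrary)

        def __init__(self, text):
            self.length = len(text)
            # Each chunk is compressed on its own, so the chunk index is the checkpoint list.
            self.chunks = [
                zlib.compress(text[i:i + self.CHUNK].encode("utf-32-le"))
                for i in range(0, len(text), self.CHUNK)
            ]

        def __getitem__(self, i):
            # Random access: decompress only the chunk containing code point i.
            chars = zlib.decompress(self.chunks[i // self.CHUNK]).decode("utf-32-le")
            return chars[i % self.CHUNK]

    t = CheckpointedText("I \N{HEAVY BLACK HEART} Unicode " * 10000)
    print(t[2])  # the heart, recovered by decompressing a single chunk

As option 1 itself notes, writes are the weak spot: changing one character means recompressing (at least) its whole chunk.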

Thoughts:

1) Why on earth does any higher-level language still use byte or codepoint counts for length? And why don't lower-level languages at least have a way to count / index by graphemes?

2) I do not like UTF-8 / 16. It's effectively bad Huffman encoding. It's an attempt to save space, but it doesn't even do that well. About the only advantage of UTF-8 is that ASCII maps to it reasonably well. And it has a bunch of disadvantages, chief among them being that if you write a single multibyte character, you potentially have to rewrite the entire string.
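The "rewrite the entire string" complaint is easy to see in Python (a throwaway sketch, nothing more): widening one character shifts the byte offset of everything after it in a UTF-8 buffer, whereas in UTF-32 only that character's four bytes change.

    before = "cafe tour".encode("utf-8")
    after  = "café tour".encode("utf-8")
    print(len(before), len(after))                      # 9 10
    print(before.index(b"tour"), after.index(b"tour"))  # 5 6 -- everything after the é moved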

[+] masklinn|11 years ago|reply
> Use UTF-32 everywhere. When space is an issue, just compress it - especially on disk.

> I do not like UTF-8 / 16. It's effectively bad huffman encoding. It's an attempt to save space, but it doesn't even do that well.

UTF-8 + gzip is 32% smaller than UTF-32 + gzip using the HN frontpage as corpus. Even using xz, it's a 13% gain.
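The comparison is easy to redo on whatever text you have handy; a rough sketch in Python (frontpage.html is a stand-in for whichever corpus you use):

    import gzip

    text = open("frontpage.html", encoding="utf-8").read()
    u8  = len(gzip.compress(text.encode("utf-8")))
    u32 = len(gzip.compress(text.encode("utf-32-le")))
    print(u8, u32, "UTF-8 is %.0f%% smaller" % (100 * (1 - u8 / u32)))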

> About the only advantage of UTF-8 is that ASCII maps to it reasonably well.

That's a pretty huge advantage, and a big reason why UTF-8 is actually popular. Another one is that UTF-8, being byte-based, does not care about byte order. UTF-32 is split between BE and LE, and requires either out-of-band byte-order communication or a BOM.
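You can see the byte-order problem by encoding the same character a few ways in Python: the LE and BE forms are mirror images, the plain "utf-32" codec has to prepend a BOM to say which one it is, and UTF-8 produces the same bytes everywhere.

    s = "A"
    print(s.encode("utf-32-le").hex())  # 41000000
    print(s.encode("utf-32-be").hex())  # 00000041
    print(s.encode("utf-32").hex())     # fffe000041000000 (BOM first; order is machine-dependent)
    print(s.encode("utf-8").hex())      # 41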

> Why on earth does any higher-level language still use byte or codepoint counts for length?

Because it's easy, and generally O(1) in these languages. Can also be useful to know how much space it'll take when stored, which really is about the only useful thing to do with a string length.

> And why don't lower-level languages at least have a way to count / index by graphemes?

Counting graphemes is no more useful than counting bytes or codepoints. You could provide a grapheme cluster count, but:

1. that's O(n) period

2. it serves very little purpose since clusters don't have a fixed width, not even with a fixed-width font

3. clusters can be locale-dependent ("tailored" clusters) although the default set is locale-independent. Now you need to ponder whether you include tailored clusters, don't include them, or optionally include them

4. clusters and glyphs are independent, "ch" is a grapheme cluster in Slovak but two glyphs on-screen, whereas an "fi" ligature is a single glyph but two clusters

[+] jrochkind1|11 years ago|reply
> About the only advantage of UTF-8 is that ASCII maps to it reasonably well.

Yep, and it's a HUGE advantage, I think it accounts for much of the success of unicode adoption.

[+] nabla9|11 years ago|reply
User-perceived characters are not graphemes, they are grapheme clusters.

You can look at unicode strings on at least four different levels of abstraction: bytes, code points, code units and grapheme clusters. The only advantage UTF-32 has over the others is that every code point fits into a single code unit (at least I think so).

If you want a vector where each user-perceived character and whitespace matches one element in the vector, probably the easiest way is to create a vector where each element is a short unicode string that matches one grapheme cluster.
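A sketch of that vector-of-clusters idea, using the third-party regex module (not the stdlib re), whose \X pattern matches default grapheme clusters:

    import regex  # third-party: pip install regex

    s = "cafe\u0301 \U0001F1EB\U0001F1F7"  # "café" with a combining accent, then a flag
    clusters = regex.findall(r"\X", s)

    print(len(s.encode("utf-8")))  # bytes
    print(len(s))                  # code points
    print(len(clusters))           # grapheme clusters, one per user-perceived character
    print(clusters)                # ['c', 'a', 'f', 'é', ' ', '🇫🇷']

Tailored (locale-dependent) clusters are another matter, as masklinn notes above; \X only gives you the default, locale-independent segmentation.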

[+] deathanatos|11 years ago|reply
> 1) Why on earth does any higher-level language still use byte or codepoint counts for length?

For the higher-level languages, I believe both Haskell and Python¹ now return code point lengths when their length function is called on a string.
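For example, in Python 3 (3.3 or later, where the internal representation stopped leaking into len):

    s = "na\u00efve \U0001F600"    # "naïve" plus an emoji outside the BMP
    print(len(s))                  # 7 code points
    print(len(s.encode("utf-8")))  # 11 bytes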

> And why don't lower-level languages at least have a way to count / index by graphemes?

Even getting a code point count is difficult in most of those, sadly.

> If you need random access to a string

I really think that random access is not something you greatly need for working with strings, and that most operations are going to scan (linearly) into the string. (For example, splitting on a character requires first finding that character, which is a linear scan that can return an iterator to that position: random indexing is not required.) Sadly, most languages I've worked with, with the exception of C++, do not make great use of the concept of iterators.
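For instance, splitting on the first "=" never needs an index on the caller's side; in Python the linear scan happens inside partition and you only touch the resulting pieces:

    head, sep, tail = "encoding=utf-8".partition("=")
    print(head)  # encoding
    print(tail)  # utf-8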

¹A recent version of Python 3 is required.

[+] grimgrin|11 years ago|reply
Love Unicode? Then Butts Institute may be for you!

http://butts.institute

But in all seriousness, you may enjoy this thing my friend cooked up.

"With over a million billion codepoints, Unicode offers a vast array of unique characters — perfect for microblogging. [Butts Institute] helps you keep your own personal Unicode character updated, instantly, as often as you like! It's fast, convenient, fun, social, and totally free!"

Just make an OPTIONS request to:

curl -X OPTIONS http://u.butts.institute

https://gist.githubusercontent.com/shmup/e92dad275bcca9287aa...

[+] asgard1024|11 years ago|reply
Unicode kinda jumped the shark with all the emoji... they might as well encode all frequent words/meanings.

Though I like them. The one emoticon I always miss is a "shrug", something meaning either "I don't know" or "I don't care".

[+] thristian|11 years ago|reply
One of the principles of Unicode is 1:1 round-trip encoding with every other encoding, which is why there are so many pre-composed letters and accents even though the individual letters and accents are available separately.
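A small Python illustration of the two forms coexisting: U+00E9 is the pre-composed letter (kept so legacy encodings round-trip), "e" plus U+0301 is the decomposed spelling, and normalization converts between them.

    import unicodedata

    precomposed = "\u00e9"   # é as a single code point
    decomposed  = "e\u0301"  # e followed by a combining acute accent
    print(precomposed == decomposed)                                # False
    print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
    print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True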

All the encoded emoji have been used for decades by Japanese phones; it's certainly not as well-thought-out as the rest of Unicode, but Apple and Google needed to have some common representation of those characters in their Unicode-based software so they could interoperate with Japanese telecommunications infrastructure. Adding emoji to Unicode was the least terrible option.

[+] Animats|11 years ago|reply
I've seen Unicode characters for Facebook, Twitter, etc. icons. So far, they've been in user-defined fonts, in user-defined expansion space. But I suspect there will be pressure to put them in the standard.
[+] beefburger|11 years ago|reply
Another missing symbol is the Unicode logo.
[+] rwg|11 years ago|reply
Another fun Unicode-related bug in OS X 10.9:

    % printf 'Unicode strike\xcd\x9bs again' | LANG=en_US.UTF-8 od -tc
    Assertion failed: (width > 0), function conv_c, file /SourceCache/shell_cmds/shell_cmds-175/hexdump/conv.c, line 137.
    0000000    U   n   i   c   o   d   e       s   t   r   i   k   e 
    zsh: done       printf 'Unicode strike\xcd\x9bs again' | 
    zsh: abort      LANG=en_US.UTF-8 od -tc
I don't know if this is fixed in OS X 10.10 — I filed a bug with Apple a year ago, but it was marked as a duplicate of another bug. The only thing I can see about that other bug is that it's now closed.
[+] kalleboo|11 years ago|reply
Not crashing for me in 10.10

    ~$ printf 'Unicode strike\xcd\x9bs again' | LANG=en_US.UTF-8 od -tc
    0000000    U   n   i   c   o   d   e       s   t   r   i   k   e    ͛  **
    0000020    s       a   g   a   i   n                                    
    0000027
[+] kcdr|11 years ago|reply
Looks like it is fixed

    $ printf 'Unicode strike\xcd\x9bs again' | LANG=en_US.UTF-8 od -tc
    0000000    U   n   i   c   o   d   e       s   t   r   i   k   e    ͛
    0000020    s       a   g   a   i   n
    0000027

[+] walrus|11 years ago|reply
To whoever changed the title: it really was supposed to be "I � Unicode", not "I Love Unicode". The "�" symbol is embedded in the document as a raster image, so the author really meant for it to be that; it wasn't just a font rendering issue on your end.
[+] dang|11 years ago|reply
We changed it back.

Edit: Normally we take attention-grabbing Unicode glyphs out of titles since they disrupt the placid bookishness of HN's front page. But this one is so tasteful and content-appropriate that it seems obviously a special case.

[+] beefburger|11 years ago|reply
Author here. The title was a pun. � is U+FFFD REPLACEMENT CHARACTER which may possibly replace a badly encoded heart.
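You can manufacture the pun yourself (a throwaway Python sketch; the cover's heart may of course be a different code point): chop a UTF-8 heart in half and a lenient decoder hands back U+FFFD.

    heart = "\u2764".encode("utf-8")  # HEAVY BLACK HEART, b'\xe2\x9d\xa4'
    print((b"I " + heart[:2] + b" Unicode").decode("utf-8", errors="replace"))
    # I � Unicode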