top | item 6274916

Plain Text Doesn’t Exist: Unicode and encodings demystified

48 points | perseus323 | 12 years ago | 10kloc.wordpress.com

38 comments

[+] stormbrew|12 years ago|reply
I feel like we're at a point now where articles that just try to 'demystify' unicode are almost teaching the controversy if they don't come out and actually say how you should deal with encodings in new apps.

It's about time we actually start pressing for the idea that utf-16 was a terrible idea and that utf-8 should be the dominant wire format for unicode, with ucs4 if you really need to have a linear representation.

Utf-16 is confusing, complicated, and implementations are routinely broken because the corner cases are rarer. I really hope we're not still stuck with it in 50 years.

[+] millstone|12 years ago|reply
I agree about wire formats. However, programming languages that use UTF-8 as the primary string representation tend to have inferior Unicode support to those that use UTF-16. I think this is because with UTF-8, a lot of stuff just seems to work and so it's easy to ignore the issues, while UTF-16 forces you to come to grips with the realities of Unicode.

For example, consider a function to open a file by name. If your strings are UTF-8, you can just pass your null terminated buffer to fopen() or something and things will work fine for most files. But if your strings are internally UTF-16, you have to think about which encoding to use, and you research it, and you discover that holy crap, this stuff differs across OSes, and so we better take this problem seriously.
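A small Python 3 sketch of the trap described here (the filename is just an illustration): the UTF-8 bytes of a name can usually be handed straight to a byte-oriented API like fopen(), while UTF-16 bytes are not interchangeable with them.

```python
# A filename with one non-ASCII character ("é", precomposed U+00E9).
name = "résumé.txt"

utf8_bytes = name.encode("utf-8")       # what a Unix fopen() would expect
utf16_bytes = name.encode("utf-16-le")  # a Windows-style wide string

# The two byte sequences differ, and the UTF-16 bytes are not even
# valid UTF-8 -- handing them to a byte-oriented API would fail or
# corrupt the name.
assert utf8_bytes != utf16_bytes
try:
    utf16_bytes.decode("utf-8")
except UnicodeDecodeError:
    print("UTF-16 bytes are not valid UTF-8")
```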

[+] aerolite|12 years ago|reply
More ethnocentrism. UTF-16 results in less bandwidth for any language that has non-ASCII characters.
[+] salmonellaeater|12 years ago|reply
What is the argument for utf-8 that doesn't also apply to utf-16, depending on what your common characters are? Applications that use Asian languages take a big hit from utf-8.
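The size trade-off is easy to measure; a Python sketch (UTF-16 counted without a byte-order mark):

```python
# Compare encoded sizes of the same text in UTF-8 and UTF-16.
# ASCII-heavy text is half the size in UTF-8; most CJK characters
# take 3 bytes in UTF-8 but only 2 in UTF-16.
samples = {"English": "hello world", "Japanese": "こんにちは"}

for label, text in samples.items():
    u8 = len(text.encode("utf-8"))
    u16 = len(text.encode("utf-16-le"))  # -le: no byte-order mark
    print(f"{label}: {u8} bytes UTF-8, {u16} bytes UTF-16")
# English: 11 bytes UTF-8, 22 bytes UTF-16
# Japanese: 15 bytes UTF-8, 10 bytes UTF-16
```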
[+] KayEss|12 years ago|reply
>Other common myths include: Unicode can only support characters up to 65,536

Not really a myth. This was UCS-2, and it was the situation when a load of important early adopters started with Unicode. Windows, Java, and JavaScript all got burnt by this and ended up with UTF-16 as a result. Even Python 2.x on Linux is UTF-16 under the covers :(

>Unicode is just a standard way to map characters to magic numbers and there is no limit on the number of characters it can represent.

Unicode now limits itself to 21 bits of data (codepoints up to U+10FFFF). This is what allows the surrogate pair coding of UTF-16.
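A sketch of the surrogate-pair arithmetic, using the constants from the UTF-16 scheme: subtract 0x10000, then split the remaining 20 bits across two 10-bit surrogates.

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Encode a codepoint above the BMP as a UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000               # at most 20 bits remain
    high = 0xD800 + (v >> 10)      # top 10 bits -> high surrogate
    low = 0xDC00 + (v & 0x3FF)     # low 10 bits -> low surrogate
    return high, low

# U+1F600 (an emoji) becomes the pair D83D DE00.
print([hex(u) for u in to_surrogate_pair(0x1F600)])  # ['0xd83d', '0xde00']
```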

[+] cbr|12 years ago|reply
At first there was no plain text. Then ASCII became standard enough that if you passed a file to someone they would be able to read it without you needing to tell them an encoding. Extensions of ASCII and other options spread, and there was no way to have plain text include characters beyond the 128 of ASCII. Over the past decade or so we've been converging on utf8, however, and the probability that some random piece of non-ASCII text is utf8 has become sufficiently high.

There is plain text, and it is utf8.

[+] derleth|12 years ago|reply
> The original ASCII standard defined characters from 0 to 126.

0 to 127. 127 is a power of 2 minus 1, which should be a hint; specifically, it's two to the seventh minus one. ASCII defines codepoints for all possible combinations of seven bits, which is 128 possible codepoints, so the enumeration ends at 127 if you count starting from zero, as computer programmers are wont to do.
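The arithmetic, spelled out in a couple of Python assertions:

```python
# Seven bits give 2**7 = 128 distinct values, numbered 0 through 127.
assert 2 ** 7 == 128
assert 2 ** 7 - 1 == 127
# The 128 ASCII codepoints run from 0x00 through 0x7F.
assert ord(chr(127)) == 127
```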

> all possible 127 ASCII characters

128 characters, as mentioned above.

> the ASCII guys, who by the way, were American

ASCII stands for American Standard Code for Information Interchange. The ethnocentrism was unfortunate but it isn't like you weren't warned.

> The numbers are called “magic numbers” and they begin with U+.

He can call them "magic numbers" but everyone else calls them "codepoints".

> UTF-8 was an amazing concept: it single handedly and brilliantly handled backward ASCII compatibility making sure that Unicode is adopted by masses. Whoever came up with it must at least receive the Nobel Peace Prize.

I'm sure Ken Thompson and Rob Pike will be happy to hear someone thinks that way.
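The trick is easy to demonstrate in a couple of lines (a Python sketch): UTF-8 encodes codepoints 0-127 as themselves, so pure-ASCII data and its UTF-8 encoding are byte-for-byte identical.

```python
text = "plain old ASCII text"
# Encoding as UTF-8 and as ASCII yields the very same bytes ...
assert text.encode("utf-8") == text.encode("ascii")
# ... so any pre-Unicode ASCII file already decodes cleanly as UTF-8.
assert b"plain old ASCII text".decode("utf-8") == text
```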

[+] lutusp|12 years ago|reply
> 0 to 127. 127 is a power of 2 minus 1, which should be a hint; in specific, it's two to the seventh minus one ...

To expand on this a tiny bit, a power of two minus one is called a "Mersenne number". If the number is prime, it's called ... wait for it ... a Mersenne prime.

When expressed in binary, Mersenne numbers are an uninterrupted series of one digits: 11111... of varying lengths.
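In Python, for example:

```python
# Mersenne numbers 2**n - 1 are an unbroken run of n one-bits.
for n in (3, 7, 13):
    print(n, bin(2 ** n - 1))
# 3 0b111
# 7 0b1111111
# 13 0b1111111111111
```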

http://mathworld.wolfram.com/MersenneNumber.html

[+] spacehunt|12 years ago|reply
>> UTF-8 was an amazing concept: it single handedly and brilliantly handled backward ASCII compatibility making sure that Unicode is adopted by masses. Whoever came up with it must at least receive the Nobel Peace Prize.

> I'm sure Ken Thompson and Rob Pike will be happy to hear someone thinks that way.

If that's the case then I think the designers of GB18030 deserve it more, because they achieved an encoding that is able to map all Unicode codepoints while being backwards compatible with GB2312, which is itself backwards compatible with ASCII.

But seriously, UTF-8 is the best thing since sliced bread after having dealt with we-thought-64K-is-enough-so-lets-all-use-16-bits UCS-2/UTF-16.
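Python ships a gb18030 codec, so the claim is easy to check: it round-trips codepoints from ASCII through the supplementary planes, and pure ASCII passes through unchanged.

```python
# GB18030 covers all of Unicode (via 4-byte sequences) while staying
# ASCII-compatible, much like UTF-8 does.
s = "ASCII + 汉字 + \U0001F600"
assert s.encode("gb18030").decode("gb18030") == s
assert "abc".encode("gb18030") == b"abc"
```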

[+] yuhong|12 years ago|reply
>ASCII stands for American Standard Code for Information Interchange. The ethnocentrism was unfortunate but it isn't like you weren't warned.

Yeah, I think it reflects how much technology development came from America back then, as it still does now.