
Unicode nearing 50% of the web

43 points | wglb | 16 years ago | googleblog.blogspot.com | reply

29 comments

[+] viraptor|16 years ago|reply
As a person with a Unicode (or ISO-which-no-one-ever-uses) first name, I'd like to thank all of you who enabled Unicode on your databases and pages. It makes my life much easier and more pleasant when I can use my real name during registration. Even if I keep getting my snail mail with '?'s, '+'s or urlencoded...
[+] quant18|16 years ago|reply
I'd really rather go back to the old way when there were lots of competing national encodings for each language, and the actual users of that language could vote with their web pages/documents for which one they preferred. Instead we have this one overarching encoding whose subparts were fixed for all time by fiat from committees before being put into practical use, and as you might expect, some of those committees really screwed things up.

For example, some fine upstanding gentlemen decided that in Unicode (and GB-18030), Mongolian ᠣ and ᠤ, which are printed/handwritten exactly the same, shall be two "different letters" U+1823 and U+1824, but the different forms of ᠳ are the "same letter" U+1833. (And of course, there's ᡩ U+1869 which looks like what you want for some forms of U+1833, but you're not supposed to use it because it's "only for Xibe").

The closest analogy I can give in English is an encoding which forced you to use different "a" codepoints for the characters in apple vs. fake because of their different pronunciations, while making a single codepoint for "k" and "ck" and "c" (but only sometimes) because they sound the same. If you ever saw an encoding like that, you'd no doubt say to yourself: "WTF? I'm not using this, I'll stick with ASCII/EBCDIC/Morse code, thank you very much".
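
For anyone who wants to poke at this distinction directly, a quick Python check (using only the standard library's unicodedata tables) shows the two codepoints really are separate letters as far as Unicode is concerned:

```python
import unicodedata

# U+1823 and U+1824 are distinct codepoints even though (per the
# comment above) traditional Mongolian writes them identically.
o, u = "\u1823", "\u1824"
print(o == u)               # False: different codepoints
print(unicodedata.name(o))  # MONGOLIAN LETTER O
print(unicodedata.name(u))  # MONGOLIAN LETTER U
```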

[+] viraptor|16 years ago|reply
Please no... That approach ended up with a simple language like Polish having Latin-2, cp852 (or something like that), Mazovia, Mazovia 2, and probably some more homebrew encodings. I can't even imagine what would happen with completely different scripts like the Indic ones.

There are problems with Unicode, so let's resolve them. I still want to be able to address my email to the real name of a person in language A, living at an address in country B, while signing the email properly in language C (where all three parts use language-specific characters). Unicode is the first standard that lets me do that in most cases, so I'd call it a step in the right direction.

[+] mooism2|16 years ago|reply
By "Unicode" they mean "UTF-8".
[+] pmjordan|16 years ago|reply
None of the other explicitly listed encodings are Unicode encodings, and the "other" category is tiny, so the statement is still true. Some browsers don't even support the other Unicode encodings, so this doesn't surprise me. UTF-16 is the only one that even stands a chance; I've never seen UTF-32 used for files, and I've never seen UTF-7 used at all. I suspect UTF-16 is more efficient than UTF-8 for East Asian scripts, but that advantage probably dwindles once content is gzipped.
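
To make the byte-count claim concrete, here's a small Python sketch; the sample string is just illustrative, and actual gzip ratios will vary with real content:

```python
import gzip

text = "日本語のテキスト" * 100    # illustrative East Asian sample
utf8 = text.encode("utf-8")       # BMP CJK/kana: 3 bytes per character
utf16 = text.encode("utf-16-le")  # BMP characters: 2 bytes each

print(len(utf8), len(utf16))      # raw UTF-16 is smaller for this text
print(len(gzip.compress(utf8)), len(gzip.compress(utf16)))  # compare after gzip
```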

It's good news that Google are now decomposing ligature codepoints, although I do wish they had a version of their search that was literal; especially with programming-related and other technical searches, the special characters it filters out are often crucial.
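
The ligature decomposition mentioned above corresponds to Unicode's compatibility normalization; in Python it looks like this:

```python
import unicodedata

# U+FB01 is the single-codepoint "fi" ligature; compatibility
# normalization (NFKC) splits it into plain "f" + "i".
lig = "\ufb01"
print(unicodedata.normalize("NFKC", lig))          # fi
print(unicodedata.normalize("NFKC", lig) == "fi")  # True
```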

[+] ssp|16 years ago|reply
Well, UTF-8 is rather brilliant in many ways.

It can represent all of Unicode while remaining backwards compatible with ASCII and C string representations.

The drawback, that characters have variable encoding, is almost irrelevant, because if you are using character indices, you are almost certainly doing it wrong anyway.
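
Both points are easy to check in Python (the sample word is arbitrary):

```python
s = "naïve"
b = s.encode("utf-8")

# ASCII text is byte-identical in UTF-8, and no character other than
# U+0000 ever produces a 0x00 byte, so C's NUL-terminated strings still work.
assert "hello".encode("utf-8") == b"hello"
assert 0 not in b

# Variable-length encoding: 5 characters, 6 bytes ("ï" takes two).
print(len(s), len(b))  # 5 6
# Indexing bytes is not indexing characters:
print(b[3:4])          # b'\xaf' -- half of "ï", not a character
```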

[+] spolsky|16 years ago|reply
Yes. The distinction between Unicode and UTF-8 is pretty important; Unicode IS NOT AN ENCODING.
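
A quick way to see the distinction: the codepoint is one number, and each encoding serializes it into different bytes. In Python:

```python
# One codepoint, three different byte serializations: Unicode assigns
# the number, an encoding decides the bytes.
ch = "é"                       # U+00E9
print(hex(ord(ch)))            # 0xe9 -- the codepoint, encoding-independent
print(ch.encode("utf-8"))      # b'\xc3\xa9'
print(ch.encode("utf-16-le"))  # b'\xe9\x00'
print(ch.encode("utf-32-le"))  # b'\xe9\x00\x00\x00'
```
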
[+] thirdstation|16 years ago|reply
I wonder how many of those sites are pushing Unicode without knowing it. I still see programmers and non-programmers scratching their heads over character encoding issues.
[+] zmimon|16 years ago|reply
I reckon there's a whole bunch pushing non-Unicode content without knowing it, but declaring UTF-8 anyway. Since the current versions of PHP don't even support Unicode (at least, not without going to very special pains), I suspect an awful lot of web sites are just shoving out content in non-Unicode encodings, calling it UTF-8, and wondering why every now and then a funny question mark shows up in someone's name.
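
That "funny question mark" failure mode is easy to reproduce; a short Python sketch of the classic encoding mismatch (the sample string is illustrative):

```python
# A page stored as UTF-8 but decoded as Latin-1 produces classic mojibake.
name = "café"
wrong = name.encode("utf-8").decode("latin-1")
print(wrong)  # cafÃ©

# And Latin-1 bytes decoded as UTF-8 often simply fail outright:
try:
    "café".encode("latin-1").decode("utf-8")
except UnicodeDecodeError:
    print("invalid UTF-8")
```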
[+] happenstance|16 years ago|reply
Why does MySQL still default to latin1? (Or am I mistaken?)
[+] ars|16 years ago|reply
Just run:

ALTER DATABASE database_name DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;

And you don't have to worry about it for any new tables. (If you have existing ones, you'll have to change the table default, and possibly the column too.)

[+] wvenable|16 years ago|reply
Because changing the defaults might mean a world of hurt for people not expecting it. You should just always be explicit.
[+] gchpaco|16 years ago|reply
It's interesting that UTF-8 has largely achieved its success amongst English speakers and Latin-1 languages -- almost everything else has remained more or less static or is on a slow downward trend. Since we're not seeing an increase in UTF-16 or UCS-2 or anything else like that, this would seem to be evidence that the Web, or at least Google's view of it, is becoming even more dominated by Western European languages, which is itself an interesting idea.
[+] jmillikin|16 years ago|reply

    Since we're not seeing an increase in UTF-16 or UCS-2 or anything else like that, this would seem evidence that the Web, or at least Google's view of it, is becoming even more increasingly dominated by Western European languages, which is itself an interesting idea.
This assumes that new pages (in languages with non-Latin scripts) are likely to use national encodings, which (in my experience) is not true. Given any popular website written in Arabic, Cyrillic, Hangul, Kanji, etc., chances are it will be in UTF-8. The only times I see encodings like SJIS, KOI8, or ISO-8859-1 any more are on old, old pages written before the internet's mass popularity, or on amateur/very small websites. The former were created before UTF-8; the latter by unskilled or inexperienced authors.
[+] pmjordan|16 years ago|reply
I don't see how you're inferring language share from UTF-8's share. Being a Unicode encoding, it can represent text in all popular scripts; that's kind of the point.
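
That point is easy to demonstrate: UTF-8 round-trips text in any script, so its share says nothing about which languages are being written. A quick Python check (sample phrases are arbitrary):

```python
# UTF-8 handles every script, so encoding share != language share.
samples = ["Привет", "مرحبا", "안녕하세요", "こんにちは", "हिन्दी"]
for s in samples:
    assert s.encode("utf-8").decode("utf-8") == s
print("all round-trip cleanly")
```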