
Why we can't process Emoji anymore

258 points| tpinto | 13 years ago |gist.github.com

152 comments

[+] oofabz|13 years ago|reply
This is why UTF-8 is great. If it works for any Unicode character it will work for them all. Surrogate pairs are rare enough that they are poorly tested. With UTF-8, if there are issues with multi-byte characters, they are obvious enough to get fixed.

UTF-16 is not a very good encoding. It only exists for legacy reasons. It has the same major drawback as UTF-8 (variable-length encoding) but none of the benefits (ASCII compatibility, size efficiency).
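The size trade-off both comments touch on is easy to check empirically. A small sketch, assuming Node.js and its Buffer API (the byte counts themselves follow from the encoding rules, not from Node):

```javascript
// Byte cost of the same characters in UTF-8 vs UTF-16.
const samples = {
  ascii:  'A',          // U+0041, in the ASCII range
  bmp:    '汉',         // U+6C49, in the BMP but outside ASCII
  astral: '\u{1F604}',  // U+1F604, an Emoji outside the BMP
};

for (const [name, s] of Object.entries(samples)) {
  const utf8  = Buffer.byteLength(s, 'utf8');
  const utf16 = Buffer.byteLength(s, 'utf16le');
  console.log(name, { utf8, utf16 });
}
// ascii:  utf8 = 1, utf16 = 2  -- UTF-16 doubles the cost of ASCII
// bmp:    utf8 = 3, utf16 = 2  -- UTF-16 is smaller for CJK text
// astral: utf8 = 4, utf16 = 4  -- both need 4 bytes (UTF-16 via a surrogate pair)
```

So "size efficient" depends entirely on the text: UTF-8 wins for ASCII-heavy content, UTF-16 for BMP-heavy CJK content, and neither for Emoji.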

[+] notJim|13 years ago|reply
This comment is somewhat misleading. The issue at hand is orthogonal to any of the benefits of UTF-8 over UTF-16 (which are real; UTF-8 is great, and you should use it).

4-byte characters in UTF-8 are just as rare as surrogate pairs in UTF-16, because both are used to represent non-BMP characters. As a result, there is software that handles 3-byte characters (i.e., a huge percentage of what you'll ever see) but doesn't handle 4-byte characters.

MySQL is a high-profile example of software which, until recently, had this problem: http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8m....

[+] pixelcort|13 years ago|reply
The problem with UTF-8 is that lots of tools have 3 byte limits, and characters like Emoji take up 4 bytes in UTF-8.
[+] est|13 years ago|reply
UCS-2 is better than UTF-8 internally because it counts Unicode characters faster than UTF-8. Every character is just 2 bytes, instead of 1, 2, 3 or even 4 bytes.

In Python (2.x):

    len(u'汉字') == 2
    len( '汉字') == 4 # or maybe 6, it varies based on console encoding and CPython options
    len(u'汉字'.encode('utf8')) == 6
[+] ender7|13 years ago|reply
Apropos: http://mathiasbynens.be/notes/javascript-encoding

TL;DR:

- Javascript engines are free to internally represent strings as either UCS-2 or UTF-16. Engines that choose UCS-2 tend to replace all glyphs outside of the BMP with the replacement char (U+FFFD). Firefox, IE, Opera, and Safari all do this (with some inconsistencies).

- However, from the point of view of the actual JS code that gets executed, strings are always UCS-2 (sort of). In UTF-16, code points outside the BMP are encoded as surrogate pairs (4 bytes). But -- if you have a Javascript string that contains such a character, it will be treated as two consecutive 2-byte characters.

  var x = '𝌆';
  x.length; // 2
  x[0];     // \uD834
  x[1];     // \uDF06
Note that if you insert said string into the DOM, it will still render correctly (you'll see a single character instead of two ?s).
[+] praptak|13 years ago|reply
Sometimes you need to know about encodings, even if you're just a consumer. Putting just one non 7-bit character in your SMS message will silently change its encoding from 7-bit (160 chars) to 8-bit (140 chars) or even 16 bit (70 chars) which might make the phone split it into many chunks. The resulting chunks are billed as separate messages.
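The billing cliff described above can be sketched in code. This is a simplification: the real GSM 03.38 7-bit alphabet is not plain ASCII (it includes some accented letters and has two-septet escape sequences), but approximating it as printable ASCII is enough to show the mechanics:

```javascript
// Rough sketch of SMS segmentation: one non-7-bit character flips the
// whole message to UCS-2, shrinking the per-message budget from 160 to 70.
function smsSegments(text) {
  // Approximation: treat printable ASCII as the GSM 7-bit alphabet.
  const fitsGsm7 = [...text].every(ch => ch >= ' ' && ch <= '~');
  const single = fitsGsm7 ? 160 : 70;  // chars that fit in one SMS
  const chunk  = fitsGsm7 ? 153 : 67;  // multipart chunks lose header space
  const len = text.length;             // code units, matching UCS-2 counting
  return len <= single ? 1 : Math.ceil(len / chunk);
}

smsSegments('a'.repeat(160));            // 1 message
smsSegments('ż' + 'a'.repeat(159));      // 3 messages: same length, one non-GSM char
```

The 153/67 figures account for the user-data header that concatenated messages carry in each chunk.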
[+] fwr|13 years ago|reply
On iOS, using any non-basic Latin character in SMS makes it switch to 16 bit, even when there is no reason for that to happen. It's a thing that most foreign language speakers must live with.

With this excuse-laden write-up, this guy wasted a substantial amount of time he could have spent researching the issue. Your consumer doesn't care how many bits Emoji takes; it doesn't matter to them that you're running your infrastructure on poorly chosen software. There is absolutely no excuse for not supporting this in a native iOS app, especially now that Emoji is so widely used and deeply integrated in iOS.

How is that a problem they are focusing on, anyway, when their landing page features awful, out of date mockups of the app? (not even actual screenshots - notice the positions of menu bar items) They are also featuring Emoji in every screenshot - ending support might be a fresh development, but I still find that ironic.

[+] pjscott|13 years ago|reply
The quick summary, for people who don't like ignoring all those = signs, is that V8 uses UCS-2 internally to represent strings, and therefore can't handle Unicode characters which lie outside the Basic Multilingual Plane -- including Emoji.
[+] bbotond|13 years ago|reply
Honestly that's a shame.
[+] driverdan|13 years ago|reply
If you search for V8 UCS-2 you'll find a lot of discussion on this issue dating back at least a few years. There are ways to work around V8's lack of support for surrogate pairs. See this V8 issue for ideas: https://code.google.com/p/v8/issues/detail?id=761

My question is why does V8 (or anything else) still use UCS-2?

[+] gsnedders|13 years ago|reply
The ES5 spec defines a string as being a series of UTF-16 code-units, which inherently means surrogates show through.

APIs like that tend to be low priority because they aren't used by browsers (which pass everything through as UTF-16 code-units, typically treating them as possibly-valid UTF-16 strings).

[+] masklinn|13 years ago|reply
> My question is why does V8 (or anything else) still use UCS-2?

Because the ES spec defines a string as a sequence of UTF-16 code units (aka UCS-2 with visible surrogates), and because, like many other languages (e.g. Java), its strings were created during/inherited from Unicode 1.0, which fit in 16 bits (UTF-16 is a retrofit of Unicode 1.0's fixed width to accommodate the full range of later Unicode versions by adding surrogate pairs).
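The surrogate-pair retrofit is simple arithmetic: each pair encodes a 20-bit offset above U+10000. A sketch using only the ES5-era charCodeAt API (String.prototype.codePointAt arrived later, in ES2015):

```javascript
// Decode the code point at code-unit index i, pairing surrogates by hand.
function codePointAt(s, i) {
  const hi = s.charCodeAt(i);
  if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < s.length) {
    const lo = s.charCodeAt(i + 1);
    if (lo >= 0xDC00 && lo <= 0xDFFF) {
      // High surrogate carries the top 10 bits, low surrogate the bottom 10.
      return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
    }
  }
  return hi; // BMP character, or an unpaired surrogate left as-is
}

codePointAt('\uD834\uDF06', 0); // 0x1D306 -- the '𝌆' from the comment above
```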

[+] hkmurakami|13 years ago|reply
>Wow, you read though all of that? You rock. I'm humbled that you gave me so much of your attention.

That was actually really fun to read, even as a now non-technical guy. I can't put a finger on it, but there was something about his style that gave off a really friendly vibe even through all the technical jargon. That's a definite skill!

[+] jgeorge|13 years ago|reply
DeSalvo's source comments have always been an entertaining read. :)
[+] beaumartinez|13 years ago|reply
This is dated January 2012. By the looks of things, this was fixed in March 2012[1]

[1] https://code.google.com/p/v8/issues/detail?id=761#c33

[+] Cogito|13 years ago|reply
I wonder if this has been rolled into Node yet.

[edit] Node currently uses V8 version 3.11.10.25, which was released after this fix was made, but not sure if the fix was merged to trunk

[edit2] actually, looks like it has, though I can't identify the merge commit

[+] pbiggar|13 years ago|reply
A couple of reasons why it makes sense for V8 and other vendors to use UCS2:

- The spec says UCS2 or UTF16. Those are the only options.

- UCS2 allows random access to characters, UTF-16 does not.

- Remember how the JS engines were fighting for speed on arbitrary benchmarks, and nobody cared about anything else for 5 years? UCS2 helps string benchmarks be fast!

- Changing from UCS2 to UTF-16 might "break the web", something browser vendors hate (and so do web developers)

- Java was UCS2. Then Java 5 changed to UTF-16. Why didn't JS change to UTF-16? Because a Java VM only has to run one program at once! In JS, you can't specify a version, an encoding, and one engine has to run everything on the web. No migration path to other encodings!
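The random-access point can be made concrete. Under the UCS-2 assumption, the n-th character is just `s[n]`, O(1); once surrogate pairs exist, finding the n-th code point requires a linear scan. A sketch of that scan (an illustration, not V8's actual implementation):

```javascript
// Return the code unit starting the n-th *code point* of s.
// With fixed-width UCS-2 this would simply be s.charCodeAt(n).
function nthCodePointStart(s, n) {
  let i = 0;
  while (n-- > 0 && i < s.length) {
    const c = s.charCodeAt(i);
    // A high surrogate means the code point spans two code units.
    i += (c >= 0xD800 && c <= 0xDBFF) ? 2 : 1;
  }
  return s.charCodeAt(i);
}

nthCodePointStart('a\uD834\uDF06b', 2); // 98 ('b') -- the pair counted as one
```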

[+] cmccabe|13 years ago|reply
> UCS2 allows random access to characters, UTF-16 does not.

I'm not sure if that's really true. On IBM's site, they define 3 levels of UCS-2, only one of which excludes "combining characters" (really code points).

http://pic.dhe.ibm.com/infocenter/aix/v6r1/index.jsp?topic=%...

If you have combining characters, then you can't simply take the number of bytes and divide by 2 to get the number of letters. If you don't have combining characters, then you have something which isn't terribly useful except for European languages (I think?)

Maybe someone more familiar with the implementation can describe which path they actually went down for this... given what I've heard so far, I'm not optimistic.

[+] languagehacker|13 years ago|reply
We seem to be seeing this more and more with Node-based applications. It's a symptom of the platform being too immature. This is why you shouldn't adopt these sorts of stacks unless there's some feature they provide that none of the more mature stacks support yet. And even then, you should probably ask yourself if you really need that feature.
[+] freedrull|13 years ago|reply
Why on earth would the people who wrote V8 use UCS-2? What about alternative JS runtimes?
[+] marshray|13 years ago|reply
Because Unicode was sold to the world's software developers as a fixed-width encoding claiming 16 bits would be all we'd ever need.
[+] eps|13 years ago|reply
They control their clients, so they could've just re-encoded emoji with a custom 16-bit escaping scheme, made the backend transparently relay it in escaped form, and decoded it back to 17 bits at the other end.

Or am I missing something obvious here?

[+] kstenerud|13 years ago|reply
Small nitpick, but Objective-C does not require a particular string encoding internally. In Mac OS and iOS, NSString uses one of the cfinfo flags to specify whether the internal representation is UTF-16 or ASCII (as a space-saving mechanism).
[+] dgreensp|13 years ago|reply
The specific problems the author describes don't seem to be present today; perhaps they were fixed. That's not to say these conversions aren't a source of issues, just that I don't see any show-stopper problems currently in Node, V8, or JavaScript.

In JavaScript, a string is a series of UTF-16 code units, so the smiley face is written '\ud83d\ude04'. This string has length 2, not 1, and behaves like a length-2 string as far as regexes, etc., are concerned, which is too bad. But even though you don't get the character-counting APIs you might want, the JavaScript engine knows this is a surrogate pair and represents a single code point (character). (It just doesn't do much with this knowledge.)

You can assign '\ud83d\ude04' to document.body.innerHTML in modern Chrome, Firefox, or Safari. In Safari you get a nice Emoji; in stock Chrome and Firefox, you don't, but the empty space is selectable and even copy-and-pastable as a smiley! So the character is actually there, it just doesn't render as a smiley.

The bug that may have been present in V8 or Node is: what happens if you take this length-2 string and write it to a UTF8 buffer, does it get translated correctly? Today, it does.

What if you put the smiley directly into a string literal in JS source code, not \u-escaped? Does that work? Yes, in Chrome, Firefox, and Safari.
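The round trip described above can be checked directly, assuming Node.js and its Buffer API:

```javascript
// A surrogate-pair string through a UTF-8 buffer and back.
const smiley = '\ud83d\ude04';
console.log(smiley.length);      // 2: two UTF-16 code units

const bytes = Buffer.from(smiley, 'utf8');
console.log(bytes.length);       // 4: one 4-byte UTF-8 sequence,
                                 //    not two 3-byte encoded surrogates

const back = bytes.toString('utf8');
console.log(back === smiley);    // true: the pair survived the round trip
```

A broken implementation would encode each surrogate separately (so-called CESU-8), producing 6 bytes instead of 4.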

[+] dale-cooper|13 years ago|reply
The UCS-2 heritage is kind of annoying. In Java, for example, chars (the primitive type, which the Character class just wraps) are 16 bits, so one Character instance may not be a full "character" but rather part of a surrogate pair. This creates a small gotcha where the length of a string might not be the same as the number of characters it contains, and you can't split/splice a char array naively (because you might split it at a surrogate pair).
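The same gotcha, shown in JavaScript terms (the Java point carries over directly, since both use 16-bit code units):

```javascript
// Slicing at a code-unit index can cut a surrogate pair in half,
// leaving a lone surrogate in each half.
const s = 'ab\ud83d\ude04cd';  // 'ab😄cd': 6 code units, 5 code points
const left  = s.slice(0, 3);   // ends with the lone high surrogate \ud83d
const right = s.slice(3);      // starts with the lone low surrogate \ude04

console.log(left.length + right.length === s.length); // true: lengths add up
// ...but neither half is well-formed UTF-16 anymore:
console.log(/[\ud800-\udbff]$/.test(left));           // true: dangling surrogate
```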
[+] masklinn|13 years ago|reply
Which, at the end of the day, doesn't really matter since a code point is not a "character" in the sense of "the smallest unit of writing" (as interpreted by an end-user): many "characters" may (depending on the normalization form) or will (jamo) span multiple codepoints. Splitting on a character array is always broken, regardless of surrogate pairs.
[+] eloisant|13 years ago|reply
Maybe nitpicking, but I don't think Softbank came up with Emoji. Emoji existed way before Softbank bought the Japanese Vodafone, and even before Vodafone bought J-Phone.

So emoji were probably invented by J-Phone, while Softbank was mostly taking care of Yahoo Japan.

[+] clebio|13 years ago|reply
Somewhat meta, but this would be one where showing subdomain on HN submissions would be nice. The title is vague enough that I assumed it was something to do with _Github_ not processing Emoji (which would be sort of a strange state of affairs...).
[+] pla3rhat3r|13 years ago|reply
I love this article. So often it has been difficult to explain to people why one set of characters can work while others will not. This lays out some great historical info that will be helpful going forward.