How a comment on Hacker News led to 4½ new Unicode characters

[+] pja|9 years ago|reply

I got a change into Unicode 9.0 too!

It was just a tweak to emoji characters to mark them all as East Asian Full Width instead of Narrow or Ambiguous so that they displayed correctly when using a fixed width font in a terminal console. This probably only matters if you like to use emoji filenames (you mad person), but it felt like a wart so I reported it & had a short back and forth with the chair of the emoji-related subcommittee which resulted in a proposal which was eventually accepted by the committee into Unicode 9.0. The committee were great: took my tiny bug report seriously, wrote huge long treatises to justify the change & eventually voted it into the standard.

(This was pretty much my peak geek achievement of 2016 so far :) )

[+] rspeer|9 years ago|reply

Holy crap I appreciate this change! I thought they'd never fix it because of compatibility. Thanks for the effort you put in.

It's not that I use emoji filenames, it's that I deal with real-world natural language text all the time, including at the console.

(In terms of compatibility, my text-justifying function is going to stop working correctly for the period of time between when gnome-terminal updates to Unicode 9 and when Python 3.x does. Still worth it.)

[+] micro2588|9 years ago|reply

Will this eventually solve the "Julia does not like Pizza" issue (https://github.com/JuliaLang/julia/issues/3721)?

[+] Sharlin|9 years ago|reply

It also matters if you want to use emojis in irssi or equivalent :)

[+] yuhong|9 years ago|reply

The fun thing is that I don't think the Windows console can handle anything outside BMP.

[+] kens|9 years ago|reply

The success of the unicodepowersymbol proposal inspired me to suggest a couple characters to Unicode (the Bitcoin sign and IBM's group mark from 1960s mainframes, which were accepted). The point is that Unicode really is open to proposals from random people; you don't need to part of a big company to influence Unicode.

[+] hypertexthero|9 years ago|reply

Agreed.

Although I had never worked with the Unicode Consortium, I [submitted a proposal][1] for an international symbol for an observer and it was eventually accepted.

[1]: http://hypertexthero.com/logbook/2015/01/international-symbo...

[+] MichaelGG|9 years ago|reply

You don't need to be part of a big company, but it certainly helps. Especially if you want to stop a pentathalon or rifle character.

(Top result: http://www.cbc.ca/news/trending/rifle-emoji-dropped-unicode-...)

[+] rbanffy|9 years ago|reply

Just added the group mark (and these power symbols) to https://github.com/rbanffy/3270font/tree/develop. It was fun. Thanks for giving me an excuse.

http://imgur.com/Sx0lkM8

[+] PhasmaFelis|9 years ago|reply

It's really cool than these things can happen. Still, I can't help but feel like, in a couple of centuries, Unicode is gonna be a goddamn mess.

[+] kristianp|9 years ago|reply

What's the IBM group symbol? I can't find the words IBM in [1].

[1] http://www.unicode.org/charts/PDF/U2B00.pdf

[+] intopieces|9 years ago|reply

You don't even need to be part of a big company to go to their meetings. I went to their conference in San Jose just to see the proposal for the Chinese Take-out box/chopsticks/fortune cook emojis. I met a ton of nice people, too.

[+] paxcoder|9 years ago|reply

The company logo? Why? Would you please get the copyleft glyph in it? Seems more important.

[+] tangus|9 years ago|reply

How did the Unicode Consortium turn around. I remember 10 years ago they were refusing to add standard media icons because

>The scope of the Unicode Standard (and ISO/IEC 10646) does not extend to encoding every symbol or sign that bears meaning in the world.

>This list has been round and round and round on this -- regular as clockwork, about once a year, the topic comes up again. And I see no indication that the UTC or WG2 are any closer to concluding that bunches of icons should start being included in the character encoding standards simply on the basis of their being widespread and recognizable icons.

>Where is the defensible line between "Fast Forward" and "Women's Restroom" or "Right Lane Merge Ahead" or "Danger Crocodiles No Swimming"?

(http://www.unicode.org/mail-arch/unicode-ml/y2005-m08/0371.h...)

Now it looks they add whatever somebody thinks of. I guess it's related to the liberation from the BMP.

[+] shakethemonkey|9 years ago|reply

>The scope of the Unicode Standard (and ISO/IEC 10646) does not extend to encoding every symbol or sign that bears meaning in the world.

Until Unicode has a half-star character, it won't even be able to encode the average newspaper.

[+] kens|9 years ago|reply

Unicode is supposed to include symbols that appear in "running text", not standalone icons. So no on traffic signs for instance. (There are exceptions for historical reasons. And emoji are a totally separate story.)

[+] acz|9 years ago|reply

Let’s start working on "SVG over UTF" RFC, should we?

[+] hackuser|9 years ago|reply

As the story mentions regarding the off symbol (a circle), there are many visually identical code points that have different semantic meanings. But in this case, they added an additional semantic meaning to an existing code point.

So which is it? Does each code point represent a visual image? A semantic meaning? Both? It depends? Something else?

I've tried to decipher that on my own and only learned that the answer to these sorts of questions are complicated, because it's very complicated to represent all written human language via one set of rules.

So I know some of the answers to my questions above, but I'm hoping someone with real expertise can provide the fundamental rules/policies - if there are any.

[+] peterburkimsher|9 years ago|reply

TIL that the SI units all have Unicode symbols. http://www.marathon-studios.com/unicode/categories/So/Other_...

If people actually used these, it would make searching text for formulae much easier. Wikipedia editors and academic publishers, please note.

Also, there's no Unicode for screwdriver. Perhaps iFixit would like to campaign for that?

Congratulations on getting the power symbols in! When @edent writes "Will update ... when I stop dancing", was it "I got the power"?

[+] agumonkey|9 years ago|reply

I'm a bit confused about Unicode. It was a repository of linguistic symbols, not raw symbols. More and more it looks like wingdings. Isn't this putting burden on font support and Text processing (what's the lexicographic order of such symbols, using the abstract name ?) ?

[+] wmil|9 years ago|reply

They want every symbol used in a document to have a unique encoding, so that you can change fonts without losing meaning. Fonts like wingdings are a horrible hack.

The idea is one (complex) encoding that will represent the info until the end of time. It creates a lot of trouble, but it's still a good idea.

[+] Arnt|9 years ago|reply

Go read http://unicode.org/standard/principles.html. The committee follows its principles intelligently.

In this case, the codepoints were added in part because the proposers could show many printed works (user manuals, I guess) that included sentences such as "to turn the foobar on, press the ■ button", which shows that the glyph between "the" and "button" is in some way like the surrounding glyphs. Chessmen were added for similar reasons, even though very few people actually read either user manuals or chess literature.

[+] the_mitsuhiko|9 years ago|reply

The difference between an icon and a letter is small and unclear. & is a symbol but was considered a letter as an example. Chinese characters are words etc.

[+] gkya|9 years ago|reply

That's what happens when you put good things at the hands of WWW.

[+] cheez|9 years ago|reply

We may think that we are enlightened beings but the fact is that pictures comprise a lot of how we communicate now and in the past. Are emojis that different from hieroglyphics?

[+] mcv|9 years ago|reply

I agree. I do see the usefulness of these symbols, but I'm not sure why emojis need official support.

[+] Semiapies|9 years ago|reply

Last I checked, Unicode don't actually have anything like coverage of the entirety of every script and alphabet. On the other hand, approving emoji and random icons delights Westerners.

[+] Animats|9 years ago|reply

But why? The trend towards putting icons into Unicode may be a mistake. Unless it's a symbol one uses in a sentence, there's no real reason to have it in Unicode. Unicode should not be viewed as a standard clip art library.

[+] yk|9 years ago|reply

I already ranted about unicode earlier today, my main argument is, that unicode is what happens if everybody qualified thinks: "That's a great idea, of course you have to handle X and Y and Z and I just remember that I forgot to fill out several warranty cards."

This blog post is a nice example, I have absolutely no idea how these new code points are supposed to look like, since I only spend an afternoon to implement the unicode best practices from the Arch wiki, instead of subscribing to some unicode standard mailing list. (Except the one symbol which was redefined to a symbol that does not carry the semantic meaning of "standby symbol" anywhere outside of the unicode standard.)

In my opinion there are two ways forward, one burn the entire thing. Or alternatively, force the unicode committee to produce an authoritative and complete font, in triplicate, and in their own blood.

[+] johncolanduoni|9 years ago|reply

The Unicode tables include examples for all graphical code points: http://unicode.org/charts/. If you really wanted you can make them into a font (most of them seem to be vectorized), but since I'm guessing you see most of the added code points as useless why do you care if they show up as boxes? What harm is this stuff causing or going to cause to the standard? We have hundreds of thousands of unassigned code points.

Meanwhile, a lot of the "Ys and Zs" added to Unicode have proved to be extremely useful. Unicode's math operator and letter-styling support is what made MathJax (and more generally MathML) possible. They've also helped big time when it comes to accessibility (e.g. screen readers) for mathematics on the internet. Should we have shunted that off to another standard and made the creators of screen readers completely restructure their offerings so they can deal with Unicode characters and "Mathicode" characters? Assuming anyone bothered to implement it, how would that be better than just adding a Unicode category and spending a meager amount of space?

[+] pron|9 years ago|reply

I think that the BMP -- Basic Multilingual Plane or the first 16-bit of Unicode characters -- is pretty reasonable, and covers fairly well everything we may consider as text (all alphabets in current use plus mathematical symbols). Anything beyond that, from emojis and pictograms to ancient Greek musical notation is pretty... weird.

I think it would have made much more sense to have something like image tags: a special codepoint would introduce a link to a URL containing a sequence of glyphs, followed by an index into that sequence. Those glyphs would be guaranteed not to change (in any meaningful way), and devices would be free to cache them. This way, anything that isn't real text, would standardize representation, too, instead of just a vague "meaning". Another standard could relate those glyphs to one another in some way, giving them standard semantics and means of translation (i.e. "Egyptian hieroglyphics"). This would also allow each of those (emojis or hieroglyphics) to evolve their standards independent of a single universal standard that means little.

[+] detaro|9 years ago|reply

Well, if you want to know what they can look like, the blogpost has images, an embedded webfont and links to the reference font for the new symbols. And AFAIK providing reference images that are freely usable is required for all new symbol proposals.

[+] wooptoo|9 years ago|reply

Legitimate question: Why is Unicode littered with all those useless symbols?

I can see the reasoning behind the standard (or very common) symbols or things like emoji, but having every possible glyph in UTF8 seems like a horrible waste.

What if we want to add new glyphs in the next 10 years for emerging standards?

[+] onion2k|9 years ago|reply

having every possible glyph in UTF8 seems like a horrible waste

A horrible waste of what? Unicode 9.0 encodes 128,172 characters, of a possible total 1,112,064 code points. The addressable space is 11.52% full. Clearly there's enough left to keep adding more and more characters for a really long time.

If your complaint is that it's a waste of resources, time, etc - surely it's up to the people who are members of the consortium to decide how they want to spend their energy?

[+] kbart|9 years ago|reply

The usefulness of symbols is subjective. For example, I personally find all these emoji useless, and you clearly stated opposite.

[+] nabla9|9 years ago|reply

At some point, someone realizes that there is need to standardize fixed practical subset of Unicode that contains all essential symbols over the world so that all devices that comply with the standard can __actually__ interchange text in readable, printable and visually presentable form.

It's nice to have catalogue of symbols and tight encoding for them, but full support of Unicode encoding has very little to do with support for Unicode in an application.

🇦 🇧 🇨 🇩 🇪 🇫 🇬 🇭 🇮 🇯 🇰 🇱 🇲 🇳 🇴 🇵 🇶 🇷 🇸 🇹 🇺 🇻 🇼 🇽 🇾 🇿.

[+] kozak|9 years ago|reply

Basic Multilingual Plane?

[+] c3t0|9 years ago|reply

Congratulations! Following through all that work specially with a consortium of such significance is great feat of perseverance.

Thank you for stepping up and making a difference.

[+] edent|9 years ago|reply

Thanks mate :-)

[+] DiabloD3|9 years ago|reply

The only problem I see is OSX/iOS, Windows, and Android don't ship with some universal, but shitty, font that has every single last glyph ever, always immediately updated to the new Unicode standard.

[+] VMG|9 years ago|reply

Hope pause/play/rewind etc come next: http://fontawesome.io/icons/#video-player

[+] nomercy400|9 years ago|reply

You mean 0x23E9 to 0x23FA, just before these new power symbols? I only noticed them because the unicode power symbol site has an image of what comes before their symbols.

[+] dclowd9901|9 years ago|reply

Unicode symbols... seems like we should've developed them the way languages develop: start with the most important symbols, ones for food, water, shelter, danger, etc, then expanded them into the abstract mess they are today.

[+] rabboRubble|9 years ago|reply

Emoji were not developed haphazardly. They evolved naturally in Japan, then were adopted by the rest of the world. That is why there are so many Asia / Japan themes in the standard emoji set. The problem is Westerners don't understand the Japanese emotion behind the symbol. The symbol for bookbag looks exactly like a Japanese school kid's backpack. It's why there is a kimono. Bamboo wind chimes. Tsunami. Shinkansen... I could go on and on.

In some respect, they are getting jumbled up because of international pressures for the base emoji set to be stretched into a be-all for the global market. An example is Taco. There are tacos in Japan. They are hard to find and when you do find one, you definitely don't want to eat one there. Mexican food is one of the rare cuisines the Japanese don't do better.

[+] piotrkubisa|9 years ago|reply

I hope that ligatures will be more popularized than using characters like "½", because it is very difficult to find them in text with standard ASCII characters, i.e. in Firefox by typing 1/2 in quick find (ctrl+f).

[+] systemfreund|9 years ago|reply

From the Unicode 9.0.0 announcement [0]:

> Important symbol additions include:

> 19 symbols for the new 4K TV standard

I am wondering, why did they add symbols for a standard which will become obsolete eventually?

[0]: http://unicode.org/versions/Unicode9.0.0/

[+] bArray|9 years ago|reply

Truly amamzing man!

I was actually wondering about the electrical symbols for logic gates, such as AND, OR, NOR, XOR, NOT, etc. I would hope they were universally accepted by now and would help when writing books or describing logic. A quick Duck Duck search revealed nothing...?

[+] baby|9 years ago|reply

I was wondering why they would have snowmen in the language. And then it occurred to me that maybe, since the unicode set has so much room for characters, that they were planning to allow cross-language communication through emoticons.

Think about it, if you can represent anything human with emoticons. Then you can communicate through emoticons only! Maybe that's what the ancient Egyptians were hopping for?

[+] jgalt212|9 years ago|reply

How long before we need a defusedunicode to protect users and programs from confusion and scams?

https://pypi.python.org/pypi/defusedxml

[+] singularity2001|9 years ago|reply

I first thought they are the ones responsible for this 'character':

﷽ 65021 ﷽ FDFD ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM

http://graphemica.com/%EF%B7%BD

413 comments