
How I added 6 characters to Unicode (and you can too)

129 points| deafcalculus | 9 years ago |righto.com | reply

65 comments

[+] 1wd|9 years ago|reply
The rationale given for including mirrored half-stars as separate codepoints is right-to-left languages. I wondered why this was needed, since Unicode already has a right-to-left mark (RLM)[1].

I found the answer in a comment on "Explain XKCD".[2] The RLM usually only reorders characters, but does not mirror their glyphs. The exceptions are glyphs with the "Bidi_Mirrored=Yes" property, which are mapped to a mirrored codepoint.[3]

The half-stars proposal includes a note on that property: "Existing stars are in the “Other Neutrals” class, so half stars should probably use the ON bidirectional class. The half stars have the obvious mirrored counterparts, so they can be Bidi mirrored. However, similar characters such as LEFT HALF BLACK CIRCLE are not marked as mirrored. I'll leave it up to the Unicode experts to determine if Bidi Mirrored would be appropriate or not."

[1] https://en.wikipedia.org/wiki/Right-to-left_mark

[2] https://www.explainxkcd.com/wiki/index.php/1137:_RTL

[3] http://www.unicode.org/Public/UNIDATA/BidiMirroring.txt
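These properties can be inspected with Python's standard unicodedata module. This is only a sketch: the three characters tested are my own picks (not from the proposal), and the results reflect whichever Unicode version your Python build ships with.

```python
import unicodedata

def bidi_info(ch):
    """Return (name, bidi class, Bidi_Mirrored flag) for one character."""
    return (
        unicodedata.name(ch, "<unnamed>"),
        unicodedata.bidirectional(ch),   # e.g. "ON" = Other Neutral
        bool(unicodedata.mirrored(ch)),  # True if Bidi_Mirrored=Yes
    )

# LEFT PARENTHESIS, BLACK STAR, LEFT HALF BLACK CIRCLE
for ch in ["(", "\u2605", "\u25D6"]:
    print(bidi_info(ch))
```

The parenthesis should report as mirrored, while the star and the half circle should be class ON and not mirrored, which matches the note quoted from the proposal.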

[+] treve|9 years ago|reply
The one I'm surprised about is not the stars, but actually the bitcoin character. It's just a form of branding to me, and while I think there are interesting uses for blockchain technology, public interest seems to be a bit inflated. Plus, blockchain tech will likely outlive bitcoin itself.
[+] sqeaky|9 years ago|reply
It's not like there is some central Bitcoin company so what is the brand? Brands are generally owned by companies and are intellectual property in the eyes of governments.
[+] justinpombrio|9 years ago|reply
It's a currency. I bet that unicode has glyphs for even more obscure currencies.
[+] nacc|9 years ago|reply
It is great to see Unicode being able to encode almost every symbol people can think of, however I am still struggling to make them appear on my screen - is there a good font that has great coverage for Unicode? Many times there are clever uses of Unicode, yet I can only see empty rectangles.
[+] markbao|9 years ago|reply
I love this – but does it bother anyone else that the outlined and filled stars have different sizes? What's the reason behind that?

HN strips the characters out from comments, but they're displayed in the beginning of the article.

[+] treve|9 years ago|reply
Unicode does not dictate how glyphs are presented. It just describes and categorizes them.

So how they look comes from the font that is used. For the proposal these fonts probably didn't exist yet, so it was probably just a (slightly sloppy) photoshop.

[+] doodpants|9 years ago|reply
Wouldn't that depend on the font? They appear the same size in my browser (Firefox), in 15px Arial on Windows 7.
[+] edent|9 years ago|reply
So glad the unicodepowersymbol.com stuff was helpful! We had a lot of fun getting the proposal together.

If anyone wants to submit some new characters, all of our documents are on GitHub https://github.com/jloughry/Unicode

[+] Animats|9 years ago|reply
We need to hold the line somewhere. Preferably before corporate logos get into Unicode. I've seen Facebook and Twitter icons as Unicode characters in the user-definable space. This currently requires a downloaded font, but there's probably some lobbyist somewhere trying to get them into Unicode.

It's getting really complicated. There are now skin-tone modifiers for emoji.

[+] WalterBright|9 years ago|reply
Unicode is turning into a few useful characters amid a sea of junk. This will continue as long as people acquire status by getting "their" symbol(s) into Unicode. I don't see any way this can change.
[+] ygra|9 years ago|reply
Logos can never be encoded because of trademark concerns. So you're safe, there won't ever be a Facebook or Twitter code point.

Skin tone modifiers work pretty much like diacritics already do. It's not complicated and most of the support relies on the font anyway.
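To illustrate the point about modifiers: a toned emoji is just the base character followed by a modifier character, and the font decides whether to draw the pair as one glyph. A minimal sketch, with the thumbs-up emoji and the type-4 modifier chosen arbitrarily as examples:

```python
import unicodedata

thumbs_up = "\U0001F44D"  # THUMBS UP SIGN
modifier = "\U0001F3FD"   # EMOJI MODIFIER FITZPATRICK TYPE-4

toned = thumbs_up + modifier  # base character followed by its modifier

# Two code points in the string; a font with modifier support draws one glyph,
# a font without it falls back to the emoji followed by a colour swatch.
code_points = [f"U+{ord(c):04X}" for c in toned]
print(code_points)
print(unicodedata.name(modifier))
```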

[+] amelius|9 years ago|reply
Perhaps we should have an escape code for SVG in Unicode, so we can describe any missing character.
[+] wxs|9 years ago|reply
Unicode Technical Report #51, which is where Emoji are laid out, talks a bit about the current thinking of the committees on this:[1]

> The longer-term goal for implementations should be to support embedded graphics, in addition to the emoji characters. Embedded graphics allow arbitrary emoji symbols, and are not dependent on additional Unicode encoding. Some examples of this are found in Skype and LINE—see the emoji press page for more examples.

> However, to be as effective and simple to use as emoji characters, a full solution requires significant infrastructure changes to allow simple, reliable input and transport of images (stickers) in texting, chat, mobile phones, email programs, virtual and mobile keyboards, and so on. (Even so, such images will never interchange in environments that only support plain text, such as email addresses.) Until that time, many implementations will need to use Unicode emoji instead.

[1] http://unicode.org/reports/tr51/#Longer_Term

[+] hf|9 years ago|reply
I simply cannot wrap my head around the direction of the Unicode discourse.

We're discussing the appropriate code-point for different smiley faces, obscure electrical symbols[0] or, in the present case, half stars to express film or book ratings, yet we have no complete set of sub- and superscripts!

Am I mistaken in thinking it odd, that there's a complete Klingon alphabet but no representation whatsoever for most Greek or Latin subscripts? Or what if, heaven forbid, I'd want to use a 'b' index/subscript? Tough! Not even the "phonetic extensions", where subscript-i comes from, provides it.

Refer to https://en.wikipedia.org/wiki/Unicode_subscripts_and_supersc... or look for SUBSCRIPT in http://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt

Surely there are one or two actual scientists in the Unicode Consortium? Or even the one odd soul still sporting a notion of consistency who finds it only logical to provide a "subscript b" if there's a "subscript a"?

How am I wrong?

[0] https://news.ycombinator.com/item?id=11958682
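The gaps are easy to list programmatically. A rough sketch using Python's unicodedata.lookup, assuming the subscript letters all follow the naming pattern "LATIN SUBSCRIPT SMALL LETTER X" (true for the ones I know of, but the output depends on the Unicode version bundled with your Python):

```python
import string
import unicodedata

def subscript_letters():
    """Map each Latin letter that has a subscript form to that character."""
    found = {}
    for letter in string.ascii_lowercase:
        name = f"LATIN SUBSCRIPT SMALL LETTER {letter.upper()}"
        try:
            found[letter] = unicodedata.lookup(name)
        except KeyError:
            pass  # no character with that name in this Unicode version
    return found

available = subscript_letters()
missing = [c for c in string.ascii_lowercase if c not in available]
print("available:", "".join(sorted(available)))
print("missing:  ", "".join(missing))
```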

[+] jcranmer|9 years ago|reply
Unicode is not known for its consistency in dealing with these issues. The original idea behind Unicode was to be able to represent every then-extant character set with perfect fidelity (i.e., go from X to Unicode and back, and you should get the same data). Why are there letters like U+212B Angstrom sign (not to be confused with U+00C5 Latin capital A with ring above) or things like half-width and full-width characters? Because they were present in Shift-JIS, not because of any coherent notion of what constitutes a glyph. Han unification was driven more by the need to keep from blowing a space budget than by actual rationalization of whether or not the scripts deserved separate spaces.

Note that Klingon isn't in Unicode (it was explicitly rejected by the UTC, with a vote of 9 in favor of the rejection proposal, 0 against it, and 1 abstaining). Tengwar and Cirth, though, are actually considered serious proposals for Unicode, just really, really low priority compared to, say, Mayan script (for which the first proposal should be going live in 2017). Mayan script is interesting in its own right because it's the script (well, of the ones I'm aware of) that most challenges normal conventions on what constitutes letters and glyphs.
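The round-trip duplicates mentioned above show up directly in normalization. A small sketch with Python's unicodedata module (the full-width letter is my own choice of example): the Angstrom sign is canonically equivalent to A-with-ring, while full-width forms are only compatibility-equivalent to their ASCII counterparts.

```python
import unicodedata

angstrom = "\u212B"     # ANGSTROM SIGN
a_ring = "\u00C5"       # LATIN CAPITAL LETTER A WITH RING ABOVE
fullwidth_a = "\uFF21"  # FULLWIDTH LATIN CAPITAL LETTER A

# Distinct code points, so a naive comparison says they differ.
print(angstrom == a_ring)

# Canonical normalization (NFC) folds the Angstrom sign into U+00C5.
print(unicodedata.normalize("NFC", angstrom) == a_ring)

# Full-width A survives NFC and only folds under compatibility form NFKC.
print(unicodedata.normalize("NFC", fullwidth_a) == "A")
print(unicodedata.normalize("NFKC", fullwidth_a) == "A")
```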

[+] jessaustin|9 years ago|reply
ISTM a great deal of trouble and complication could have been prevented by three special types of NBSP that meant "sub", "super", and "back to normal". It's true that some glyphs will be special-cased by some fonts, but in general the glyph is just shrunk and translated when sub- or super-scripted.
[+] 1wd|9 years ago|reply
The Klingon alphabet was proposed but rejected.

Subscript letters were proposed as well: http://www.unicode.org/L2/L2011/11208-n4068.pdf but apparently "Not accepted: Because this has been controversial and is not directly related to repertoire under ballot, it is not appropriate to add it to Amd1 but may be considered for a future amendment" http://www.unicode.org/L2/L2012/12130-n4239.pdf

Looks like here's a recent draft for a new proposal: https://github.com/stevengj/subsuper-proposal

[+] WalterBright|9 years ago|reply
Super/sub scripts are markup, not characters. There shouldn't be any in Unicode.
[+] gjasny|9 years ago|reply
It would be cool to see the powerline symbols added to Unicode. The necessary user base should already be there.

See: https://github.com/powerline/fonts/blob/master/README.rst

A zsh theme with those characters in use: https://gist.github.com/agnoster/3712874

[+] yes_or_gnome|9 years ago|reply
I have to disagree. All but 3 of those pictographs are already in the Unicode standard. You have to patch fonts because A) your preferred font may not have them and B) to make certain that the font meets Powerline's expectations.

The ones that are "unique" are a bit annoying because they replace defined characters in the Basic Multilingual Plane's Private Use Area (E000-F8FF). Even though the section is "Private Use", it is often already defined by your OS's system font. There are also the Supplementary Private Use Areas A (F0000-FFFFD) and B (100000-10FFFD), which can be overwritten safely.

I scare quote "unique" because two of those characters are full-height arrows; one right-pointing, the other left-pointing. These are already defined as u1F780 (🞀) and u1F782 (🞂). It may be the case that in some fonts the triangles either A) don't actually go from floor to ceiling, or B) have empty space behind their hypotenuse.

The only truly unique character is the "git branch" pictograph. Maybe someone could write up a convincing argument to include it, but I can't imagine one. It's not a symbol you see too often, even in the git community. And I would bet that if you looked hard enough, there's some mathematical symbol that would be suitable.

Just FYI, I've used powerline fonts daily for the past ~3 years.
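For anyone checking where a patched glyph lands, here is a small sketch that classifies a code point against the three private use ranges as defined in the Unicode standard. The helper name is made up for illustration, and U+E0A0 is used on the assumption that it is where Powerline fonts place the branch symbol.

```python
import unicodedata

# Private use ranges as defined by the Unicode standard (inclusive).
PRIVATE_USE_RANGES = {
    "BMP Private Use Area": (0xE000, 0xF8FF),
    "Supplementary Private Use Area-A": (0xF0000, 0xFFFFD),
    "Supplementary Private Use Area-B": (0x100000, 0x10FFFD),
}

def private_use_block(code_point):
    """Return the private use block containing code_point, or None."""
    for name, (lo, hi) in PRIVATE_USE_RANGES.items():
        if lo <= code_point <= hi:
            return name
    return None

# Assumed Powerline branch glyph location; general category "Co" = private use.
print(private_use_block(0xE0A0), unicodedata.category(chr(0xE0A0)))
# The triangle mentioned above is a regular assigned character, not private use.
print(private_use_block(0x1F780))
```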

[+] YeGoblynQueenne|9 years ago|reply
That's great but what we really need (ahem- what I really need) is more maths-y characters, like ∑∏∫∀ and all the sub- and super- scripted letters: ⁱⁿₙᵢ and so on.

I can never find a lower-case Greek subscripted α or β when I need one...

[+] JadeNB|9 years ago|reply
> That's great but what we really need (ahem- what I really need) is more maths-y characters, like ∑∏∫∀ and all the sub- and super- scripted letters: ⁱⁿₙᵢ and so on.

Agreed, but what we need even more than the symbols is some ((La)TeXy, says the mathematician) way of combining them. For example (says the mathematician who doesn't understand the complexity of text encodings), why do we need a whole bunch of separate "subscript m", "subscript n", etc., glyphs, rather than just one "subscript" combining mark?

[+] WalterBright|9 years ago|reply
Unicode is a brilliant idea, but it went off the rails with combining characters, especially when there is both a code point for a character and a combining set of characters that semantically are the same thing.
[+] ygra|9 years ago|reply
How would you solve things without combining characters? Especially the case where you can have multiple diacritics on a letter. Encode every single combination of all of them? Seems a bit wasteful, don't you think?

Precomposed characters exist because they existed in other encodings previously and encoding such characters has been one of the core principles of Unicode to ensure an easy upgrade path. Heck, we inherited box drawing characters that way, which I think are more questionable than combining diacritics.
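The practical consequence of having both forms is that string comparison needs normalization first. A minimal sketch with Python's unicodedata module, using é as an arbitrary example; the helper name is hypothetical.

```python
import unicodedata

precomposed = "\u00E9"  # LATIN SMALL LETTER E WITH ACUTE, one code point
combined = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

def same_text(a, b):
    """Compare two strings after canonical composition (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(precomposed == combined)           # raw comparison sees different code points
print(same_text(precomposed, combined))  # canonically equivalent after NFC
print(unicodedata.normalize("NFD", precomposed) == combined)
```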

[+] kuschku|9 years ago|reply
Unicode is a fuckton of backwards compatibility. That’s the big reason those things exist.
[+] contingencies|9 years ago|reply
The other day I was searching for the words for bronze in Tibetan for research on possible etymologies of some Tibeto-Burman phonetic transliterations into Middle Chinese.[0] (As you do.) Anyway, I found some low resolution entries in scanned dictionaries online without romanization, but was unable to translate these to codepoints to obtain a phonetic approximation, even after using online keyboards, due to the hassles of combining characters. I have studied a lot of abugidas (Tai/Lao/Khmer/etc.) so am not exactly coming at the problem from scratch, either. Also rather shocked that the Tibetan community hasn't managed to put a decent dictionary online yet.

[0] https://en.wikisource.org/wiki/Translation:Manshu/Chapter_7#...

[+] tantalor|9 years ago|reply
What about 1/4, 3/4, 1/5, etc...?
[+] infogulch|9 years ago|reply
Maybe that would be a good case for combining characters. digits+/+digits = fraction, where + is a combining character and digits is digit(+digit).
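For numeric fractions, Unicode already has something close to this: U+2044 FRACTION SLASH, which a renderer may (font support permitting) draw as a stacked fraction when it sits between digit runs. A rough sketch, with a made-up helper name; whether the result actually displays as a fraction depends entirely on the font and shaping engine.

```python
import unicodedata

FRACTION_SLASH = "\u2044"

def make_fraction(numerator, denominator):
    """Build digits + FRACTION SLASH + digits for any integer pair."""
    return f"{numerator}{FRACTION_SLASH}{denominator}"

one_fifth = make_fraction(1, 5)
print([f"U+{ord(c):04X}" for c in one_fifth])

# The precomposed VULGAR FRACTION ONE QUARTER decomposes to the same pattern.
quarter = unicodedata.normalize("NFKD", "\u00BC")
print(quarter == make_fraction(1, 4))
```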
[+] koltaggar|9 years ago|reply
Best part is where you swap Andrew West's first name for Adam
[+] kens|9 years ago|reply
Oops, sorry Andrew! Many apologies! (I watched way too much Batman as a child and "Adam West" is wired into my brain.)