
Bear plus snowflake equals polar bear

384 points | soopurman | 4 years ago | andysalerno.com

122 comments

tialaramex | 4 years ago
This is definitely an example where "character" was the wrong word. We start out OK with bytes being distinct, but by the end we're talking about how a character is made out of several characters. If we think in terms of code points (Rust's native char type actually provides something slightly different, a Unicode Scalar Value, which certain kinds of code point are not, but close enough), then clearly a code point isn't made out of several code points, so we need a different word.

I like "squiggle". If you're a text rendering engine you might want to use "glyph", although you might already need that word for something else. But try to avoid "character", because that word already has far too many meanings, most of which won't be what you wanted.
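The distinction matters in practice. A minimal Python sketch using the polar bear emoji from the article, showing that byte count, code point count, and "squiggle" count are all different things:

```python
# Polar bear: BEAR FACE + ZERO WIDTH JOINER + SNOWFLAKE + VARIATION SELECTOR-16
polar_bear = "\U0001F43B\u200D\u2744\uFE0F"

print(len(polar_bear.encode("utf-8")))  # 13 bytes of UTF-8
print(len(polar_bear))                  # 4 code points (Unicode scalar values)

# The single "squiggle" the user sees is one extended grapheme cluster;
# counting those requires a UAX #29 segmentation library, not the stdlib.
```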

IncRnd | 4 years ago
You can refer to the definitions of these terms [1]. It's best not to reinvent how they are used; there has been far too much of that already.

[1] http://www.unicode.org/glossary/

PS Be prepared for a long read.

xyzzy_plugh | 4 years ago
> I like squiggle

Go uses "Runes" which is a pretty unambiguous and memorable term.

Though in Go's case runes don't cover ZWJ sequences, as a rune is a single 32-bit code point.
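A rune in Go is just one code point, so a ZWJ sequence spans several runes. The same decomposition is visible in Python, where iterating a string yields one code point at a time, much like ranging over a Go string:

```python
polar_bear = "\U0001F43B\u200D\u2744\uFE0F"

# Each code point fits in 32 bits (one Go rune); the ZWJ sequence is four of them.
for cp in polar_bear:
    print(f"U+{ord(cp):04X}")
# U+1F43B, U+200D, U+2744, U+FE0F
```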

hamilyon2 | 4 years ago
I think "zwjchar" has a chance: it is unambiguous, so as the standard evolves in the future we could still distinguish it. Besides, technical people like precision and cryptic names.
asciimike | 4 years ago
This reminds me of a blog post that I can no longer find that discusses how Chinese often builds a single character by arranging several related components to express a new idea (please link it if you can find it).

I would love to see Unicode allow arbitrary combinations beyond those defined using just ZWJ, for more flexibility (e.g. "blizzard" could be created by something like "snowflake × 5", producing a single character with five snowflakes, without having to define an entirely new character for blizzard as snowflake + ZWJ + snowflake).

As an aside, my favorite ZWJ magic is black flag + ZWJ + skull and crossbones = pirate flag.

Also see https://en.wikipedia.org/wiki/Blissymbols for more symbol language fun.
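The pirate flag really is built that way; the sequence comes straight from Unicode's published list of emoji ZWJ sequences, and is easy to check in Python:

```python
# U+1F3F4 BLACK FLAG + U+200D ZWJ + U+2620 SKULL AND CROSSBONES + U+FE0F VS-16
pirate_flag = "\U0001F3F4\u200D\u2620\uFE0F"
without_zwj = pirate_flag.replace("\u200D", "")

print(pirate_flag)   # renders as a single pirate flag where supported
print(without_zwj)   # falls back to two separate emoji: black flag, then skull
```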

alisonatwork | 4 years ago
The notion that Chinese characters describe ideas formed from smaller characters that also describe ideas is a nice story, but it's not really the case for the majority of Chinese characters. See the Wikipedia page, "Principles of formation" section [0].

Many Chinese characters are constructed from two smaller components: one indicates a general semantic category while the other provides a rough pronunciation hint. (For example, 妈 mā "mother" combines 女 "woman" for the meaning with 马 mǎ "horse" for the sound.) That isn't really enough information to guess either the meaning or the pronunciation if you don't already know the word, but if you do know the word from spoken language, the hints may be enough to recognize the character when you read it.

[0] https://en.wikipedia.org/wiki/Chinese_characters#Principles_...

ampdepolymerase | 4 years ago
For the next Unicode emoji extension, we might as well just take a word vector model and PCA-project it into Unicode's one-dimensional space. Somebody who is better at linear algebra should comment and tell me how wrong I am.
noway421 | 4 years ago
This reminds me of a game I made! Matching emojis to produce a given emoji prompt. I didn't use any complex math though, and my list of combinations was put together by hand.

https://emojibang.com/

raldi | 4 years ago
Haha, juice box = apple + hammer.

dang, it would be cool if mods could mark a post as "allow emoji here".

littlestymaar | 4 years ago
Only tangentially related, but it's a question that has bothered me for a while: is there a reason why composite math symbols aren't part of Unicode? Things like series, limits or integrals look like a really good fit for this kind of composition, and you could get rid of MathJax for 80% of its usage.
1-more | 4 years ago
I gotta think it's because all the math you mention can't render within a normal line height. That's not the end of the world for rendering, but maybe it counts as a bridge too far for Unicode? Or maybe it just boils down to "because it doesn't yet", and as soon as someone makes a good proposal it'll happen.
initplus | 4 years ago
I guess the question is how much typesetting you want to stick into Unicode. It works OK for emoji because the result is always simple: it's just another individual emoji, with the same size and characteristics as the originals.

That said, Unicode is not free of typesetting weirdness; see the character ﷽.

dhosek | 4 years ago
As noted, it's a bit more complicated than just putting things above or below a symbol. Also, if you look closely at the output from MathJax, you'll notice things like the setting of x+y=z having slightly different spacing around the + than around the =. For that matter, properly typeset mathematics will also set the spaces around the colons differently in, e.g., f: X → { y ∊ *R* : |y| < 1 }. Math typesetting is not a simple matter.

I'd also note that the existence of, e.g., ² or ₆ in Unicode is more a result of allowing the mapping of legacy encodings, and that Unicode's general preference is for superscripts and subscripts to be handled not through the encoding but through the application's layout engine, which is why there is no superscripted decimal point or other such characters.
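The legacy status of those characters is recorded in the Unicode Character Database itself: ² carries a compatibility decomposition tagged <super>, marking it as a presentation variant of the plain digit 2. Python's stdlib exposes this directly:

```python
import unicodedata

# SUPERSCRIPT TWO is a compatibility character mapped from legacy encodings.
print(unicodedata.name("\u00B2"))           # SUPERSCRIPT TWO
print(unicodedata.decomposition("\u00B2"))  # <super> 0032  (i.e. a styled "2")
```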
duskwuff | 4 years ago
1) Combining emoji compose into a finite and reasonably small set of combined symbols. The set of mathematical formulas isn't finite in the same kind of way.

2) Layout of mathematical formulas is reasonably complicated. It doesn't make sense to force that complexity to be included in every text layout engine.

yccs27 | 4 years ago
There is UnicodeMath[1], which was developed by Microsoft and is the default representation used by the Word equation editor.

It looks like this:

e=lim┬(n→∞)⁡(1+1/n)^n

f(x)=∑_(k=-∞)^∞〖c_n e^(ⅈkx)〗

∫_(-∞)^(+∞)〖exp⁡(-a/2 x^2) ⅆx〗=√(2π/a)

I find it quite readable, even for quite complicated formulae like the above. You can also replace the Unicode symbols with LaTeX-style escape strings, like \sum or \below.

[1]: https://unicode.org/notes/tn28/UTN28-PlainTextMath-v3.1.pdf

Thev00d00 | 4 years ago
Thank $deity for emojis making sure most user-facing code ends up handling extended code points correctly, and teaching devs about multibyte encodings.

I imagine that in a world without them, a lot of the non-ASCII code paths would not be regularly exercised.
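A classic instance of the bug class that emoji keep exercising: truncating a UTF-8 byte buffer at a fixed length can cut a character mid-sequence. A minimal Python sketch:

```python
bear = "\U0001F43B"          # one code point, four bytes in UTF-8
buf = bear.encode("utf-8")

try:
    buf[:2].decode("utf-8")  # naive byte truncation lands mid-character
except UnicodeDecodeError as e:
    print("truncation split a multibyte sequence:", e.reason)
```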

avalys | 4 years ago
It is so stupid that emojis render differently on different platforms. In particular, there are subtle differences among the facial-expression emojis that make them pretty useless if you're not sure the recipient is on the same platform you're choosing one on.
makeitdouble | 4 years ago
It's a reminder that these are glyphs, on the same plane as words, aimed at portability.

I think what you are looking for is stamps and png/gifs, which are also supported in most relevant chat platforms these days.

SilverRed | 4 years ago
Problem is that each tech company uses a proprietary set of emoji, so they couldn't standardize even if they wanted to. Many instant messaging apps work around this by using their own emoji set for all users, but that is often annoying too, especially when the emoji are ugly.
amelius | 4 years ago
Yes, we could have wide support for inline svg by now.
exporectomy | 4 years ago
Is there a good reason these composite emojis aren't all single code points? They have to at least mostly be separate images to be rendered, so it's not like it's an impossible combinatorial explosion. I remember there are only on the order of 1000 valid combinations.
vitaflo | 4 years ago
I would assume for backward compatibility. If you don't have a font with the updated emoji, you can still get the gist from the component glyphs.
Animats | 4 years ago
There was a flap when people realized that you can combine the zero-width joiner with the "Prohibited" symbol to put that on top of anything you don't like.
harimau777 | 4 years ago
These kind of remind me of ligatures. Not exactly the same, since (as far as I'm aware) ligatures are always based on a direct blending of the two characters' shapes, but still similar.
simiones | 4 years ago
I think programming ligatures kind of break this assumption - particularly != becoming the not equals sign.
egypturnash | 4 years ago
> But as a software developer, it’s always fun to think about edge cases, and squeezing almost 5KB into a 280-“character” tweet is fun

This makes me wonder if anyone has created a version of base64 that uses the vast, sprawling space of unicode to take advantage of these glyph-count-based restrictions.

If they have, I hope they called it uuuniencode.

alisonatwork | 4 years ago
There are several of these. base65536 is the one that seems to pop up the most often on HN, although base2048 is more useful for Twitter. On the GitHub page the dev helpfully links to the various implementations: https://github.com/qntm/base2048
ludocode | 4 years ago
This is what you're looking for:

https://github.com/qntm/base2048

It can store 385 bytes per tweet. The link includes a bit more technical explanation of how Twitter counts characters towards the limit. Apparently, using the entire range of Unicode characters does not improve density, because of the double weighting of emoji and other characters as described in TFA. It links to a base131072 encoding which can only store 297 bytes per tweet.
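The core trick is treating each tweet character as an 11-bit symbol. Here's a toy sketch of the idea in Python; note the real base2048 uses a carefully curated alphabet of characters Twitter counts as weight 1 and handles padding in-band, while this sketch just uses a contiguous run of CJK ideographs as a hypothetical alphabet and tells the decoder the byte length:

```python
# Toy base2048-style packer: 11 bits of input per output character.
# ASSUMPTION: U+4E00.. is used as a stand-in 2048-character alphabet;
# the real base2048 alphabet is hand-picked for Twitter's weighting rules.
ALPHABET_START = 0x4E00

def encode(data: bytes) -> str:
    bits = "".join(f"{b:08b}" for b in data)
    bits += "0" * (-len(bits) % 11)  # pad up to an 11-bit boundary
    return "".join(chr(ALPHABET_START + int(bits[i:i + 11], 2))
                   for i in range(0, len(bits), 11))

def decode(text: str, nbytes: int) -> bytes:
    bits = "".join(f"{ord(c) - ALPHABET_START:011b}" for c in text)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, 8 * nbytes, 8))

msg = b"ZWJ sequences are fun"
packed = encode(msg)
assert decode(packed, len(msg)) == msg
print(len(msg), "bytes ->", len(packed), "characters")  # 21 bytes -> 16 characters
```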

tptacek | 4 years ago
Whoah, it's a crafting game.
dukeofdoom | 4 years ago
Emojis need more serious attention. A lot of online conflicts start from miscommunication due to the lack of facial expressions, to the point that it's driving a social crisis and division in the real world.
quesera | 4 years ago
As a recent observer of a discussion wherein the dramatic differences between single, double, and triple-dot sentence terminators were argued, I think concerns about emoji facial expressions might be valid, but resolving them would still leave us in a world of crisis and division..
bluefirebrand | 4 years ago
This seems like much too large a problem to tackle with emoji.

Even as complex as emoji are becoming, they really still don't address the issues behind online miscommunication. And I really doubt they ever could.

tester34 | 4 years ago
Maybe people need to learn how to express themselves precisely

On the other hand, people still for some reason believe that "lol" only means "laughing out loud", which is lol itself.

so maybe it's impossible?

iforgotpassword | 4 years ago
The Emoji department of the unicode consortium is really just three letter agencies making sure text parsing and font rendering libs will continue to have exploitable bugs. ;-)
yann-gael | 4 years ago
I wonder where the "substitution" of the code point sequence (from a sequence to a single rendered glyph) is done. Very concretely and practically: is it the font doing the substitution? If not, what else? And how is it decided that a sequence should be substituted?
transfire | 4 years ago
Even our character encodings have turned into bloatware.
bmn__ | 4 years ago
Conspiracy theory:

1. Specification authors want to make sure the extended grapheme cluster algorithms are widely adopted so that implementations can correctly deal with Devanagari.

2. They notice no one gives a shit about brown people and their writing systems.

3. Combining emojis, which require the use of the same underlying algorithms, were popularised in order to push adoption.

pkulak | 4 years ago
I'm on Linux right now and, sadly, couldn't see the magic character properly. :(

Anyone know if I'm missing anything, or is there no support for 13.1 yet? My standard routine is to just install every noto font I can find (noto-fonts noto-fonts-cjk noto-fonts-emoji noto-fonts-extra).

xwx | 4 years ago
It works for me with Firefox on NixOS using Noto Color Emoji, from, I'm assuming, noto-fonts-emoji.
littlestymaar | 4 years ago
Works here on a fresh (yesterday) Linux Mint install.
holler | 4 years ago
Really enjoyed it, thank you for sharing. It was interesting to learn how emojis are composed from component parts. I wonder where that will lead in the future. Will we have a vast built-in library of emojis of ever-increasing complexity?