Ask HN: Unicode text variants?

[+] patio11|12 years ago|reply

Is it recognized as equivalent by all text-search programs?

The magic words you want to look for are [Unicode canonicalization], which aspires to make that (and other string-comparison needs) actually work. Implementation quality across the universe of programs is... mixed.

[+] Someone|12 years ago|reply

Canonicalization is more restricted and different.

More restricted in that it only treats truly identical strings that have multiple representations the same. Normalization won't turn "foo" and "FOO" into the same string, but it will turn "fòó" and "fo<with grave accent>o<with acute accent>" into the same string.
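A minimal Python sketch of that distinction, using the standard library's `unicodedata` module. The variable names are illustrative only, and NFC is just one of the normalization forms that would work here.

```python
import unicodedata

# "fòó" written two ways: precomposed letters vs. base letters plus combining marks.
precomposed = "f\u00f2\u00f3"   # f, o-with-grave, o-with-acute
decomposed = "fo\u0300o\u0301"  # f, o, combining grave, o, combining acute

# The raw code point sequences differ, so a naive comparison fails.
raw_equal = precomposed == decomposed

# After normalizing both sides to the same form, they compare equal.
nfc_equal = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", decomposed))

# Normalization does not touch case: "foo" and "FOO" stay different.
case_equal = (unicodedata.normalize("NFC", "foo")
              == unicodedata.normalize("NFC", "FOO"))

print(raw_equal, nfc_equal, case_equal)  # False True False
```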

Different in the sense that it creates a new string, rather than comparing two strings. Just as you neither need nor want to do a tolower(s) when comparing case-insensitively, you neither need nor want to normalize Unicode to do a normalization-invariant comparison.

The Unicode standard uses "equivalence" more broadly: under compatibility equivalence, "<fl ligature>" and "fl" are treated as equivalent (see http://en.wikipedia.org/wiki/Unicode_equivalence; Unicode technical report on normalization at http://www.unicode.org/reports/tr15/tr15-18.html).
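A short sketch of the ligature case in Python, again with `unicodedata`. It contrasts a canonical form (NFC) with a compatibility form (NFKC); the variable names are made up for the example.

```python
import unicodedata

ligature = "\ufb02"  # LATIN SMALL LIGATURE FL, a single code point

# Canonical normalization (NFC) leaves the ligature alone...
nfc = unicodedata.normalize("NFC", ligature)

# ...while compatibility normalization (NFKC) expands it to the two letters "fl".
nfkc = unicodedata.normalize("NFKC", ligature)

print(nfc == ligature, nfkc == "fl")  # True True
```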

[+] rchowe|12 years ago|reply

To me, it seems useful in cases where the style is a part of the meaning of the symbol. This mostly comes up in mathematics, where a letter represented in fraktur or blackboard bold has some semantically different meaning, and this meaning can be part of the file instead of part of the formatting of the file.

The practical part of me agrees with patio11: the knowledge gained by having these semantics inherent in the file is offset many times over by having to treat different bytes as the same character semantically.

[+] ihuman|12 years ago|reply

To add to that, some languages use some of the same characters as English, but formatted differently in order to fit that language's rules. For example, Japanese characters are all the same width, so Unicode includes numbers and letters in ｆｕｌｌ　ｗｉｄｔｈ in order to fit in when English is mixed in with Japanese.
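As a rough illustration in Python: the fullwidth forms sit at a fixed offset from their ASCII counterparts, and compatibility normalization (NFKC) folds them back. The helper name `to_fullwidth` is hypothetical, written just for this sketch.

```python
import unicodedata

def to_fullwidth(text: str) -> str:
    """Map printable ASCII to the Halfwidth and Fullwidth Forms block (illustrative helper)."""
    out = []
    for c in text:
        if c == " ":
            out.append("\u3000")             # IDEOGRAPHIC SPACE
        elif "!" <= c <= "~":
            out.append(chr(ord(c) + 0xFEE0))  # U+FF01..U+FF5E mirror U+0021..U+007E
        else:
            out.append(c)
    return "".join(out)

wide = to_fullwidth("full width")

# The fullwidth string shares no code points with the ASCII one...
same_raw = wide == "full width"

# ...but NFKC maps it back to plain ASCII.
same_nfkc = unicodedata.normalize("NFKC", wide) == "full width"

print(same_raw, same_nfkc)  # False True
```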

[+] JohnHaugeland|12 years ago|reply

Those are not actually formatting. Those are math symbols. Math relies on formatting to convey meaning, and Unicode is expected to be able to render math correctly without formatting. Therefore, it must be that math symbols' formatting is actually a feature of the character.

My Ubuntu box can't render the first three so I don't know what they are.

"is there" is blackboard lettering, though Unicode insists on calling it double-struck lettering. You may recognize the capital R in double-struck, ℝ, as the symbol for the Reals. http://en.wikipedia.org/wiki/Real_number Similarly, ℤ is the integers, ℂ is the complexes, et cetera. I can say this on HN, without formatting, because unicode.

The word "unicode" is in Mathematical Bold Italic. The words "for different," which you probably interpret as Fraktur, are ... oh wait unicode calls them mathematical fraktur. http://www.w3.org/TR/MathML2/bycodes.html#U1D58C 𝖆 means a ring group. ℵ is used for the cardinality of infinite sets. 𝖌 is a Lie algebra. Etc.
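If your terminal can't render these either, Python's `unicodedata` module will at least tell you what they are. A small sketch; `describe` is a made-up helper, and the NFKC result is only the plain letter that compatibility normalization maps each symbol to, not a claim about its mathematical meaning.

```python
import unicodedata

def describe(ch: str) -> tuple[str, str]:
    """Return (official Unicode name, NFKC-folded form) for one character."""
    return unicodedata.name(ch), unicodedata.normalize("NFKC", ch)

# Double-struck R, Z, C and two bold fraktur letters mentioned in this thread.
for ch in ["\u211d", "\u2124", "\u2102", "\U0001d586", "\U0001d58c"]:
    name, folded = describe(ch)
    print(f"U+{ord(ch):04X} {name} -> {folded}")
```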

http://en.wikipedia.org/wiki/Mathematical_alphanumeric_symbo...

I don't know what the last set are.

[+] xroche|12 years ago|reply

"why" is squared latin capital ("Enclosed Alphanumeric Supplement" Unicode block -- www.unicode.org/charts/PDF/U1F100.pdf)
"for different" is mathematical bold fraktur ("Mathematical Alphanumeric Symbols" Unicode block -- www.unicode.org/charts/PDF/U1D400.pdf)
"formatting" is mathematical bold script ("Mathematical Alphanumeric Symbols" Unicode block)

[+] greenyoda|12 years ago|reply

HN Search doesn't find "Unicode text variants", so it's probably not a good idea to use these characters in article titles:

https://hn.algolia.io/#!/story/forever/prefix/0/unicode%20te...

What's weird is that if I copy/paste the variant characters from the article title into HN Search, it matches a whole bunch of articles, none of which are this one.

DuckDuckGo finds this post as its first hit if you copy/paste the title into its search box, but not any other articles about "Unicode" or "variants" (just lots of random junk).

Google, on the other hand, apparently canonicalizes the text, since it returns hits on other articles about Unicode and text variants, as well as this post.

So, here we have three text search programs that behave rather differently.

Also, Firefox doesn't find any of the variant text strings on this page if you search for the normal (ASCII) characters.
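For what it's worth, here is a guess at the kind of folding a search tool would need in order to match styled text against plain ASCII. This is a sketch under assumptions, not how any of the engines above actually work; `fold_for_search` and `to_bold_fraktur` are hypothetical helpers.

```python
import unicodedata

def fold_for_search(text: str) -> str:
    """Compatibility-normalize, then case-fold, so styled variants match plain text."""
    return unicodedata.normalize("NFKC", text).casefold()

def to_bold_fraktur(word: str) -> str:
    """Map lowercase a-z to MATHEMATICAL BOLD FRAKTUR SMALL A..Z (U+1D586 onward)."""
    return "".join(chr(0x1D586 + ord(c) - ord("a")) for c in word)

styled = to_bold_fraktur("unicode")

# A naive substring search on the raw text misses the styled word.
naive_hit = "unicode" in styled

# Folding both the query and the document makes the match succeed.
folded_hit = fold_for_search("Unicode") in fold_for_search(styled)

print(naive_hit, folded_hit)  # False True
```

A production search index would more likely apply a folding like this once at indexing time and once per query, rather than on every comparison.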

[+] GeneralMayhem|12 years ago|reply

Chromium's ctrl-f also canonicalizes, at least well enough to recognize this title.

[+] redox_|12 years ago|reply

You're right; MySQL is currently throwing an error (and therefore we don't even send the item to the Algolia engine). Gonna take a look tomorrow or the day after.

Mysql2::Error: Incorrect string value: '\xF0\x9D\x96\x86 m...' for column 'text' at row 1:
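The bytes in that error message are the UTF-8 encoding of U+1D586, which lies outside the Basic Multilingual Plane and so takes four bytes. MySQL's legacy `utf8` character set stores at most three bytes per character (`utf8mb4` handles four), so one plausible cause is a `utf8` column; that is an assumption, since the schema isn't shown here. A Python sketch with a hypothetical pre-insert check:

```python
import unicodedata

# The byte sequence quoted in the MySQL error message.
raw = b"\xF0\x9D\x96\x86"
char = raw.decode("utf-8")

print(f"U+{ord(char):04X}", unicodedata.name(char), len(raw), "bytes")

def needs_four_byte_utf8(text: str) -> bool:
    """True if any code point lies beyond U+FFFF and thus needs 4 bytes in UTF-8."""
    return any(ord(c) > 0xFFFF for c in text)

plain_ok = not needs_four_byte_utf8("Unicode text variants")
styled_flagged = needs_four_byte_utf8(char + " math")
```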

Thank you for the bug report ;)

[+] arikrak|12 years ago|reply

Bing doesn't find it when searched in plain-text, while Google can find it in both their auto-complete suggestions and in Chrome's page search. +1 Google.

[+] EwanG|12 years ago|reply

Anyone else seeing this (latest version of Chrome on Win 8.1) as all boxes? If I highlight and right-click, it will ask me if I want to search for each word, so that I was able to see this is a question about Unicode formatting. I tried setting my fonts to Unicode fonts in the Advanced Settings, but that didn't seem to help.

[+] jonmrodriguez|12 years ago|reply

On latest Chrome in Win 7, I also see all boxes. Interestingly, the page title of the comment thread displays correctly though.

[+] NickNameNick|12 years ago|reply

Somewhat interestingly, to me at least, the title is boxes in the page content but renders correctly in the tab header in Chrome on Windows.

It renders fine in both on my Mac.

[+] bskap|12 years ago|reply

IE and Firefox on Windows 8.1 render it properly. Chrome renders it as all boxes.

[+] ibelimb|12 years ago|reply

Chrome on iOS shows it as all boxes.

[+] nikdaheratik|12 years ago|reply

As rchowe said, the different formats are mostly used in higher math and have semantic meanings specific to that context.

These should almost never be used outside that context. They have different byte values from the usual Roman characters (which means the computer doesn't even see them as equivalent without "help" from the programmer), they may not be supported by every browser or text search program, and they may not render correctly on many operating systems: the glyph may be substituted from a different font, or not drawn at all on older systems.

[+] Elv13|12 years ago|reply

From what I see, applications developed in regions that use a Latin charset but don't speak English tend to handle this better. The reason is that people usually want to search without accented characters and still find what they are looking for. So issues like this one are handled earlier in the dev cycle, as users _will_ complain.
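One common way to get that accent-insensitive behaviour, sketched in Python: decompose with NFD, then drop the combining marks. `strip_accents` is a hypothetical helper name, and the approach only covers letters that actually decompose (a letter like "ø" has no decomposition and passes through unchanged).

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Decompose to base letters plus combining marks, then drop the marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(strip_accents("fòó"))           # foo
print(strip_accents("crème brûlée"))  # creme brulee
```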

[+] arikrak|12 years ago|reply

Cool, this made it to the front page! I just discovered this whole area of Unicode and wanted to see if it would show up here and why it exists. I guess the answers about it being for math make sense.
