Life Before Unicode

[+] sheetjs|4 years ago|reply

Pre-Unicode issues still haunt us today, kept alive by various file formats that rely on system encoding.

Under the Apple "Mac-Roman" encoding [1], the standard MacOS encoding before OSX switched to Unicode, byte 0xBD currently is capital omega (U+03A9 Ω). However, in the original 1994 release of the character set, they erroneously mapped to the ohm sign (U+2126 Ω) Apple eventually fixed this in 1997, as noted in the changelog:

    #       n04  1997-Dec-01    Update to match internal utom<n3>, ufrm<n22>:
    #                           Change standard mapping for 0xBD from U+2126
    #                           to its canonical decomposition, U+03A9.

However, in 1996, Microsoft copied over the mac encoding to CP10000 using the incorrect character [2]. Unfortunately the codepage was not corrected when Apple realized their mistake.

This discrepancy leads to a huge number of strange issues with various versions of Excel for Mac (BOM-less CSV, SYLK and other plaintext formats default to system encoding) and other software that use Microsoft's interpretation of Apple's Mac-Roman encoding rather than Apple's official character set mapping.

[1] http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/ROMAN.T...

[2] http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/MAC/RO...

[+] WalterBright|4 years ago|reply

Dealing with multiple code pages was terrible, ugly, impossible, and simply awful.

Unicode was a great salvation. Until the Unicode spec wandered off into nutburgerland:

1. embedding invisible semantic information into the encodings

2. multiple encodings that mean the same thing

3. moved beyond standardizing existing alphabets, and wandered into a never-ending stream of people inventing all sorts of glyphs and campaigning to get them into Unicode

4. set language back 2000 years by deciding that hieroglyphs were better than phonetics

5. conflating 𝖋𝖔𝖓𝖙𝖘 with alphabets

6. the crazy concept of "levels" of Unicode compatibility

all resulting in it simply being impossible to correctly support Unicode in your programs, or design a font to display Unicode.

[+] lifthrasiir|4 years ago|reply

Unicode is just a cumulation of what already existed in one form or another, not an entirely new mistake.

> 1. embedding invisible semantic information into the encodings

> 2. multiple encodings that mean the same thing

Sure, ISO/IEC 8859 back then had no invisible characters nor composed characters, did it? [1]

> 3. moved beyond standardizing existing alphabets, and wandered into a never-ending stream of people inventing all sorts of glyphs and campaigning to get them into Unicode

> 4. set language back 2000 years by deciding that hieroglyphs were better than phonetics

Emoji business is surely ugly (and I have complained about this a lot), but the very reason that Unicode has emoji is that Japanese telcos have exposed their priprietary encodings to the wild email and both Apple and Google had to implement them for the compatibility. Blame them, not Unicode.

> 5. conflating 𝖋𝖔𝖓𝖙𝖘 with alphabets

Yeah, you will be surprised to hear that Unicode encodes both the Latin letter A and Cyrillic letter A separately. (I think you had already said that they should be unified in the past, but I couldn't find that reply.)

> 6. the crazy concept of "levels" of Unicode compatibility

Most Unicode conformance rules are optional (past discussion: [2]).

Also if you should display a Unicode text in your program and you are not using a stock library, it is up to you to decide what to support because they exist for the reason. No BiDi support? You now disregard right-to-left scripts (or left-to-right scripts if your program is for RTL users). No combining character support? Expect to lose a significant portion of potential users around the world. Probably a colored font support is fine to ignore, but that doesn't mean most other things can be ignored without the cost.

[1] https://en.wikipedia.org/wiki/ISO/IEC_8859-8#Code_page_layou...

[2] https://news.ycombinator.com/item?id=26904739

[+] grlass|4 years ago|reply

RE point 4, I believe that this was initially a descriptivist rather than prescriptivist step.

People were using emoticons, showing that emoji might be desirable. Some cell phone manufacturers/carriers added their own emoji [1]. They were very popular. Unicode added emoji to the standard, since lots of people were using it.

Not adding emoji would have been _more_ prescriptivist, and since it was desired anyway would have just caused fragmentation.

[1] https://blog.emojipedia.org/correcting-the-record-on-the-fir...

[+] quickthrower2|4 years ago|reply

Genuine question: can you explain how phonetics can be used in a character encoding for text?

[+] account42|4 years ago|reply

7. Han unification

[+] q-rews|4 years ago|reply

3. Yes, sadly.

4. Can you expand on that?

[+] yoavm|4 years ago|reply

Try this with Hebrew... Because of the RTL nature of the language, we had ISO-8859-8 and ISO-8859-8-I, which as a child I always thought of as "Inverted". The characters would render the same, but going backwards. When entering a website you never knew if the author wrote the text backwards so you can present it as it is, or they wrote it normally so you need to flip it. And I can still recall some websites using CP852, back from the DOS era. Entering a website really did start with about a minute of fiddling with the encoding.

[+] Ndymium|4 years ago|reply

The note about IRC reminds me of troubles that faced Finnish IRC users when UTF-8 got more popular. Since IRC networks generally didn't (and don't) have actual support for different encodings and mostly just deal with byte streams, this brought a lot of issues with Ä, Ö, Å, €, and some other more esoteric characters. Naturally a lot of Ã¤ and � abounded.

An interesting consequence was that channels that had non-ASCII characters in their names were split into two, since to the network they were different characters. I remember taking over a couple of channels by creating the UTF-8 versions of them and waiting for people to slowly migrate over.

Getting all of this to work correctly was quite difficult. With Irssi and similar terminal clients, you'd have to correctly configure not just the client but also screen, $TERM, locale, and your terminal. Even if you had a correctly configured UTF-8 client, you could still have problems. Since you can't tell 8-bit single-byte encodings apart other than by heuristics, typically you would have your client attempt to treat a message as UTF-8 and if it's invalid, use a fallback option like ISO-8859-15 (latin-9). But here's the fun thing about that: since IRC networks only deal with byte streams, they may truncate an UTF-8 message in the middle of a multibyte character. This would fail to be detected as valid UTF-8 and would use the fallback option, leading to mojibake for seemingly no reason.

All of this lead to quite some resistance to UTF-8 on the channels I was on. It was deemed the problems were bigger than the new possibilities. I mean, we could speak Finnish/English just fine and there was usually no need for any other languages. Eventually UTF-8 won, especially when mIRC released a version without an option to change the encoding.

[+] satyanash|4 years ago|reply

> It was deemed the problems were bigger than the new possibilities. I mean, we could speak Finnish/English just fine and there was usually no need for any other languages.

It also excluded a large part of the world from participating on IRC. Maybe if it was more proactive, IRC would have a larger role to play than it does now?

Another example is punycode. unicode URLs are super unreadable, so no wonder the rest of the earth's population doesn't care enough about the open web, the importance of URLs vs Apps.

[+] distances|4 years ago|reply

Our still-alive IRC channel never migrated. You still need to enter \xE4 for the join command to get the correct ISO-8859-1 channel. I think the stream of unintentional visitors stopped around the UTF-8 migration, though that could've also been effect of the decline of IRC use in general public.

[+] anthk|4 years ago|reply

>ISO-8859-15

Ditto in Spanish;, I still have issues at IRC-Hispano with some users as I have UTF-8 with everything.

[+] eloisant|4 years ago|reply

Yes, I moved from France to Japan in 2001 and at the time it was almost impossible to have French accents and Japanese characters on the same OS. The web was where it was working best (because web browsers know to switch encoding for pages) but desktop apps that were still the norm were a shitshow.

Even on Linux, to have support for Japanese language you had to install a Japanese distribution (TurboLinux or RedHat Japanese version for example) and it was a real pain to get accents to work.

Worse, Japanese language had 3 encodings!! One for Windows (SJIS), one for Unix/Linux (EucJP) and ISO-2022-JP. And of course, Japan being Japan, Japanese companies were really lagging to switch to Unicode and stuck with their shitty encodings for a long time.

[+] pezezin|4 years ago|reply

> And of course, Japan being Japan, Japanese companies were really lagging to switch to Unicode and stuck with their shitty encodings for a long time.

Not were, are. I'm currently living in Japan, and most of the emails I receive are encoded in Shift-JIS or ISO-2022-JP. Many websites too. Thankfully modern software is quite robust and displays everything properly, but I still get the random mojibake from time to time.

[+] necovek|4 years ago|reply

Encodings are the easy part. To render stuff, you need fonts, and sometimes they are built into hardware (especially with dot-matrix printers, for instance, but even laser printers embed a couple of fonts, and likely custom Japanese fonts too), and software expects fonts with a particular encoding too. All of these were issues even for "simpler" scripts like Cyrillic (Serbian).

Some of the problems with fonts remain even with Unicode (due to Han unification, different CJK regions will prefer different fonts; there might be OpenType locl-aware fonts these days, but they'd be huge).

[+] xvilka|4 years ago|reply

The only left task is to get rid [1] of UTF-16 usage.

[1] http://utf8everywhere.org/

[+] nanis|4 years ago|reply

One thing they got wrong is the fact that upcase(downcase('İ')) is not reliably 'İ' and downcase(upcase('ı')) is not reliably 'ı' without extra assumptions.

That is, Unicode is missing

(1) Unicode Character 'LATIN CAPITAL LETTER DOTLESS I' (????)

and

(2) Unicode Character 'LATIN SMALL LETTER I WITH DOT ABOVE' (????)

or some such so that upcase('ı') could be mapped unambiguously to (1) and downcase('İ') could be mapped unambiguously to (2). Of course, they would have identical glyphs with 'I' and 'i', respectively.

I haven't investigated if there was a rationale to save two codepoints, but handling documents that contain an unknown mix of Turkish and other languages becomes rather weird. For example, "sık sık" becomes rather inappropriate after going through downcase∘upcase and "kilim" maps to "kılım" after going through the same mapping.

Most software rely on the current locale to make these decisions. That is, of course, not feasible in systems that process documents received from arbitrary sources.

[+] account42|4 years ago|reply

> I haven't investigated if there was a rationale to save two codepoints, but handling documents that contain an unknown mix of Turkish and other languages becomes rather weird.

The main reason is probably that ISO-8859-9 / Windows-1254 already only had a single I and i so it would be impossible to know which Unicode character to convert them to. Still might be worth it to add your suggested codepoints though so that uppercase/lowercase that is not locale-aware can at least work correctly for new text.

[+] bmn__|4 years ago|reply

Other languages have similar problems, although not as striking. It does not make sense to special-case Azeri, Crimean Tatar, Kurdish, Turkish, Tatar as you propose.

The real problem is giving in to the notion that having an unknown mix of languages in a text is an acceptable state of affairs. It's not; we should recognise the problem for what it is and work towards a future where it occurs less. Documents carry metadata about the language content, e.g. HTML and OpenDocument, precisely so that the algorithms which depend on a language like case-folding or word matching work correctly.

The situation is analogous to having a document with an unknown/undeclared text encoding: rely on the metadata where supplied, otherwise make a guess, perhaps informed by statistical properties of the text (e.g. uchardet) rather than system locale. It's not ideal, but works more often than not.

[+] ajuc|4 years ago|reply

Even for Latin-script-based languages pre-unicode times were hard. Polish had about a dozen encodings: https://pl.wikipedia.org/wiki/Kodowanie_polskich_znak%C3%B3w

With the first 5 (except Mac) still in common use in 90s and the first 3 still used when unicode appeared.

Most programmers in Poland in 90s remembered how encoding errors looked like between the most popular encoding pairs because it was a constant struggle.

[+] Grustaf|4 years ago|reply

I remember making vocabulary lists for Russian and Greek in Latex with emacs on Linux, before unicode, it wasn't easy. Definitely spent more time getting it to work than actually using them.

[+] program|4 years ago|reply

Italian alphabet is very simple. Only add some accented letters to the base english alphabet (àèéìòù and uppercase variants). Still today I see a lot of Latin 1 / UTF-8 mixed errors. That's the reason why is very common to see "E'" in official documents instead of "È" and so on.

[+] q-rews|4 years ago|reply

No, the reason for E’ is that it doesn’t appear on the keyword and there’s no obvious way to make it work.

You’ll also occasionally see PERCHè for the same reason.

Also non-digital natives don’t really know the difference between É and E’ on paper, they all look the same.

[+] girishso|4 years ago|reply

This reminded me of CDAC's GIST, for enabling Indian languages on computers, worked on it 2 decades back. https://www.cdac.in/index.aspx?id=mlc_gist_about

[+] coldacid|4 years ago|reply

>Did they add seamless transcoding for old files from DOS? Of course not! A Russian version of Windows effectively used two different encodings: native Windows parts used Windows-1251, while the DOS subsystem still used CP866.

To be fair, for those of us in the west, it was similar, if not quite as bad. CP437 and Windows-1252 were nearly as different. Woe to you if you opened a DOS text file in Notepad, especially if it had ASCII or box art, because it'd look all messed up, and vice versa.

[+] jaza|4 years ago|reply

An informative and entertaining journey through the world of Russian character encodings, of which I previously knew nothing. Thanks for sharing!

[+] Joker_vD|4 years ago|reply

> That encoding is, to put it politely, mildly insane: it was designed so that stripping the 8th bit from it leaves you with a somewhat readable ASCII transliteration of the Russian alphabet, so Russian letters don’t come in their usual order.

No, it is not insane. It simply continues the traditional way of creating the national Morse alphabets — almost all of them are transliterations of the international Morse code. Historical continuity is quite a harsh mistress.

[+] inglor_cz|4 years ago|reply

When I was in the first year of the university (1996), my friend programmed a utility for conversion between six different encodings used in the Czech language space: CP852, KeybCS2, CP1250, ISO8859-2, KOI-8čs a ASCII.

Just look at this page for the impression of how many times had the wheel been reinvented in Czech and Slovak language space...

http://vorisekd.wz.cz/seznam3.htm

[+] amelius|4 years ago|reply

> When the USSR broke down, the market was quickly captured by Western hardware and software because no one was making Soviet hardware anymore anyway.

Is that still the case? Are there any large Soviet hardware manufacturers?

[+] manuelmagic|4 years ago|reply

The second note at the bottom of the page says: "That hardware was too outdated to keep producing in any case, and couldn’t complete in a free market."

So I'd say no, there aren't any large Soviet hardware manufacturers anymore.

[+] throwaddzuzxd|4 years ago|reply

> people who use languages with ASCII alphabets exclusively may think it's unjustified

And what languages would that be? (given that even some borrowed words in English have the letter é for instance)

[+] lonelygirl15a|4 years ago|reply

And that's why the ' character is in ASCII.

Print a lowercase e, backspace, and grave accent ', and you have printed é.

Underscore, ^, and other characters are there specifically for overstriking.

Not many terminals do that any more, though...

[+] ChrisArchitect|4 years ago|reply

this post is based on author's experience -- Unicode issues and encoding problems have continued to be a thing well into the 2010s despite unicode gradually becoming the web defacto (in the West anyways) as web 2.0 ran it's course. I haven't worked on a many international language sites in 4 or 5 years as much but so many of those other regions still have allll of the encoding issues and challenges.

[+] ChrisMarshallNY|4 years ago|reply

When it comes to Unicode, Joel Spolsky's screed is still a classic (note that it was done before UTF-8 became the de facto standard): https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...

[+] bellyfullofbac|4 years ago|reply

An old story was someone from France wanting to send the Harry Potter book to someone in Russia. They got their address over email, but in the wrong encoding: https://unicodebook.readthedocs.io/definitions.html#mojibake

[+] vintermann|4 years ago|reply

By comparison, recently I bought some small homemade electronics thing from Germany. The sender had written Ålesund on the package. Deutsche Post, or possibly whatever system Posten bought from the lowest bidder, declared that Å isn't a real character, so they dropped it. Unfortunately Lesund is also a place.

After it had been stuck in the sorting facility for three weeks, I called them up and eventually reached a guy who could explain that oh yes, the system does that and the package gets stuck in a loop, but eventually a human will look at it.

It did in fact arrive eventually.

[+] simonh|4 years ago|reply

I just find it funny that western Europeans can't tell the difference between Cyrillic writing and random gibberish. By which I genuinely don't mean anything pejorative, in either direction.

[+] gpderetta|4 years ago|reply

the most amazing bit is that the postal employees reversed the bad encoding and delivered the package correctly!

[+] chana_masala|4 years ago|reply

> The address was deciphered by the postal employees and delivered successfully.

Wow, I want to know more about how the postal worker who deciphered it! On a computer, sure just convert it. But on paper, how did they work it out?

[+] RicoElectrico|4 years ago|reply

And at no point did the sender stop and think "this doesn't look like Russian". I mean, it's relatively easy for me, who speaks none of the following languages, to tell appart Chinese, Japanese and Korean.

I guess this is consolating for Americans who get blamed for assuming the rest of the world is like their country (postal codes, address formats, date formats, person naming conventions, business customs).

[+] squiggleblaz|4 years ago|reply

The address there is

Russia, Moscow, (postcode), (maybe a street name), (some number), (untranslated) Svetlane

So it seems to be backwards: The personal name is last, and the country/city is first. Does someone here perhaps know if that is the customary order of address in Russia, or if it was reversed too?

[+] q-rews|4 years ago|reply

I hate and love that. Someone must have seen those characters enough times to have become good at deciphering them.

What a waste of time for the postal system though, I’m surprised they didn’t just return to sender.

171 comments