A Localization Horror Story: It Could Happen to You

[+] eloisant|11 years ago|reply

When I was in Japan I did proofreading for a Japanese feature phone. A major Japanese brand, actually. That was really comical.

There was an Australian guy for English, a German guy, an Italian lady, and me for French. What they did prior to the meeting is:

* translate from Japanese to English by Japanese people with a poor English level (maybe the software engineers actually)

* translate from weird English to other languages by translators who had only the strings, absolutely no context.

In the meeting we had all the strings, and one person from the manufacturers who had access to the "super-confidential" unreleased device.

More than half of the translations were off because of lack of context. The French guy actually translated "Garbage day" to something like "Shitty day", apparently he thought that was a way to mark in your calendar that you had a really bad day.

Pretty often we had sentences like "delete one", and invariably one of us had to ask "One what? I need to know if it's masculine/feminine/neuter". Of course they hadn't prepared for that, and it was too late to change the code, so they made us do ugly things like "%n item(s)".

Also the Australian guy was losing faith in humanity:

- That sentence, it's completely wrong, it just doesn't mean anything in English. People will just go "WTF?" when they read that

- We're not allowed to change the English strings, they're already validated

- .....

[+] TeMPOraL|11 years ago|reply

I don't know why nobody seems to put information like "warning: this phone's UI in <your local language> is total and utter crap".

Anyway, what you wrote is exactly why I stick to using all software and webservices - OS, text editors, Facebook, et al. - in en_US instead of my native pl_PL. Because translations are always crappy - even for big players. Lack of context is the key here - translated text often feels out of place, because there usually is some overarching idea behind them that isn't communicated to translators. Then there is lack of consistency. Words in original text often have some site-specific meaning, which tends to also be somehow lost in the translation process. For example, on Facebook the word "like" talks about a well-defined thing, not about the dictionary meaning, so it's totally not ok to randomly replace it with synonyms during translation [0].

I realized at some point that I often look at a crappy translation, guess what the English original was, and then in my mind translate it to what it should have been in the first place. Because for some strange reason I, the user, have the context, and the paid translation team has not. I guess I'm going to put that into my "Translation issues" file in the "Mysteries of capitalism" drawer, right next to the "how on Earth can multi-million media companies not do a movie translation that isn't total crap" file. I mean, seriously, you're better off looking for pirated subtitles even if you bought the original, because pirates at least seem to have watched the movie they're translating.

</rant>

[0] - I wish more translators would use the approach Jehovah's Witnesses used when doing their own Bible translation. Since it was designed to be studied and analyzed, they preferred accuracy over aesthetics - therefore one of the translation rules was "as much as possible, let's have any given word in original text be always represented by the same word in English". Adhering to that single rule would eliminate like half of the "context missing" problems with software translations.

[+] emcrazyone|11 years ago|reply

@eloisant: I work in automotive where I deal with translations for automotive clusters for one of the largest auto makers. Automotive companies are going to what are called reconfigurables, which are basically instrument clusters with no mechanical gauges; just a screen with gauges rendered by a 3D engine. Center stacks too.

I kid you not, the way we translate is to use Google Translate as a first pass and then the screens get reviewed by people who know the language. I guess some things slip through, and we evidently pissed off a lot of Chinese folks because of a similar flub. The folks doing the first pass don't know any other languages.

It's quite comical but also a real pain. One of the things I had to work into our code was the left-to-right vs. the right-to-left; you would think it would be just a C-style string but we have to know for text justification semantics.
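To illustrate why direction has to be known before justifying text, here is a minimal Python sketch. It assumes a "first strong character" heuristic; `base_direction` and `justify` are hypothetical names for illustration, not anything from the HMI code described above, and the padding counts code points rather than rendered width.

```python
import unicodedata


def base_direction(text: str) -> str:
    """Guess the base direction from the first strongly directional character."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):  # Hebrew-type and Arabic-type letters
            return "rtl"
        if bidi == "L":  # Latin, Cyrillic, CJK and other left-to-right letters
            return "ltr"
    # Assumption: fall back to left-to-right when only digits or punctuation appear.
    return "ltr"


def justify(text: str, width: int) -> str:
    """Pad towards the natural starting edge of the text (in logical order)."""
    if base_direction(text) == "rtl":
        return text.rjust(width)
    return text.ljust(width)
```

A real cluster would measure glyph widths and run the full bidirectional algorithm; the sketch only shows that the string alone does not tell you which edge to align to until you inspect its characters.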

I don't actually do translations but work on the HMI where the text is displayed. Another pain point is that we have a certain space where text needs to be displayed and everything is fitted using English but after translating to other languages, some strings are much longer than the allotted space.

[+] jdcryans|11 years ago|reply

> and me for French

> The French guy actually translated "Garbage day" to something like "Shitty day"

So you translated it to "jour de merde"?

[+] ademarre|11 years ago|reply

Your experience speaks to this observation:

With very few exceptions, consumer hardware companies are bad at software.

[+] dkbrk|11 years ago|reply

As far as I can tell, the best tool for localisation almost nobody is using is http://www.grammaticalframework.org/. Licensing is a mix of GPL, BSD and MIT pieces.

It's a high-level functional programming language with a dependent type system specialised for operating on language ASTs. Its resource library, to quote, "covers the morphology and basic syntax of currently 29 languages: Afrikaans, Bulgarian, Catalan, Chinese, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hindi, Japanese, Italian, Latvian, Maltese, Nepali, Norwegian bokmål, Persian, Polish, Punjabi, Romanian, Russian, Sindhi, Spanish, Swedish, Thai, Urdu."

In essence, once it has the language-independent AST, it can produce output in all its supported languages with the correct tenses, genders, inflections, etc.

It also seems to have tools for assisted parsing, so you could have an English document and interactively parse it into the correct AST. In addition, the text can be parameterised semantically, so if you changed the gender of a person, that could propagate to all the correct locations and update the translations as required.

While it seems the upfront cost may be quite high in having to learn such a complex system, I think the benefits of having reproducible, high-quality outputs into n languages for free could make this highly advantageous in many applications.

[+] canjobear|11 years ago|reply

I'm very skeptical that this would work outside of toy examples, though it depends on what is meant by language-independent AST.

For example, the best way to translate Spanish "X dió un golpe a Y" would be "X hit Y". But my naive idea of what the AST for the Spanish sentence would look like would be something like `(GIVE (X HIT Y))`, which when naively transduced to English would be "X gave a hit to Y", which is either unidiomatic or means the wrong thing altogether. In order to avoid this problem, the AST would have to be a more abstract representation of the semantics. And coming up with a sufficiently expressive, tractable, and neutral representation of natural language semantics is an unsolved problem that people are still devoting their whole careers to.
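The gap between the two representations can be shown with a toy Python sketch. Everything here is hypothetical: the `Give` and `Hit` node types and the single rewrite rule in `abstract` are made up for illustration and have nothing to do with how Grammatical Framework actually models this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Give:
    """Surface-level node mirroring the Spanish "dar un golpe a"."""
    giver: str
    thing: str
    recipient: str


@dataclass(frozen=True)
class Hit:
    """More abstract, semantic node."""
    agent: str
    patient: str


def naive_english(node: Give) -> str:
    # Word-for-word linearisation of the surface tree.
    return f"{node.giver} gave a {node.thing} to {node.recipient}"


def abstract(node: Give):
    # Hypothetical normalisation rule: treat "give a hit" as the event "hit".
    if node.thing == "hit":
        return Hit(node.giver, node.recipient)
    return node


def english(node) -> str:
    if isinstance(node, Hit):
        return f"{node.agent} hit {node.patient}"
    return naive_english(node)
```

With `tree = Give("X", "hit", "Y")`, `naive_english(tree)` gives "X gave a hit to Y", while `english(abstract(tree))` gives "X hit Y". One rule is easy; the hard part is the open-ended set of such rules a neutral representation would need.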

I was briefly involved in a very early stage startup that was considering using systems like this for better machine translation. We ran into problems like the above, and also: ambiguity, and the fact that the hand-written grammars and semantic representation systems were just very brittle and incomplete.

[+] schoen|11 years ago|reply

That sounds like a great technical approach, although it can't necessarily remove the problem of idiomaticity that the article mentions (with the example of "I didn't search any directories"). Probably better examples are possible, in any case where the most idiomatic way to express something isn't a literal translation of that thing from other languages. Maybe like

English "I don't care"

Portuguese "tanto faz" (literally 'so much does')

German "[das] ist mir egal" (literally '[it] is equal for me')

[+] neilk|11 years ago|reply

Side note: as you might expect, Wikipedia's internationalization is the only system that attempts to do quantities and other formatting correctly for every goddamn language on the planet, but is considerably easier for translators to work with than the OP's examples (sorry, Sean ;)

I did some work on bringing it to JavaScript and making it HTML-aware, and since then Santhosh Thottingal has vastly extended it and it's become pervasive at Wikipedia. More projects should use it, or at least learn from it.

Demo: http://thottingal.in/projects/js/jquery.i18n/demo/

Github: https://github.com/wikimedia/jquery.i18n

[+] bmn_|11 years ago|reply

The Russian demo does not work. Must be 1 котёнок, 2 котёнка.

[+] Concours|11 years ago|reply

I've been looking for something like this for weeks with no luck till now. I guess I should have searched for internationalization instead of localization in this specific case on Google and GitHub. This is a great starting point, thanks a bunch for sharing. Perfect for me.

[+] reidrac|11 years ago|reply

Slightly OT, my favourite localization error was in Ubuntu when they had that nice netbook interface (that later would become Unity). The network icon label was "Rojo" in the Spanish localization, which is the word for the color red. What?

Well, if you translate "Net" to Spanish you get "Red"; and if you translate that again (by mistake), you get "Rojo". There you are :)

[+] tjradcliffe|11 years ago|reply

I once got a Spanish translation back and was briefly angry that the translator had left a few strings as "TODO"...

"What?!" I asked myself. "To do? Why did they ship me an incomplete file?"

Then I looked to see what the "untranslated" strings were: "ALL" (I don't speak Spanish but have enough of a smattering of European languages that it was immediately obvious what was going on.)

[+] hpshelton|11 years ago|reply

When I worked on Outlook.com, we got feedback from a British user that he couldn't understand what the option to "Connect devices and apps with DAD" meant. Turns out we had forgotten to prevent localization of the "POP" protocol everywhere it was used.

[+] baby|11 years ago|reply

OT means Out of Topic? In French we say HS (for Hors Sujet).

[+] psykovsky|11 years ago|reply

That's an error only an amateur translator would make. Guess who makes free software translations...

A professional translator makes sure to check the context of the translation; they don't go blindly translating sentences and words without context.

[+] bmn_|11 years ago|reply

Article is from 1998 and very much out of date. Read: http://blogs.perl.org/users/aristotle/2011/04/stop-using-mak...

[+] EvaK_de|11 years ago|reply

Should be reflected in the title, if possible, neh?

[+] reacweb|11 years ago|reply

It is funny to have such a good article with so many comments at the end that disagree. The issue does not seem to be so clear cut.

[+] whizzkid|11 years ago|reply

It made me remember the Norwegian customer I had.

I needed to write Norwegian localization strings in a YAML file which did not work for some reason.

After 4 hours of debugging, the problem was:

The "no:" key (for "Norwegian") in the YAML file was parsed as a boolean, which broke the application.

[+] mcphage|11 years ago|reply

Yeah, I've run into that, also. Really annoying. I had to quote all of the "no": strings. Looked ugly, but what can you do?
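For anyone who hasn't been bitten by this: under YAML 1.1 rules, a plain (unquoted) `no` resolves to a boolean, so a locale key meant to be the string "no" becomes `False`. The Python sketch below only imitates that resolution step with a regular expression; it is not a YAML parser, `resolve_key` is a made-up name, and the word list is the one commonly implemented by YAML 1.1 parsers (the spec's single-letter `y`/`n` forms are left out).

```python
import re

# Plain scalars that YAML 1.1 parsers commonly resolve to booleans.
YAML11_BOOL = re.compile(
    r"^(?:yes|Yes|YES|no|No|NO|true|True|TRUE|false|False|FALSE"
    r"|on|On|ON|off|Off|OFF)$"
)
TRUE_WORDS = {"yes", "true", "on"}


def resolve_key(token: str):
    """Imitate how a YAML 1.1 parser resolves a mapping key token."""
    if len(token) >= 2 and token[0] == token[-1] and token[0] in "'\"":
        return token[1:-1]  # quoted scalars always stay strings
    if YAML11_BOOL.match(token):
        return token.lower() in TRUE_WORDS
    return token
```

So `resolve_key("no")` comes back as `False`, while `resolve_key('"no"')` stays the string "no", which is why quoting the key is the usual workaround.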

[+] gldalmaso|11 years ago|reply

In order to avoid these pitfalls, usually I get out of "sentence" mode to "label" mode. For instance: "Directories scanned: 12". Probably not well suited for all cases, but usually good enough for mine, though actually I only have to support pt-BR, es-ES and en-US so maybe that's not saying much.

[+] MichaelGG|11 years ago|reply

Exactly. All this work, or just restructure the message. It should be acceptable in most languages, because charts and spreadsheets aren't going to have per cell labels. And it has the benefit of being easier to read and parse.

Also, it's a really terrible style to use first person in an app unless it's actually sentient. Otherwise it's annoyingly like Clippy, or just plain obnoxious and presumptive.

[+] dmytrish|11 years ago|reply

Kudos to the author of the article for his perseverance in decorating messages to fit grammar. I'd go another way, just using a more formal, dry format:

    Number of scanned directories: %g
    Number of found files: %g

That solves the problem with Slavic languages at least. The Italian aversion to 0 may be mitigated by printing 'none', I guess. Please correct me if this form does not fit other languages.
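A minimal Python sketch of this "label: value" approach, assuming a plain per-locale label table. `LABELS` and `report` are hypothetical names, and the German strings are only there to show a second locale.

```python
# One label per fact; the number never has to agree grammatically with a noun.
LABELS = {
    "en": {
        "dirs": "Number of scanned directories",
        "files": "Number of found files",
    },
    "de": {
        "dirs": "Anzahl der durchsuchten Verzeichnisse",
        "files": "Anzahl der gefundenen Dateien",
    },
}


def report(locale: str, dirs: int, files: int) -> list[str]:
    """Build the report lines; the same code path serves every locale."""
    labels = LABELS[locale]
    return [f"{labels['dirs']}: {dirs}", f"{labels['files']}: {files}"]
```

`report("en", 1, 0)` yields "Number of scanned directories: 1" and "Number of found files: 0", with no special case for 0, 1 or any other count.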

[+] placebo|11 years ago|reply

Just scanned the comments to see if anyone would suggest that :) Being the lazy type, that's the first thought I had reading the article. Perhaps it's not appropriate for all target audiences and all target languages but many times you can find a much easier solution by going about it in a totally different way.

[+] xorcist|11 years ago|reply

This is my preferred solution as well. As a bonus, it makes parsing the data so much easier, if you ever need it.

[+] mrfoto|11 years ago|reply

Funny how Slovene seems to tick all the complications checkboxes :D We have 4 grammatical numbers (singular, dual, plural for 3 and 4, plural for 5 and above), they repeat at mod 100 (so 101 is singular, 102 dual,…), it's an inflectional language with 3 grammatical genders, sentence should take a different form depending on whether the user is male or female,…
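The number rule described above can be sketched in a few lines of Python. The function name and the form labels are assumptions made for illustration; the rule itself follows the description in this comment (1 and 101 singular, 2 and 102 dual, 3 and 4 one plural, everything else the other plural).

```python
FORM_NAMES = ("singular", "dual", "plural for 3 and 4", "plural for 5 and above")


def slovene_plural_index(n: int) -> int:
    """Return which of the four Slovene number forms the count n selects."""
    r = n % 100  # the pattern repeats every hundred
    if r == 1:
        return 0
    if r == 2:
        return 1
    if r in (3, 4):
        return 2
    return 3
```

Note that 0, 11 and 100 all land in the last form, so a two-way `n == 1` test gets most counts wrong for this language.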

[+] theoh|11 years ago|reply

The two Turkish letters dotted and dotless i are often confused by users of poorly localised software. Wikipedia links to a murder case allegedly caused by this: http://en.wikipedia.org/wiki/Dotted_and_dotless_I

A real horror story.

(Less seriously, Unicode has counterintuitive case-changing behaviours with those letters. If you are working outside the Turkish locale and uppercase a dotless I and then lowercase it, it gains a dot. I am curious about this design decision, since it seems like a basic error in operating at the level of glyphs rather than symbols. Or maybe the opposite.)

[+] lmm|11 years ago|reply

Upper and lower casing can't be assumed to be inverse; there are plenty of other cases where they will change (e.g. precomposed characters that don't have a precomposed upper case). The correct lower-casing of "I" in English is definitely "i"; the correct upper-casing of "ı" in English is maybe a wrong question, because it just isn't an English letter, so I guess you could argue for leaving it unchanged, but converting it to "I" is probably what the person who wrote "ı" would want to happen when it was upper-cased. Maybe?
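The non-inverse behaviour is easy to reproduce in Python, whose `str.upper()` and `str.lower()` use the default, locale-independent Unicode mappings. The `round_trip` helper is just a name for this sketch; Turkish-aware casing would need a locale-aware library, which is not shown here.

```python
def round_trip(s: str) -> str:
    """Upper-case and then lower-case with the default Unicode mappings."""
    return s.upper().lower()


dotless_i = "\u0131"   # ı LATIN SMALL LETTER DOTLESS I
sharp_s = "\u00df"     # ß has no single-character upper case by default
dotted_cap = "\u0130"  # İ LATIN CAPITAL LETTER I WITH DOT ABOVE

print(round_trip(dotless_i) == "i")        # True: the letter gained a dot
print(round_trip(sharp_s) == "ss")         # True: ß -> SS -> ss
print(dotted_cap.lower() == "i\u0307")     # True: i plus a combining dot above
print(len(dotted_cap.lower()))             # 2: even the length changes
```

None of the three strings survives a case change unchanged, which is why comparing identifiers by upper- or lower-casing them is fragile across locales.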

[+] radnor|11 years ago|reply

The Turkish I problem is seriously annoying. We recently tried adding Turkish translations to our website, only to find that .NET suddenly treats DataRow("ID") differently than DataRow("id") when your locale is set to Turkish.

[+] pavel_lishin|11 years ago|reply

Link to the case in question: http://gizmodo.com/382026/a-cellphones-missing-dot-kills-two...

[+] masklinn|11 years ago|reply

> Less seriously, Unicode has counterintuitive case-changing behaviours with those letters. If you are working outside the Turkish locale and uppercase a dotless I and then lowercase it, it gains a dot.

AFAIK the only solution would be to error out when uppercasing a dotless i in a non-Turkish locale. Which I'm not sure sounds better. Or going back in time and creating a separate category of i and I for the Turkish script.

[+] olau|11 years ago|reply

Just in case people are wondering about the horror story: use ngettext which is a function in the gettext library.

[+] gulpahum|11 years ago|reply

Indeed, this is a solved problem. Use ngettext, see its page for all different language variations (even the Slovenian four different forms) https://www.gnu.org/software/gettext/manual/html_node/Plural...
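For reference, a minimal sketch of the ngettext call using Python's standard `gettext` module. The message text and the `scanned_message` helper are assumptions for illustration. With no catalog loaded, `NullTranslations` falls back to the English rule (singular only for n == 1); with a real catalog, the language's own plural rule picks the form.

```python
import gettext


def scanned_message(n: int, translations=None) -> str:
    """Format a count using the plural form selected by ngettext."""
    if translations is None:
        # No catalog: behaves like untranslated English.
        translations = gettext.NullTranslations()
    template = translations.ngettext(
        "%(n)d directory scanned",    # msgid (singular)
        "%(n)d directories scanned",  # msgid_plural
        n,
    )
    return template % {"n": n}
```

`scanned_message(1)` returns "1 directory scanned" and `scanned_message(0)` returns "0 directories scanned". The caller never branches on the count; that decision lives with the catalog.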

[+] DangerousPie|11 years ago|reply

Interesting, but I am wondering if it is really worth going through all this trouble, just to support a few edge cases.

Personally, I usually don't even notice small mistakes like "1 directories" (or similar mistakes in my native language). Sometimes I will see the correct version somewhere and think "Oh, nice that they thought of that" but I definitely don't expect it.

Are the possible returns of having a "perfect" translation really high enough to justify investing in a much more complex system? I am sure translators who can code functions instead of just putting values into an Excel table will come at quite a premium as well...

[+] PinguTS|11 years ago|reply

You may not notice it. But there are other people who do.

That is the difference between a very well designed product and a product that just does the job.

That is the reason why engineers should not design interfaces. That should be left to UI/UX experts. Just the other day I, as an engineer myself, complained to another engineer that his product does the job nicely and looks OK, but it was missing the little twist of a finished product that I'd really like to use. That's because the interface was designed the way I would have done it myself, because of a lack of UI/UX knowledge.

[+] Zarkonnen|11 years ago|reply

Huh. I actually took a stab at solving this problem using language generation with my final year [project](https://github.com/Zarkonnen/A-Natural-Language-Generator-fo...) at university.

[+] ajuc|11 years ago|reply

Another thing:

"Are you sure you want to quit?" in a Slavic language will have the gender of the user embedded. You need to know it to address the user correctly.

You can do a stilted "To the person that uses this program - are you sure you want to quit?", but that's insane. So everybody just uses the male version.

[+] userulluipeste|11 years ago|reply

I don't know, in Russian either "ты" (singular/informal) or "вы" (plural/formal) works fine for both genders. Now, for your example, Google translated result is "Вы уверены, что хотите выйти?" which seems fine to me!

[+] bobbles|11 years ago|reply

"Are you sure you want to Cancel?"

[OK] or [Cancel]

[+] eCa|11 years ago|reply

Also, if you 'localize' something using Google Translate[1], please let the user choose language somewhere in the app.

For example, the Hostelworld iOS app[2] requires the user to change language for the entire device. For something of a language perfectionist like me, that leaves the app virtually useless.

[1] Translating to English from other languages works fine for me.

[2] https://itunes.apple.com/us/app/hostelworld.com-hostels-budg...

[+] cj|11 years ago|reply

I created Localize.js (https://localizejs.com), a localization SaaS.

Pluralization is a challenge, but we're able to solve this with some pretty simple HTML tags.

For example:

Localize.js identifies the <var> tag with the pluralize attribute, and pluralizes the phrase to any language (including languages like Arabic which can have 6 different plural forms).

[+] mfenniak|11 years ago|reply

Huh. Localize.js sounds really cool. Fantastic idea.

Can you explain your example a little more? What would translators see in this case for Arabic; would they need to provide three translations? And if there were two variables, nine translations?

"Localization" also implies a lot more than just translation. Does Localize.js handle work like culture-specific number and date formatting? Different collation of records for different languages? How does it identify application-generated text versus user data (eg. on a blog, does it translate blog comments entered by readers, or just text like "Please enter a comment below")?

[+] julie1|11 years ago|reply

You don't solve the inflexion problem. You would need something where the lemmatization and tokenization of the sentence are easily accessible to a grammar engine. Something horrible like: <div ><lemm="personnal pronoun">I</lemm> <lemm=verb>have</lemm>< <case=accusative quant=3><lemm=noun quantifier /><lemm=....</div>

[+] smhg|11 years ago|reply

Now, I don't know the state of the gettext utilities in 1999, but the arguments don't seem to hold up anymore (as others commented).

It just surprises me how many times gettext is discarded as "not solving the problem" while it gets many things right.

It feels like the lack of knowledge about the complexity of i18n/l10n and about gettext are often the real issues.

[+] lmm|11 years ago|reply

I've long found that "externalized" translations in po files (or any equivalent) are more trouble than they're worth, for exactly this reason. Translations need to be functions, so they need to be written in a format that's good for writing functions - i.e. a programming language. What we want is a MessageSource interface, and a bunch of language-specific implementations.

Fortunately I work in Scala, so it's very easy to have an "embedded DSL" that's ordinary, first-class code but not much harder for non-technical translators to read or write than the .po format; we can write helpers for grammatical case or numbers or similar. But having the full power of a programming language there means that when you hit a case you haven't thought of (and you will), you can fall back to just an if/else.
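The "translations are functions" idea can be sketched as follows. This is in Python rather than Scala, and the `MessageSource` protocol, its single method and the message wording are all assumptions made for illustration, not the actual interface described above.

```python
from typing import Protocol


class MessageSource(Protocol):
    """One method per message; each language supplies its own implementation."""

    def directories_scanned(self, n: int) -> str: ...


class English:
    def directories_scanned(self, n: int) -> str:
        noun = "directory" if n == 1 else "directories"
        return f"Scanned {n} {noun}."


class French:
    def directories_scanned(self, n: int) -> str:
        # French treats 0 and 1 as singular, so the rule is n >= 2, not n != 1.
        plural = n >= 2
        noun = "répertoires" if plural else "répertoire"
        participle = "analysés" if plural else "analysé"
        return f"{n} {noun} {participle}."


def report(messages: MessageSource, n: int) -> str:
    return messages.directories_scanned(n)
```

`report(English(), 0)` gives "Scanned 0 directories." while `report(French(), 0)` gives "0 répertoire analysé."; each language keeps its own if/else, and the caller sees one interface.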

[+] luminarious|11 years ago|reply

For websites, http://l20n.org seems the most natural version so far. Or is there something better?

[+] barrystaes|11 years ago|reply

I don't agree with the article. The author goes about manually implementing localisations, and eventually throws out GNU gettext. But it DOES have excellent plural support, and a header in your PO file allows Chinese to use "nplurals=1; plural=0;" for example: http://localization-guide.readthedocs.org/en/latest/l10n/plu...

Or use plurals as such: https://www.gnu.org/software/gettext/manual/html_node/Transl...

249 comments