Displaying Japanese and English Text on the Web

85 points | polm23 | 7 years ago | nobadmemories.com

38 comments

[+] greglindahl|7 years ago|reply
The advice about utf-8 encoding is a bit incomplete. While adding <meta charset="utf-8" /> is a good idea, it's also important to make sure that the web server isn't sending an HTTP Content-Type header that declares a charset other than utf-8.

In HTML5, the browser is not supposed to sniff the document for a meta charset if the server headers specify a charset in the Content-Type.
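
To make the precedence concrete, here is a rough Python sketch of "header charset wins, meta charset is only a fallback". The function names are made up for illustration, and this is a simplification: the real browser algorithm has more steps (BOM handling, for one) that are left out here.

```python
import re

def charset_from_content_type(header):
    """Return the charset parameter of a Content-Type header value, or None."""
    if not header:
        return None
    match = re.search(r'charset\s*=\s*"?([^\s;"]+)', header, re.IGNORECASE)
    return match.group(1).lower() if match else None

def charset_from_meta(body):
    """Look for a meta charset declaration in the first 1024 bytes, or None."""
    head = body[:1024].decode("ascii", errors="ignore")
    match = re.search(r'<meta[^>]*charset\s*=\s*["\']?([\w-]+)', head, re.IGNORECASE)
    return match.group(1).lower() if match else None

def effective_charset(content_type, body):
    """Header charset first; the meta tag is consulted only if the header has none."""
    return charset_from_content_type(content_type) or charset_from_meta(body)

page = b'<html><head><meta charset="utf-8" /></head></html>'
print(effective_charset("text/html; charset=EUC-JP", page))  # header wins
print(effective_charset("text/html", page))                  # falls back to meta
```

So a page that declares utf-8 in its markup can still be decoded as something else if the server says otherwise.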

[+] kevin_thibedeau|7 years ago|reply
I've been scraping some Japanese sites lately and this has been a minor annoyance. The Content-Type header rarely includes the encoding, and Requests doesn't default to UTF-8, so you get mojibake for both EUC-JP and UTF-8 pages unless you intervene.
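
One way to intervene, as a minimal sketch: decode the raw bytes yourself, trying any declared charset first and then a short list of strict fallbacks. The `decode_html` helper and its fallback order are assumptions for illustration, not anything Requests provides.

```python
def decode_html(raw, declared=None, fallbacks=("utf-8", "euc_jp", "shift_jis")):
    """Decode raw bytes, trying the declared charset first, then the fallbacks.

    Returns (text, codec_name). Decoding is strict, so a wrong guess raises
    and the next candidate is tried instead of silently producing mojibake.
    """
    candidates = ([declared] if declared else []) + list(fallbacks)
    for name in candidates:
        try:
            return raw.decode(name), name
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: bytes are not valid in this codec.
            # LookupError: the declared name is not a codec Python knows.
            continue
    # Last resort: keep going with replacement characters rather than fail.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"

text, codec = decode_html("日本語".encode("euc_jp"))
print(codec, text)
```

With Requests, the chosen codec name could be assigned to `response.encoding` before reading `response.text`, or this helper could be fed `response.content` directly. The strict-fallback order is a heuristic, so short or ambiguous byte strings can still be misidentified.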
[+] zepearl|7 years ago|reply
Am I the only one on Linux whose Firefox can display any characters correctly (in this case the Japanese characters, Arabic or Russian characters on some other pages, etc.), while Chrome can't display them at all? (I've been scanning websites for months while writing a prototype, and it's always been like this whenever I did some random checks in the browsers.)
[+] suspectdoubloon|7 years ago|reply
I had these issues on Fedora, especially around emoji. Firefox was fine but Chrome was not. I ended up having to play around with some font settings to force Chrome to use Noto fonts.
[+] microcolonel|7 years ago|reply
Yeah, you should probably report that somewhere, probably to your distribution maintainer (assuming you're talking about a distro Chromium build, and not Google's Chrome blob).

Google distributes Chrome on Linux on tens of millions of devices every year; it displays fonts just fine for almost everyone.

[+] zerocrates|7 years ago|reply
I haven't had that problem, on the same distro even! I do remember them often differing in what fonts they would select by default, and Chrome often seemed to pick a worse-looking option... but I never really had any problems with the text simply not rendering.
[+] yingw787|7 years ago|reply
I remember in the book "Remote: Office Not Required" by Jason Fried and David Heinemeier Hansson, they mentioned finding internationalization issues early on because globally distributed remote teams naturally dogfood for international audiences. This struck me as one of their key selling points for remote work.
[+] ken|7 years ago|reply
You also, of course, need a company that is receptive to this as a goal. I usually run my computer in whatever language I'm trying to learn at the moment, and I've had more than one company dismiss i18n issues I've discovered as "well, we're not going to sell in that country any time soon!" (i.e., "get back to making the demo look pretty").
[+] Grue3|7 years ago|reply
I had a lot of problems making Japanese text on ichi.moe display correctly. I'm using lang="ja" instead of lang="ja-jp" though; it seems to work and is shorter. The main problem with Japanese characters displayed in a Chinese font (which happens when the lang attribute is not set explicitly) is that some characters are barely recognizable as the same character. Compare 誤 in a Chinese font and a Japanese font. [1] Yep, this is the same character according to Unicode. Imagine if Latin letter g, Cyrillic г and Greek γ had the same Unicode codepoint.

[1] https://en.wiktionary.org/wiki/%E8%AA%A4
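
For anyone who wants to check this, a small Python sketch using only the standard library prints the code point behind each character. The `describe` helper is just a name picked for this example.

```python
import unicodedata

def describe(ch):
    """Format a single character as 'U+XXXX NAME'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# Three similar-looking letters from three scripts: three separate code points.
for ch in ["g", "г", "γ"]:
    print(describe(ch))

# The kanji/hanzi has exactly one code point, whatever language it is used in.
print(describe("誤"))            # U+8AA4 CJK UNIFIED IDEOGRAPH-8AA4
print("誤".encode("utf-8"))      # b'\xe8\xaa\xa4', the %E8%AA%A4 in the URL above
```

Since the text itself carries no language information, the only thing telling the renderer which glyph shape to draw is the lang attribute or the font chosen.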

[+] bgee|7 years ago|reply
Can you explain more about the 誤 example you provided?

When you say it's barely recognizable, do you mean simplified vs traditional? Because to me (as a native Chinese speaker) the Japanese and traditional look almost identical. I can't comment on traditional vs simplified because I can read both.

If it's simplified vs traditional, I wonder why OS/browser prefers to render the character in simplified form (I assume the Chinese font you are using has both styles).

[+] fireattack|7 years ago|reply
lang="ja" absolutely should work [1], so I have no idea why it doesn't on your website.

Maybe it's related to the font you specified, which may directly or indirectly (via fallback) cause the problem. After all, `font-family` overrides language (which essentially just helps to get the right font(s)).

If you don't mind providing a test page, I can help debug.

[1]: https://en.wikipedia.org/wiki/User:Fireattack/sandbox

[+] mrob|7 years ago|reply
Unicode's Han unification makes this more difficult than it needs to be. Now that Unicode has more than 16 bits worth of characters anyway, it looks very much like a mistake.

https://en.wikipedia.org/wiki/Han_unification

[+] jessaustin|7 years ago|reply
One can kind of understand why they did it. A lot of glyphs are the same from one language to another. If the codepoints for the Chinese "大", the Japanese "大", and the Korean "大" were all different, that would generate some complaints. OTOH, a Unicode that includes both "学" and "學" certainly has enough room for several versions of "道". I suspect this will eventually be fixed, but as a series of one-off additional codepoints, not as a giant duplication for each of C, J, K, and V. Some of the problem seems to have been that while the Chinese Unicode experts realized they wanted to write both "学" and "學" in the same document, the Korean experts never considered the possibility of including Chinese or Japanese text in a primarily Korean document. I think by this point they've realized it, though, so it will be fixed eventually.
[+] kijin|7 years ago|reply
Noto fonts are fantastic for multilingual documents, but beware that the CJK versions weigh multiple megabytes each. Loading them as a webfont will eat up a lot of data, not to mention cause a noticeable delay. This is unavoidable when there are 10,000+ characters to encode.

My Korean company website loads a subset of Noto Sans, but uses the system default sans-serif if it is accessed with a mobile device. Fortunately most Koreans don't use Hanja (Kanji/Hanzi) anymore, so visual consistency is not an issue.
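
A rough sketch of the first step of subsetting, in Python: collect the non-ASCII code points a page actually uses and format them as a CSS `unicode-range` value. Both helper names are invented for this example, and the result would still have to be handed to a real subsetting tool (for instance the `--unicodes` option of fontTools' pyftsubset, assuming that is the tool in use) to produce the smaller font file.

```python
def used_codepoints(text):
    """Return the sorted, de-duplicated code points of the non-ASCII characters."""
    return sorted({ord(ch) for ch in text if ord(ch) > 0x7F})

def unicode_range(codepoints):
    """Format sorted code points as a CSS unicode-range value, merging consecutive runs."""
    ranges = []
    for cp in codepoints:
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1][1] = cp          # extend the current run
        else:
            ranges.append([cp, cp])     # start a new run
    return ", ".join(
        f"U+{lo:X}" if lo == hi else f"U+{lo:X}-{hi:X}" for lo, hi in ranges
    )

page_text = "Noto Sans 한국어 웹폰트"
print(unicode_range(used_codepoints(page_text)))
```

A static subset like this only covers text known in advance, so user-generated content would still fall back to the system font for any character outside the subset.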

[+] priansh|7 years ago|reply
Interestingly, Google very quietly shuttered the Google Translate for web plugin that made it possible to autotranslate sites.

The Web Speech API already exists for Speech Recognition and Speech Synthesis, so hopefully they add a translation API directly to Chrome. Can't call it an _inter_net if it doesn't easily support multiple languages!

[+] dspillett|7 years ago|reply
> Google very quietly shuttered the Google Translate for web plugin that made it possible to autotranslate sites.

The feature still seems to be built into Chrome: https://random.spillett.net/stuff/tmp/translate.png

Or was there a plug-in for other browsers too, that is now AWOL?

[+] fouc|7 years ago|reply
Mixed-language webpages are an interesting problem I hadn't thought about.
[+] reaperducer|7 years ago|reply
It's a fascinating problem, if you're into that kind of problem solving.

I recently had to build a large site in English, Spanish, and Chinese, which was fun considering that some of the audience was es-mx, some was es-es, and some was es-419.

[+] sebazzz|7 years ago|reply
Does all this apply to Mandarin / Taiwanese as well?
[+] reaperducer|7 years ago|reply
> For PCs, cursive falls back to Comic Sans

Eep!

[+] jandrese|7 years ago|reply
That has to be awesome when viewing historical documents. On the other hand, kids will have a much easier time reading the Constitution in Comic Sans instead of the original cursive.