neild | 1 year ago

In addition, Chinese characters encode more information than English letters, so a text written in Chinese will generally consume fewer bytes than the same text in English even when using UTF-8.

(Consider: Horse is five letters, but 馬 is one character. Even at three bytes per character, Chinese wins.)
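The byte counts in that example are easy to check. A minimal sketch, using only Python's built-in `str.encode`; the variable names are just for illustration:

```python
# Compare UTF-8 byte counts for the two words from the comment above.
english = "horse"
chinese = "馬"

english_bytes = len(english.encode("utf-8"))  # ASCII letters take 1 byte each
chinese_bytes = len(chinese.encode("utf-8"))  # this CJK character takes 3 bytes

print(english, english_bytes)
print(chinese, chinese_bytes)
```

This only covers one word pair, so it illustrates the claim rather than proving it for text in general.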

Panzer04|1 year ago

Presumably that derives from the overhead of encoding an English letter as a full byte? Given there are only 26 letters normally, you could fit each one into 5 bits instead, which funnily enough does actually line up with the Chinese character encoding (5 letters × 5 bits = 25 bits vs. 1 character × 24 bits).
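The arithmetic can be sketched as follows. The 5-bit packing here (`pack_5bit`) is a hypothetical scheme made up for illustration, covering lowercase a-z only; it is not a real encoding standard:

```python
def pack_5bit(word: str) -> int:
    """Pack lowercase a-z letters into one integer, 5 bits per letter.

    Hypothetical scheme for illustration: 'a' -> 0, 'b' -> 1, ..., 'z' -> 25.
    """
    value = 0
    for ch in word:
        code = ord(ch) - ord("a")
        if not 0 <= code < 26:
            raise ValueError(f"unsupported character: {ch!r}")
        value = (value << 5) | code  # shift in the next 5-bit code
    return value


bits_english = 5 * len("horse")               # 5 letters at 5 bits each
bits_chinese = 8 * len("馬".encode("utf-8"))   # 3 UTF-8 bytes at 8 bits each

packed = pack_5bit("horse")
print(bits_english, bits_chinese, packed.bit_length())
```

With that packing the two come out nearly even, at 25 bits against 24, though a practical format would also need spaces, case, and punctuation.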

kps|1 year ago

Yes. It's the non-Latin alphabets that lose with either UTF-8 or UTF-16, compared with stateful ISO 2022 page switching.
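A rough way to see this for one non-Latin alphabet, using Cyrillic as the example. Python has no generic ISO 2022 page-switching codec for Cyrillic, so this sketch assumes the single-byte `iso8859_5` codec as a stand-in for what one switched-in 8-bit page would cost per letter; the escape sequences that ISO 2022 needs for switching are not counted:

```python
# "horse" in Russian; every letter is in the Cyrillic block.
word = "лошадь"

utf8_bytes = len(word.encode("utf-8"))       # 2 bytes per Cyrillic letter
utf16_bytes = len(word.encode("utf-16-le"))  # 2 bytes per letter, no BOM
page_bytes = len(word.encode("iso8859_5"))   # 1 byte per letter (stand-in for a switched page)

print(utf8_bytes, utf16_bytes, page_bytes)
```

Under these assumptions the single-byte page uses half the space of either Unicode encoding for this word.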

kstrauser|1 year ago

True, but even then you wouldn’t want to store it egregiously badly.