UTF-8 is one of the most momentous yet underappreciated and relatively unknown achievements in software.
A sketch on a diner placemat has led to every person in the world being able to communicate written language digitally using a common software stack. Thanks to Ken Thompson and Rob Pike, we have avoided the deeply siloed and incompatible world that code pages, wide chars, and other insufficient encoding schemes were guiding us towards.
UTF-8 is a great system, but all those dreadful code pages existed because they were under different technical constraints.
Windows machines in the 1990s had several megabytes of main memory, and people could barely get them to support one East Asian language at a time, never mind several. No sane person would propose using three bytes per Korean character when two would do - that would mean your word processor dies after 50 pages of a document while your competitor can handle 75.
And even if you did have UTF-8, you wouldn't see those Thai characters anyway, because who would even have the fonts when your OS must fit in a handful of stacked floppies?
It took years before UTF-8 made technical sense for most users.
It's great as a global character set and really enabled the world to move ahead at just the right time when the web started to connect us all together.
But the whole emoji modifier thing (e.g. guy + heart + lips + girl = one kissing-couple character) is a disaster. Too many rules made up on the fly that make building an accurate parser a nightmare. It should have either been specified strictly and consistently as part of the standard, or left out for a future standard to implement, with separate codepoints used for the combinations that were really necessary.
This complexity has also led to multiple vulnerabilities, especially on mobile devices.
See here all the combos: https://unicode.org/emoji/charts/full-emoji-modifiers.html
This post is a really good illustration of UTF-8. Very clear! The key brilliance of the design is not only the embedding of ASCII in UTF-8, but the fact that nothing in ASCII can appear anywhere else in UTF-8, and more generally that no UTF-8 character's encoding can appear as a substring of another character's encoding. That means that all the byte-oriented libc string functions just work. I wrote this up recently in a StackOverflow answer with some examples: https://stackoverflow.com/a/69756619/659248
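A quick sketch of that substring property, using Python's built-in UTF-8 codec: every byte of a multi-byte sequence has its high bit set, so an ASCII byte can never appear inside one.

```python
text = "naïve/café"
encoded = text.encode("utf-8")

# The ASCII '/' (0x2F) appears exactly where the real '/' character is, and
# nowhere inside the multi-byte encodings of 'ï' or 'é':
assert encoded.count(b"/") == 1

# Every byte of a non-ASCII character's encoding is >= 0x80:
for byte in "ï".encode("utf-8"):
    assert byte >= 0x80
```

This is why byte-oriented searches (strchr, strstr, and friends) stay correct on UTF-8 data.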
It really is wonderful. I was forced to wrap my head around it in the past year while writing a tree-sitter grammar for a language that supports Unicode. Calculating column position gets a whole lot trickier when the preceding codepoints are of variable byte-width!
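A minimal sketch of that column calculation (the helper name is made up for illustration, not tree-sitter's actual API): continuation bytes match `10xxxxxx`, so counting only non-continuation bytes in the byte prefix yields the codepoint column.

```python
def byte_offset_to_column(line: str, byte_offset: int) -> int:
    """Count the codepoints preceding a byte offset in this line's UTF-8 form."""
    prefix = line.encode("utf-8")[:byte_offset]
    # Continuation bytes look like 0b10xxxxxx; count only leading bytes.
    return sum(1 for b in prefix if (b & 0xC0) != 0x80)

# "é" is 2 bytes in UTF-8, so byte offset 3 in "été" is column 2 (0-indexed).
assert byte_offset_to_column("été", 3) == 2
```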
It's one of those rabbit holes where you can see people whose entire career is wrapped up in incredibly tiny details like what number maps to what symbol - and it can get real political!
UTF-8 is one of the most brilliant things I've ever seen. I only wish it had been invented and caught on before so many influential bodies started using UCS-2 instead.
Like anything new, people had a hard time with it at the beginning.
I remember getting a take-home assignment in an interview for a PHP job. The person evaluating my code said I should not have used UTF-8, which causes "compatibility problems". At the time, I didn't know better, and I answered that no, it was explicitly created to solve compatibility problems, and that they just didn't understand how to deal with encoding properly.
Needless to say, I didn't get the job :)
Same with Python 2 code. So many people, when migrating to Python 3, suddenly thought Python 3's encoding management was broken, since it was raising so many UnicodeDecodeErrors.
Only much later did people realize the huge number of programs that couldn't deal with non-ASCII characters in file paths, HTML attributes, or user names, because they just implicitly assumed ASCII. "My code used to work fine", they said. But it worked fine on their machine, set to an English locale, tested only using ASCII plain text files in their ASCII-named directories with their ASCII last name.
Absolutely. At least it’s well supported now in very old languages (like C) and very new languages (like Rust). But Java, JavaScript, C#, and others will probably be stuck using UCS-2 forever.
Fun fact: Ken Thompson and Rob Pike of Unix, Plan 9, Go, and other fame had a heavy influence on the standard while working on Plan 9. To quote Wikipedia:
> Thompson's design was outlined on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike.
(https://en.wikipedia.org/wiki/UTF-8#FSS-UTF)
If that isn't a classic story of an international standard's creation/impactful update, then I don't know what is.
Recently I learned about UTF-16 when doing some stuff with PowerShell on Windows.
In parallel with my annoyance at Microsoft, I realized how long it’s been since I encountered any kind of text encoding drama. As a regular typer of åäö, many hours of my youth were spent configuring shells, terminal emulators, and IRC clients to use compatible encodings.
The wide adoption of UTF-8 has been truly awesome. Let’s just hope it’s another 15-20 years until I have to deal with UTF-16 again…
There are many reasons why UTF-8 is a better encoding but UTF-16 does at least have the benefit of being simpler. Every scalar value is either encoded as a single unit or a pair of units (leading surrogate + trailing surrogate).
However, PowerShell (or more often the host console) has a lot of issues handling Unicode. This has been improving in recent years, but it's still a work in progress.
I never understood why UTF-8 did not use the much simpler encoding of:
- 0xxxxxxx -> 7 bits, ASCII compatible (same as UTF-8)
- 10xxxxxx -> 6 bits, more bits to come
- 11xxxxxx -> final 6 bits.
It has multiple benefits:
- It encodes more bits per octet: 7, 12, 18, 24 vs 7, 11, 16, 21 for UTF-8
- It is easily extensible for more bits.
- Such extra bits extension is backward compatible for reasonable implementations.
The last point is key: UTF-8 would need to invent a new prefix to go beyond 21 bits. Old software would not know the new prefix and what to do with it. With the simpler scheme, they could potentially work out of the box up to at least 30 bits (that's a billion code points, much more than the mere million of 21 bits).
The problem is that UTF-8 has the ability to detect and reject partial characters at the start of the string; this encoding would silently produce an incorrect character. Also, UTF-8 is easily extensible already: the bit patterns 111110xx, 1111110x, and 11111110 are only disallowed for compatibility with UTF-16's limits.
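The truncation-detection point is easy to demonstrate with any strict UTF-8 decoder (here, Python's built-in one): cutting a stream mid-character produces an error rather than a silently wrong character.

```python
snowman = "☃".encode("utf-8")      # b'\xe2\x98\x83', a three-byte sequence
try:
    snowman[:2].decode("utf-8")    # drop the final continuation byte
    raise AssertionError("should not decode")
except UnicodeDecodeError:
    pass                           # the partial character is detected and rejected
```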
Prefix code: The first byte indicates the number of bytes in the sequence. Reading from a stream can instantaneously decode each individual fully received sequence, without first having to wait for either the first byte of a next sequence or an end-of-stream indication. The length of multi-byte sequences is easily determined by humans as it is simply the number of high-order 1s in the leading byte. An incorrect character will not be decoded if a stream ends mid-sequence.
UTF-8 as defined (or restricted) is a prefix code: a decoder gets all the relevant information on the first read, and the rest on the (optional) second. Your scheme requires an unbounded number of reads.
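The length-from-first-byte rule looks like this in practice - the count of high-order 1 bits in the leading byte is the sequence length (a sketch that assumes valid input):

```python
def sequence_length(lead: int) -> int:
    """Length of a UTF-8 sequence, determined from its first byte alone."""
    if lead < 0x80:
        return 1                  # 0xxxxxxx: ASCII, one byte
    n = 0
    while lead & (0x80 >> n):     # count high-order 1 bits
        n += 1
    return n                      # 110xxxxx -> 2, 1110xxxx -> 3, 11110xxx -> 4

# One example from each length class:
assert [sequence_length(s.encode("utf-8")[0]) for s in "aéક😂"] == [1, 2, 3, 4]
```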
> - It is easily extensible for more bits.
UTF-8 already is easily extensible to more bits: either 7 continuation bytes (and 42 bits), or infinitely many. Neither of which is actually useful to its purposes.
> The last point is key: UTF-8 would need to invent a new prefix to go beyond 21 bits
UTF-8 was defined as encoding 31 bits over 6 bytes. It was restricted to 21 bits (over 4 bytes) when Unicode itself was restricted to 21 bits.
The current scheme is extensible to 7×6=42 bits (which will probably never be needed). The advantage of the current scheme is that when you read the first byte you know how long the code point is in memory, and you have fewer branching dependencies, i.e. better performance.
EDIT: another huge advantage is that lexicographical comparison/sorting is trivial (usually the ascii version of the code can be reused without modification).
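A quick sketch of that sorting property: byte-wise comparison of UTF-8 strings gives the same order as codepoint-wise comparison, so ASCII-era sort routines keep working unchanged.

```python
words = ["zebra", "Ångström", "été", "apple", "☃"]

by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))  # memcmp-style order
by_codepoints = sorted(words)   # Python compares str values by codepoint

assert by_bytes == by_codepoints
```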
UTF-8 is self-resynchronizing. You can scan forwards and/or backwards and all you have to do is look for bytes that start a UTF-8 codepoint encoding to find the boundaries between codepoints. It's genius.
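A sketch of that resynchronization scan: from any byte index, step backwards over `10xxxxxx` tail bytes until you hit a byte that starts a codepoint.

```python
def previous_boundary(data: bytes, i: int) -> int:
    """Scan backwards from index i to the start of the current codepoint."""
    while i > 0 and (data[i] & 0xC0) == 0x80:   # skip 10xxxxxx tail bytes
        i -= 1
    return i

data = "x😂y".encode("utf-8")          # b'x' + four emoji bytes + b'y'
assert previous_boundary(data, 3) == 1  # mid-emoji -> back to its header byte
assert previous_boundary(data, 5) == 5  # 'y' already starts a codepoint
```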
Excellent presentation! One improvement to consider is that many usages of "code point" should be "Unicode scalar value" instead. Basically, you don't want to use UTF-8 to encode UTF-16 surrogate code points (which are not scalar values).
> Fun fact, UTF-8's prefix scheme can cover up to 31 payload bits.
It’d probably be more correct to say that it was originally defined to cover 31 payload bits: you can easily complete the first byte to get 7 and 8 byte sequences (35 and 41 bits payloads).
Alternatively, you could save the 11111111 leading byte to flag the following bytes as counts (5 bits each since you’d need a flag bit to indicate whether this was the last), then add the actual payload afterwards, this would give you an infinite-size payload, though it would make the payload size dynamic and streamed (where currently you can get the entire USV in two fetches, as the first byte tells you exactly how many continuation bytes you need).
I spent 2 hours last Friday trying to wrap my head around what UTF-8 was (https://www.joelonsoftware.com/2003/10/08/the-absolute-minim is great, but doesn't explain the inner workings like this does) and completely failed, could not understand it. This made it super easy to grok, thank you!
>NOTE: You can always find a character boundary from an arbitrary point in a stream of octets by moving left an octet each time the current octet starts with the bit prefix 10 which indicates a tail octet. At most you'll have to move left 3 octets to find the nearest header octet.
This is incorrect. You can only find boundaries between code points this way.
Until you learn that not all "user-perceived characters" (grapheme clusters) can be expressed as a single code point, Unicode seems cool. These UTF-8 explanations explain the encoding but leave out this unfortunate detail. The author might not even know it, because they deal with a subset of Unicode in their life.
If you want to split text between two user-perceived characters, not merely between code points, this tutorial does not help.
Unicode encodings are great if you want to handle a subset of languages and characters; if you want to be complete, it's a mess.
You're right, that should read "codepoint boundary" not "character boundary". I can fix that.
I do briefly mention grapheme clusters near the end, didn't want to introduce them as this article was more about the encoding mechanism itself. Maybe a future article after more research :)
Not sure if the issue is with Chrome or my local config generally (bog-standard Windows, nothing fancy), but the US-flag example doesn't render as intended. It shows as "US", with the components in the next step being "U" and "S" (not the ASCII characters U and S - the encoding is as intended, but those characters are rendered in place of the intended flag).
Displays as I assume intended in Firefox on the same machine: American flag emoji then when broken down in the next step U-in-a-box & S-in-a-box. The other examples seem fine in Chrome.
Take care when using relatively new additions to the Unicode emoji set: test to make sure your intentions are correctly displayed in all the browsers you might expect your audience to be using.
They aren't new (2010) - this is a Windows thing - speculation is that it's a policy decision to avoid awkward conversations with various governments (presumably large customers) about TW, PS, and others -- see the long discussion here for instance: https://answers.microsoft.com/en-us/windows/forum/all/flag-e...
Yeah, there's not much I can do there unfortunately (since I'm using SVG with the actual U and S emojis to show the flag). I can't comment on whether it's your config or not, but I've tested the SVGs on iOS and Firefox/Chrome on desktop to make sure they rendered nicely for most people. Sorry you aren't getting a great experience there.
Here's how it's rendering for me on Firefox: https://pasteboard.co/rjLtqANVQUIJ.png
Great explanation. The only part that tripped me up was in determining the number of octets to represent the codepoint. From the post:
>From the previous diagram the value 0x1F602 falls in the range for a 4 octets header (between 0x10000 and 0x10FFFF)
Relying on the diagram in the post felt like a crutch. It seems easier to remember the maximum number of "data" bits that each octet layout can support (7, 11, 16, 21). Then, knowing that 0x1F602 maps to 11111011000000010, which is 17 bits, you know it must fit into the 4-octet layout, which can hold 21 bits.
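The same arithmetic can be done by hand; a sketch of packing U+1F602 into the 4-octet layout `11110xxx 10xxxxxx 10xxxxxx 10xxxxxx` and checking it against the built-in codec:

```python
cp = 0x1F602
encoded = bytes([
    0xF0 | (cp >> 18),           # header octet: top 3 payload bits
    0x80 | ((cp >> 12) & 0x3F),  # tail octets: 6 payload bits each
    0x80 | ((cp >> 6) & 0x3F),
    0x80 | (cp & 0x3F),
])
assert encoded == "😂".encode("utf-8")   # b'\xf0\x9f\x98\x82'
```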
As the continuation bytes always carry the payload in the low 6 bits, Connor Lane Smith suggests writing them out in octal[1]. That 3 octets of UTF-8 precisely cover the BMP is also quite convenient and easy to remember (but perhaps don’t use that the way MySQL did[2]?..).
[1] http://www.lubutu.com/soso/write-out-unicode-in-octal
[2] https://mathiasbynens.be/notes/mysql-utf8mb4
If you’re more into watching a presentation, I recorded “A Brief History of Unicode” last year, and there’s a YouTube recording of it as well as the slides:
https://speakerdeck.com/alblue/a-brief-history-of-unicode-45...
https://youtu.be/NN3g4JbbjTE
Great post and intuitive visuals! I recently had to rack my brain around UTF-8 encoding and decoding when building the Unicode ETH Project (https://github.com/devstein/unicode-eth) and this post would have been very useful
inglor_cz | 4 years ago:
Long live UTF-8. Finally I can write any Central European name without mutilating it.
jvolkman | 4 years ago:
The history of UTF-8 as told by Rob Pike (2003): http://doc.cat-v.org/bell_labs/utf-8_history
Recent HN discussion: https://news.ycombinator.com/item?id=26735958
nayuki | 4 years ago:
Fun fact, UTF-8's prefix scheme can cover up to 31 payload bits. See https://en.wikipedia.org/wiki/UTF-8#FSS-UTF, section "FSS-UTF (1992) / UTF-8 (1993)".
A manifesto that was much more important ~15 years ago when UTF-8 hadn't completely won yet: https://utf8everywhere.org/
CountSessine | 4 years ago:
The awful truth is that there is such a beast. UTF-8 wrapper with UTF-16 surrogate pairs.
https://en.wikipedia.org/wiki/CESU-8
loco5niner | 4 years ago:
I'd like to add a correction. The binary/ASCII/UTF-8 value of 'a' (hex 0x61) is not 01010111, but instead 01100001.
This is used incorrectly in both the Giant reference card, and in the "ascii encoding" diagram above it.
DannyB2 | 4 years ago:
The bytes come out as:
0xF0 0x9F 0x87 0xBA 0xF0 0x9F 0x87 0xBA
but the bits directly above them all show the bit pattern: 010 10111