item 30259097

How UTF-8 Works

359 points | SethMLarson | 4 years ago | sethmlarson.dev

190 comments

[+] bhawks|4 years ago|reply
UTF-8 is one of the most momentous yet underappreciated and relatively unknown achievements in software.

A sketch on a diner placemat has led to every person in the world being able to communicate written language digitally using a common software stack. Thanks to Ken Thompson and Rob Pike, we avoided the deeply siloed and incompatible world that code pages, wide chars, and other insufficient encoding schemes were guiding us toward.

[+] yongjik|4 years ago|reply
UTF-8 is a great system, but all those dreadful code pages existed because they were under different technical constraints.

Windows machines in the 1990s had several megabytes of main memory, and people could barely get them to support one East Asian language at a time, never mind several. No sane person would propose using three bytes per Korean character when two would do - that would mean your word processor dies after 50 pages of a document while your competitor's handles 75.

And even if you did have UTF-8, you wouldn't see those Thai characters anyway, because who would even have those fonts when your OS had to fit on a handful of floppies?

It took years before UTF-8 made technical sense for most users.

[+] GekkePrutser|4 years ago|reply
It's great as a global character set and really enabled the world to move ahead at just the right time when the web started to connect us all together.

But the whole emoji-modifier thing (e.g. guy + heart + lips + girl = one kissing-couple character) is a disaster. Too many rules made up on the fly make building an accurate parser a nightmare. It should either have been specified strictly and consistently as part of the standard, or left out for a future standard to implement, with separate codepoints used for the combinations that were really necessary.

This complexity has also led to multiple vulnerabilities, especially on mobiles.

See here all the combos: https://unicode.org/emoji/charts/full-emoji-modifiers.html

[+] inglor_cz|4 years ago|reply
As a young Czech programming acolyte in the late 1990s, I had to cope with several competing 8-bit encodings. It was a pure nightmare.

Long live UTF-8. Finally I can write any Central European name without mutilating it.

[+] StefanKarpinski|4 years ago|reply
This post is a really good illustration of UTF-8. Very clear! The key brilliance of the design is not only the embedding of ASCII in UTF-8, but the fact that nothing in ASCII can appear anywhere else in UTF-8, and more generally that no UTF-8 character can appear as a substring of another character’s encoding. That means that all the byte-oriented libc string functions just work. I wrote this up recently in a StackOverflow answer with some examples: https://stackoverflow.com/a/69756619/659248
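That substring property can be checked in a couple of lines. A Python sketch, with `bytes.find`/`bytes.split` standing in for the libc byte functions:

```python
# Byte-level search and split on UTF-8 data work, as with libc's
# strstr/strtok, because an ASCII byte (0x00-0x7F) never occurs inside
# a multi-byte sequence: continuation bytes are 0x80-0xBF and lead
# bytes of multi-byte sequences are 0xC2-0xF4.
text = "α/β/γ"
data = text.encode("utf-8")

# Every byte of a non-ASCII character has the high bit set.
assert all(b >= 0x80 for ch in "αβγ" for b in ch.encode("utf-8"))

# So splitting the raw bytes on an ASCII delimiter cannot cut a character.
parts = data.split(b"/")
assert [p.decode("utf-8") for p in parts] == ["α", "β", "γ"]
```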
[+] ahelwer|4 years ago|reply
It really is wonderful. I was forced to wrap my head around it in the past year while writing a tree-sitter grammar for a language that supports Unicode. Calculating column position gets a whole lot trickier when the preceding codepoints are of variable byte-width!
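That byte-offset-to-column calculation reduces to counting non-continuation bytes; a minimal Python sketch (the helper name is mine, not tree-sitter's):

```python
def byte_offset_to_column(line: bytes, offset: int) -> int:
    """Codepoint column of a byte offset: count the bytes before it
    that are NOT continuation bytes (0b10xxxxxx)."""
    return sum(1 for b in line[:offset] if b & 0xC0 != 0x80)

line = "héllo".encode("utf-8")   # 'h' is 1 byte, 'é' is 2 bytes
assert byte_offset_to_column(line, 3) == 2   # past 'h' and the 2-byte 'é'
```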

It's one of those rabbit holes where you can see people whose entire career is wrapped up in incredibly tiny details like what number maps to what symbol - and it can get real political!

[+] cryptonector|4 years ago|reply
And stayed ASCII-compatible. And did not have to go to wide chars. And it does not suck. And it resynchronizes. And...
[+] mark-r|4 years ago|reply
UTF-8 is one of the most brilliant things I've ever seen. I only wish it had been invented and caught on before so many influential bodies started using UCS-2 instead.
[+] BiteCode_dev|4 years ago|reply
Like anything new, people had a hard time with it at the beginning.

I remember that I got a home assignment in an interview for a PHP job. The person evaluating my code said I should not have used UTF8, which causes "compatibility problems". At the time, I didn't know better, and I answered that no, it was explicitly created to solve compatibility problems, and that they just didn't understand how to deal with encoding properly.

Needless to say, I didn't get the job :)

Same with Python 2 code. So many people, when migrating to Python 3, suddenly thought Python 3's encoding management was broken, since it was raising so many UnicodeDecodeErrors.

Only much later did people realize the huge number of programs that couldn't deal with non-ASCII characters in file paths, HTML attributes, or user names, because they just implicitly assumed ASCII. "My code used to work fine," they said. But it worked fine on their machine, set to an English locale, tested only using ASCII plain-text files in their ASCII-named directories with their ASCII last name.
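A minimal Python sketch of that failure mode: the same data that "worked fine" under an implicit ASCII assumption blows up the moment a name is non-ASCII:

```python
# The classic failure: the data is fine, the implicit ASCII assumption isn't.
name = "Bücher.txt"
raw = name.encode("utf-8")

try:
    raw.decode("ascii")          # what "worked on my machine" really did
    error = None
except UnicodeDecodeError as e:
    error = e
assert error is not None         # the first non-ASCII byte is where it dies

assert raw.decode("utf-8") == name   # declaring the encoding fixes it
```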

[+] josephg|4 years ago|reply
Absolutely. At least it’s well supported now in very old languages (like C) and very new languages (like Rust). But Java, Javascript, C# and others will probably be stuck using UCS-2 forever.
[+] jjice|4 years ago|reply
Fun fact: Ken Thompson and Rob Pike of Unix, Plan 9, Go, and other fame had a heavy influence on the standard while working on Plan 9. To quote Wikipedia:

> Thompson's design was outlined on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike.

If that isn't a classic story of an international standard's creation/impactful update, then I don't know what is.

https://en.wikipedia.org/wiki/UTF-8#FSS-UTF

[+] SethMLarson|4 years ago|reply
I knew that Ken Thompson had an influence but wasn't aware of Rob Pike, what a great fact! Thanks for sharing this :)
[+] filleokus|4 years ago|reply
Recently I learned about UTF-16 when doing some stuff with PowerShell on Windows.

Parallel with my annoyance with Microsoft, I realized how long it’s been since I encountered any kind of text-encoding drama. As a regular typer of åäö, many hours of my youth were spent configuring shells, terminal emulators, and IRC clients to use compatible encodings.

The wide adoption of UTF-8 has been truly awesome. Let’s just hope it’s another 15-20 years until I have to deal with UTF-16 again…

[+] ChrisSD|4 years ago|reply
There are many reasons why UTF-8 is a better encoding but UTF-16 does at least have the benefit of being simpler. Every scalar value is either encoded as a single unit or a pair of units (leading surrogate + trailing surrogate).
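That single-unit-or-surrogate-pair rule is easy to verify in Python (the helper function is illustrative, not a real API):

```python
import struct

def utf16_units(ch: str) -> list[int]:
    """Big-endian UTF-16 code units for one scalar value."""
    data = ch.encode("utf-16-be")
    return [u for (u,) in struct.iter_unpack(">H", data)]

assert utf16_units("A") == [0x0041]      # BMP character: one unit
units = utf16_units("😂")                # U+1F602: two units
assert len(units) == 2
assert 0xD800 <= units[0] <= 0xDBFF      # leading surrogate
assert 0xDC00 <= units[1] <= 0xDFFF      # trailing surrogate
```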

However, Powershell (or more often the host console) has a lot of issues with handling Unicode. This has been improving in recent years but it's still a work in progress.

[+] pierrebai|4 years ago|reply
I never understood why UTF-8 did not use the much simpler encoding of:

    - 0xxxxxxx -> 7 bits, ASCII compatible (same as UTF-8)
    - 10xxxxxx -> 6 bits, more bits to come
    - 11xxxxxx -> final 6 bits.
It has multiple benefits:

    - It encodes more bits per octet: 7, 12, 18, 24 vs 7, 11, 16, 21 for UTF-8
    - It is easily extensible for more bits.
    - Such extra bits extension is backward compatible for reasonable implementations.
The last point is key: UTF-8 would need to invent a new prefix to go beyond 21 bits. Old software would not know the new prefix or what to do with it. With the simpler scheme, it could potentially work out of the box up to at least 30 bits (that's a billion code points, much more than the mere million of 21 bits).


[+] LegionMammal978|4 years ago|reply
The problem is that UTF-8 has the ability to detect and reject partial characters at the start of the string; this encoding would silently produce an incorrect character. Also, UTF-8 is easily extensible already: the bit patterns 111110xx, 1111110x, and 11111110 are only disallowed for compatibility with UTF-16's limits.
[+] nephrite|4 years ago|reply
From the Wikipedia article:

Prefix code: The first byte indicates the number of bytes in the sequence. Reading from a stream can instantaneously decode each individual fully received sequence, without first having to wait for either the first byte of a next sequence or an end-of-stream indication. The length of multi-byte sequences is easily determined by humans as it is simply the number of high-order 1s in the leading byte. An incorrect character will not be decoded if a stream ends mid-sequence.

https://en.wikipedia.org/wiki/UTF-8#Comparison_with_other_en...
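The prefix-code property quoted above means the leading byte alone determines the sequence length; a Python sketch:

```python
def sequence_length(lead: int) -> int:
    """Bytes in a UTF-8 sequence, from its leading byte alone: the count
    of high-order 1 bits (zero high-order 1s means a 1-byte ASCII char)."""
    if lead < 0x80:
        return 1                 # 0xxxxxxx
    n = 0
    while lead & (0x80 >> n):    # count leading 1 bits
        n += 1
    return n                     # 110xxxxx -> 2, 1110xxxx -> 3, 11110xxx -> 4

assert sequence_length("a".encode("utf-8")[0]) == 1
assert sequence_length("é".encode("utf-8")[0]) == 2
assert sequence_length("€".encode("utf-8")[0]) == 3
assert sequence_length("😂".encode("utf-8")[0]) == 4
```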

[+] masklinn|4 years ago|reply
UTF-8 as defined (or restricted) is a prefix code, it gets all relevant information on the first read, and the rest on the (optional) second. Your scheme requires an unbounded number of reads.

> - It is easily extensible for more bits.

UTF8 already is easily extensible to more bits, either 7 continuation bytes (and 42 bits), or infinite. Neither of which is actually useful to its purposes.

> The last point is key: UTF-8 would need to invent a new prefix to go beyond 21 bits

UTF8 was defined as encoding 31 bits over 6 bytes. It was restricted to 21 bits (over 4 bytes) when unicode itself was restricted to 21 bits.

[+] stkdump|4 years ago|reply
The current scheme is extensible to 7x6=42 bits (which will probably never be needed). The advantage of the current scheme is that when you read the first byte you know how long the code point is in memory and you have less branching dependencies, i.e. better performance.

EDIT: another huge advantage is that lexicographical comparison/sorting is trivial (usually the ascii version of the code can be reused without modification).
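That sorting property is easy to demonstrate in Python: comparing raw UTF-8 bytes gives the same order as comparing the codepoint sequences themselves:

```python
import random

# UTF-8 byte strings sort in the same order as their codepoint sequences,
# so plain memcmp-style comparison works on the encoded bytes unchanged.
words = ["aber", "zügig", "Ärger", "😀", "e\u0301", "\u00e9"]
random.shuffle(words)

by_codepoints = sorted(words)                            # str comparison
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))  # byte comparison
assert by_codepoints == by_bytes
```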

[+] cryptonector|4 years ago|reply
UTF-8 is self-resynchronizing. You can scan forwards and/or backwards and all you have to do is look for bytes that start a UTF-8 codepoint encoding to find the boundaries between codepoints. It's genius.
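A Python sketch of that backward resynchronization (the function name is illustrative):

```python
def prev_boundary(data: bytes, pos: int) -> int:
    """Back up from an arbitrary byte position to the start of the
    codepoint it falls in: skip continuation bytes (0b10xxxxxx)."""
    while pos > 0 and data[pos] & 0xC0 == 0x80:
        pos -= 1
    return pos

data = "x😂y".encode("utf-8")    # b'x' + 4 emoji bytes + b'y'
# Land in the middle of the emoji and resync to its first byte.
assert prev_boundary(data, 3) == 1
assert data[prev_boundary(data, 3):5].decode("utf-8") == "😂"
```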
[+] nayuki|4 years ago|reply
Excellent presentation! One improvement to consider is that many usages of "code point" should be "Unicode scalar value" instead. Basically, you don't want to use UTF-8 to encode UTF-16 surrogate code points (which are not scalar values).
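Python's strict UTF-8 codec illustrates the distinction: it refuses to encode surrogate code points, which are code points but not scalar values:

```python
# Surrogates (U+D800-U+DFFF) exist only as UTF-16 machinery;
# well-formed UTF-8 must never encode them.
lone_surrogate = "\ud83d"        # a leading surrogate with no partner

try:
    lone_surrogate.encode("utf-8")
    encoded = True
except UnicodeEncodeError:
    encoded = False
assert not encoded               # strict UTF-8 rejects it

# The scalar value U+1F602 encodes fine, as four bytes.
assert "\U0001F602".encode("utf-8") == b"\xf0\x9f\x98\x82"
```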

Fun fact, UTF-8's prefix scheme can cover up to 31 payload bits. See https://en.wikipedia.org/wiki/UTF-8#FSS-UTF , section "FSS-UTF (1992) / UTF-8 (1993)".

A manifesto that was much more important ~15 years ago when UTF-8 hadn't completely won yet: https://utf8everywhere.org/

[+] masklinn|4 years ago|reply
> Fun fact, UTF-8's prefix scheme can cover up to 31 payload bits.

It’d probably be more correct to say that it was originally defined to cover 31 payload bits: you can easily complete the first byte to get 7 and 8 byte sequences (35 and 41 bits payloads).

Alternatively, you could save the 11111111 leading byte to flag the following bytes as counts (5 bits each since you’d need a flag bit to indicate whether this was the last), then add the actual payload afterwards, this would give you an infinite-size payload, though it would make the payload size dynamic and streamed (where currently you can get the entire USV in two fetches, as the first byte tells you exactly how many continuation bytes you need).

[+] nabla9|4 years ago|reply
>NOTE: You can always find a character boundary from an arbitrary point in a stream of octets by moving left an octet each time the current octet starts with the bit prefix 10 which indicates a tail octet. At most you'll have to move left 3 octets to find the nearest header octet.

This is incorrect. You can only find boundaries between code points this way.

Until you learn that not all "user-perceived characters" (grapheme clusters) can be expressed as a single code point, Unicode seems cool. These UTF-8 explanations explain the encoding but leave out this unfortunate detail. The author might not even know this, because they deal with a subset of Unicode in their life.

If you want to split text between two user-perceived characters, not merely between code points, this tutorial does not help.

Unicode encodings are great if you want to handle a subset of languages and characters; if you want to be complete, it's a mess.
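A quick Python illustration of the gap (the stdlib has no grapheme segmentation; that requires UAX #29 rules, e.g. via the third-party `regex` module):

```python
# One "user-perceived character" can be several code points, so neither
# byte length nor code-point length counts what users actually see.
flag = "🇺🇸"                        # regional indicators U+1F1FA U+1F1F8
assert len(flag) == 2               # two code points...
assert len(flag.encode("utf-8")) == 8   # ...eight UTF-8 bytes, one glyph

accented = "e\u0301"                # 'e' + combining acute accent
assert len(accented) == 2           # two code points, rendered as one "é"
# Splitting between the code points mangles the text:
assert accented[:1] == "e"          # the accent is orphaned
```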

[+] SethMLarson|4 years ago|reply
You're right, that should read "codepoint boundary" not "character boundary". I can fix that.

I do briefly mention grapheme clusters near the end, didn't want to introduce them as this article was more about the encoding mechanism itself. Maybe a future article after more research :)

[+] dspillett|4 years ago|reply
Not sure if the issue is with Chrome or my local config generally (bog-standard Windows, nothing fancy), but the US-flag example doesn't render as intended. It shows as "US", with the components in the next step being "U" and "S" (not the ASCII characters U & S - the encoding is as intended, but those characters are shown in place of the intended ones).

It displays as (I assume) intended in Firefox on the same machine: an American-flag emoji, then, when broken down in the next step, U-in-a-box and S-in-a-box. The other examples seem fine in Chrome.

Take care when using relatively new additions to the Unicode emoji set; test to make sure your intentions are correctly displayed in all the browsers you might expect your audience to be using.

[+] SethMLarson|4 years ago|reply
Yeah, there's not much I can do there unfortunately (since I'm using SVG with the actual U and S emojis to show the flag). I can't comment on whether it's your config or not, but I've tested the SVGs on iOS and Firefox/Chrome on desktop to make sure they rendered nicely for most people. Sorry you aren't getting a great experience there.

Here's how it's rendering for me on Firefox: https://pasteboard.co/rjLtqANVQUIJ.png

[+] daenz|4 years ago|reply
Great explanation. The only part that tripped me up was in determining the number of octets to represent the codepoint. From the post:

>From the previous diagram the value 0x1F602 falls in the range for a 4 octets header (between 0x10000 and 0x10FFFF)

Relying on the diagram in the post would be a crutch. It seems easier to remember the maximum number of "data" bits each octet layout can support (7, 11, 16, 21). Then, knowing that 0x1F602 maps to 11111011000000010, which is 17 bits, you know it must fit into the 4-octet layout, which can hold 21 bits.
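That bit-counting rule is small enough to sketch in Python (the helper name is mine; real encoders additionally exclude surrogates within the 3-octet range):

```python
def utf8_length(codepoint: int) -> int:
    """Octets needed, from the payload capacity of each layout:
    7, 11, 16, and 21 bits for 1-4 octets respectively."""
    bits = max(codepoint.bit_length(), 1)
    for octets, capacity in ((1, 7), (2, 11), (3, 16), (4, 21)):
        if bits <= capacity:
            return octets
    raise ValueError("beyond U+10FFFF")

assert utf8_length(0x1F602) == 4     # 17 bits -> 4-octet layout
assert utf8_length(ord("é")) == 2    # 0xE9, 8 bits -> 2 octets
assert utf8_length(0x7F) == 1
# Matches what Python's encoder actually produces:
assert utf8_length(0x1F602) == len(chr(0x1F602).encode("utf-8"))
```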

[+] loco5niner|4 years ago|reply
Excellent article, really helped me learn.

I'd like to add a correction: the binary ASCII/UTF-8 value of 'a' (hex 0x61) is not 01010111 but 01100001.

It is used incorrectly both in the Giant Reference Card and in the "ASCII encoding" diagram above it.
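Easy to confirm in a Python REPL, and the stray pattern even decodes to a letter, which hints at how the typo may have crept in:

```python
# 'a' is U+0061: one ASCII byte, 0b01100001, identical in ASCII and UTF-8.
assert ord("a") == 0x61
assert format(ord("a"), "08b") == "01100001"
assert "a".encode("utf-8") == b"\x61"

# The incorrect pattern 0b01010111 is 0x57, which is 'W'.
assert chr(0b01010111) == "W"
```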

[+] DannyB2|4 years ago|reply
There is an error in the first example under Giant Reference Card.

The bytes come out as:

0xF0 0x9F 0x87 0xBA 0xF0 0x9F 0x87 0xBA

but the bits directly above them all have the bit pattern: 010 10111

[+] SethMLarson|4 years ago|reply
Great eye! I'll fix this and push it out.
[+] satysin|4 years ago|reply
This is without question one of the best short technical presentations I've seen. To the author hats off to a masterful job.
[+] jsrcout|4 years ago|reply
This may be the first explanation of Unicode representation that I can actually follow. Great work.
[+] SethMLarson|4 years ago|reply
Wow, thank you for the kind words. You've made my morning!!