In MySQL, don’t use “utf8”, use “utf8mb4” (2016)

[+] chrismorgan|7 years ago|reply

Possibly my favourite thing about the rise of emoji is that they’re not Basic Multilingual Plane, and so stupid hacks like MySQL’s utf8 character set (I seriously don’t know why anyone ever thought it was a good idea in 2002–2004) are now obviously insufficient for full Unicode support, and enough people care about them that adoption of this basic level of Unicode is driven, and so non-English speakers are inadvertently helped.

The next hurdle is to convince people that extended grapheme clusters are the smallest divisible unit, and that just like slicing at the byte level is insufficient if you’re doing UTF-8, slicing at the code point level is insufficient. Basically, strings should just be treated as opaque data in almost all places.

[+] chrismorgan|7 years ago|reply

While I’m contemplating why people might have thought utf8mb3 was wise in 2004, I’ll ask a similar question that I’d love to know the answer to, of events from my early childhood: why did UTF-16 gain popularity? I get UCS-2 existing, but once it became evident it wasn’t enough, why make UTF-16 and ruin Unicode with surrogate pairs (which still cause trouble today, leading to things like WTF-8 to make it possible for web browsers to abandon their UTF-16/UCS-2 mess in favour of almost-UTF-8)? UTF-8 was around by early 1993, while UTF-16 was developed several years later and seems fairly obviously the worst of both worlds—twice as large for normal text, ASCII-incompatible, and not even a fixed-width encoding, the only thing it almost had over UTF-8. The only thing it had going for it was that it was roughly compatible with UCS-2; but was UCS-2 that popular at the time?

(Concerning UCS-2, hindsight says it should have been obvious 65,536 code points was going to be insufficient, but I’ll overlook that as an easy mistake to make; that part is better documented than the rise of UTF-16.)

Possibly the most helpful description I have is from https://en.wikipedia.org/wiki/UTF-16#History, but even that doesn’t explain why it became popular. Why did Java adopt it, Windows, browsers, &c.? Why?

[+] philbarr|7 years ago|reply

> not Basic Multilingual Plane, and so stupid hacks like the utf8 character set (I seriously don’t know why anyone ever thought it was a good idea in 2004) are now obviously insufficient for full Unicode support

From [0]:

UTF-8 was designed with a much larger limit of 231 (2,147,483,648) code points (32,768 planes), and can encode 221 (2,097,152) code points (32 planes) even if limited to 4 bytes

What is wrong with UTF-8?

[0]: https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilin...

[+] paidleaf|7 years ago|reply

> and so stupid hacks like MySQL’s utf8 character set (I seriously don’t know why anyone ever thought it was a good idea in 2002–2004) are now obviously insufficient for full Unicode support

If I had to guess, they did it for space and performance reasons so that their benchmark results are a tiny bit better than other RDBMs. They must've figured that their limited utf8 encoding was enough. And they reasoned the immediate benefits on benchmark tests outweighed future considerations.

I can't imagine any other reason for not implementing the utf-8 encoding completely and covering the unicode character set in its entirety.

[+] udp|7 years ago|reply

There’s nothing wrong with slicing at the byte level if you’re slicing with a delimiter like \n or “. UTF-8 sets high bits so continuation bytes never look like ASCII characters.

[+] unknown|7 years ago|reply

[deleted]

[+] freeone3000|7 years ago|reply

Combined emoji, such as 🇺🇸, are great for this as well! Things like also help - separating them out, even on spaces, breaks the emoji. We are really making progress towards proper processing.

[+] amelius|7 years ago|reply

> Possibly my favourite thing about the rise of emoji is that they’re not Basic Multilingual Plane

When can we have user-defined glyphs/emoji, where SVG is embedded in the unicode string?

[+] unknown|7 years ago|reply

[deleted]

[+] OnlyRepliesToBS|7 years ago|reply

[deleted]

[+] morgo|7 years ago|reply

Hi! Former Product Manager for MySQL Server here. I have a blog post on this here: https://mysqlserverteam.com/mysql-8-0-when-to-use-utf8mb3-ov...

The default for MySQL 8.0 (current GA) is utf8mb4. I also recommend reading this introduction post here: http://mysqlserverteam.com/sushi-beer-an-introduction-of-utf...

[+] JdeBP|7 years ago|reply

> An optimization was chosen to limit utf8 to 3 bytes

You chose not to define the UNICODE_32BIT macro that was in the original source from 2002?

* https://github.com/mysql/mysql-server/commit/55e0a9cb01af4b0...

[+] im3w1l|7 years ago|reply

Do you have a collation were no two different strings compare equal? Have you considered imposing an arbitrary order where you don't know which one "should" go first?

[+] Dylan16807|7 years ago|reply

Extremely important: If you use mysqldump, make sure it's also using utf8mb4. There is a very high chance it defaults to utf8 and silently corrupts your data.

[+] crazygringo|7 years ago|reply

Oh my god, this is both terrifying and could one day prove to be the most valuable comment I've seen on HN.

I love MySQL, but the fact that they didn't make utf8mb4 the default for mysqldump (instead of utf8), when they created utf8mb4 for database use... is one of the most irresponsible and negligent pieces of engineering I've seen in a long time, considering the amount of data worldwide backed up with mysqldump.

The fact that a program designed solely to back up databases would default to silent corruption like this is mind-boggling. At least they've fixed this in the future for MySQL 8.0.

[+] tambourine_man|7 years ago|reply

Thanks, that would be:

mysqldump --default-character-set=utf8mb4 -u user database > dump.sql

right?

[+] acabal|7 years ago|reply

This bit me hard many years ago in possibly the most frustrating bug I've encountered--frustrating because of the stupidity of MySQL's implementation. I had a utf8 MySQL database that was created back before utf8mb4 existed--this is, back in the days when emojis were just getting popular, but not quite popular enough to be on software keyboards; and when only a few people had smartphones and iPads, but not everyone.

People would post long-form content to my site that took a long time for them to compose, and maybe once a month, someone would email me to complain that their long post was mysteriously cut short. Inspecting the database showed that the content was being stored there cut off, but with no rhyme or reason. Nothing in the error logs. 95% of other posts went through fine regardless of length or complexity. I put in tons of debugging code at various levels of the app to see where the data was being mis-parsed or truncated. Nothing!

The bug persisted for years without a hint. Finally, emojis became more popular, the bug started occurring more and more often, and MySQL released utf8mb4; and then it dawned on me: Occasionally, people would submit content that contained a Unicode emoji generated by their software keyboard, and MySQL would silently truncate the emoji and all the data after it WITHOUT WARNING when storing it in a utf8 column!! Silently!

I was so mad. A bug that plagued users for years, so much time and effort spent trying to debug it, and it was because of a combination of MySQL's dumb engineering decision to not support full Unicode for something called "utf8" and then silently truncating data without any warning.

I don't know what the lesson is here. But I do know it made me really, really mad at MySQL when I found out what the problem was.

[+] hinkley|7 years ago|reply

I already had substantial beef with MySQL and took a job I wanted despite the fact we were using it. Eight months later we are fixing exactly this bug.

In the world of production code issues most people rank data loss as the most severe or the second most severe category of error.

Designing this kind of crap into a database is just pissing on your users. It was the last straw. I can’t trust you if you work like this. I’m not working with any code written by these clowns again.

[+] mnm1|7 years ago|reply

I think the lesson is "don't silently fail," and in that regard mysql has failed at learning that lesson over and over again to the point where the lesson is now, "don't use mysql" because what else can you do when dealing with such shitty software? This is far, far from the only stupid bug like this. Use mysql only if you don't care about preserving data. Period. There are many valid use cases that don't care about preserving data 100%, apparently most companies, so it's not a huge issue in reality. Hopefully the data that deals with monetary transactions won't get corrupted, but when it does, one cannot complain if they used mysql. It's that simple.

[+] Quarrelsome|7 years ago|reply

Friend of mine just recounted a story of when this bug killed all their MySql instances across an entire enterprise installation because a user sent an email with the hamster face emoji in the title.

[+] bri3d|7 years ago|reply

I worked at large company where an emoji or other 4-byte UTF character in a commit message would take down multiple systems simultaneously, including JIRA and various internal auditing and pipeline monitoring tools. It happened surprisingly rarely considering the lack of safeguard against the issue.

[+] skywhopper|7 years ago|reply

This is a fascinating tale. But the #1 takeaway by the author is dangerous: "Database systems have subtle bugs and oddities, and you can avoid a lot of bugs by avoiding database systems."

This is correct as far as it goes, but it's poor advice. You may as well say "Software has subtle bugs and oddities, and you can avoid a lot of bugs by avoiding software." True! But not helpful! If you don't need a relational database, then yes, don't use one. But there's nothing specific about databases as a category that deserves a special callout. You can also avoid a lot of bugs by avoiding TLS, containers, threads, coroutines, blocking calls, asynchronous calls, networks, filesystems, operating systems, and computers.

[+] kimi|7 years ago|reply

With a slight difference: MySQL or whatnot is battle-tested through millions of systems worldwide, while your in-memory data structure that you save to disk.... :-)

[+] drb91|7 years ago|reply

Dangerous? How do you figure?

Many people assume a DB is the only way to store data. Now that’s dangerous.

[+] ofrzeta|7 years ago|reply

I've "bookmarked" this: https://mathiasbynens.be/notes/mysql-utf8mb4

It digs a little bit deeper on the MySQL side (and it's from 2012 btw)

[+] chx|7 years ago|reply

> If you need a database, don’t use MySQL or MariaDB. Use PostgreSQL.

This is stupid. There are pros and cons to every database. For example, MySQL allows you to set a trigger on an op, PostgreSQL requires you to write a function first.

[+] Strilanc|7 years ago|reply

Wait, isn't 4 bytes per character also not enough to store UTF8? Can't each character be up to 8 bytes long, because the number of leading 1s in the first byte determines the length of the character?

[+] ainar-g|7 years ago|reply

According to the Wiki, the original UTF-8 proposal used up to 6 bytes.[1] But the Unicode standards people decided that 21 bit should be enough for everyone:

>Since the restriction of the Unicode code-space to 21-bit values in 2003, UTF-8 is defined to encode code points in one to four bytes, depending on the number of significant bits in the numerical value of the code point.

So, if they ever decide to use those extra two bytes, it will probably not be called UTF-8. Perhaps something like "UTF-8-MB-6".

[1] https://en.wikipedia.org/wiki/UTF-8#History

[+] philbarr|7 years ago|reply

> Can't each character be up to 8 bytes long

I don't think so. Not to point you to the wiki again but [0]. That's a pretty good description of how it works, so you don't get to use the all of the bits in the first byte to describe how long the character is in bytes.

https://en.wikipedia.org/wiki/UTF-8#Description

[+] unknown|7 years ago|reply

[deleted]

[+] MrBuddyCasino|7 years ago|reply

"An optimization was chosen to limit utf8 to 3 bytes, enough to handle almost all modern languages."

I can't for the life of me imagine how, when discussing this in a room full of competent developers, everyone just went "great idea, lets do this". Just adds to the long list of questionable MySQL engineering decisions.

[+] evanelias|7 years ago|reply

Sure, in hindsight, 3 bytes is an abomination to us today. That was almost certainly not apparent to developers at the time.

MySQL added its three-byte utf8 support in mysql 4.1, first beta release in April 2003 [1]. Meanwhile the utf8 standard originally permitted up to 6 bytes per char, until officially changed in November 2003 [2].

Unicode support in web software -- and even programming languages -- was still rapidly evolving in 2003. The MySQL team probably just made a bad guess along the lines of "the standard allows 6 bytes but that has terrible perf and memory implications for us, and most people only use 3 bytes, so let's limit to that".

While I've lost track of how many hundreds of times I've had to tell coworkers "use utf8mb4 and not utf8 in mysql", I don't begrudge the mysql dev team for not being clairvoyant in 2003!

[1] https://lists.mysql.com/announce/161

[2] https://en.wikipedia.org/wiki/UTF-8#History

[+] jeltz|7 years ago|reply

I am pretty sure it was not because of any optimization but that the real reason was to make sure a VARCHAR(255) would still fit within the maximum field size of InnoDB (768 bytes) at the time.

Since then MySQL has added ROW_FORMAT=DYNAMIC to InnoDB which allows for larger values, but that probably required major changes to InnoDB.

[+] ofrzeta|7 years ago|reply

Side note: It's not enough to change the encoding of the database, you need to make sure that all involved parties use the same encoding (or convert the charset at the interfaces).

For instance in a Laravel app I had to set the charset in the database configuration file "app/config/database.php".

[+] willvarfar|7 years ago|reply

> I’ll make a sweeping statement here: all MySQL and MariaDB users who are currently using “utf8” should actually use “utf8mb4”. Nobody should ever use “utf8”.

I completely agree with this sentiment, although I recall being bitten by it in indexes. The article goes on to talk about how big a CHAR(1) is if its utf8 vs utf8mb4. Well, the same applies to varchars and so a varchar(64) is potentially requiring 64, 192 or 384 bytes, depending on the charset. This can bite you because the MySQL query planner limits the longest prefix used for index scans to iirc 767 or so bytes. So I migrated some columns from utf8 to utf8mb4 and suddenly my compound indices stopped being used because the varchar contents could legally exceed that maximum. Something like that.

[+] chrismorgan|7 years ago|reply

One troublesome thing about utf8mb4: if you have a VARCHAR(n) column, this means 4n bytes are required; this doesn’t mix particularly well with indexes, as you can only index columns of up to 767 bytes; VARCHAR(192) is thus 768 bytes, and too long to be indexed—so you’re limited to VARCHAR(191) or less.

This leads to things like django-celery flat-out not working on utf8mb4 without a (history-rewriting) patch, because it had a VARCHAR(255) primary key: https://github.com/celery/django-celery/issues/259.

[+] Arie|7 years ago|reply

This is supposedly fixed in databases created with MySQL 5.7 and up.

> MySQL 5.7 (2015) added some optimizations such as a variable length sort buffer, and also changed InnoDB’s default row format to DYNAMIC. This allows for indexes on VARCHAR(255) with utf8mb4; something that made migrations more difficult prior.

https://mysqlserverteam.com/mysql-8-0-when-to-use-utf8mb3-ov...

[+] pilif|7 years ago|reply

then again, indexes on VARCHAR() can only really be used for equality and start matches.

Now I'm not saying it's impossible, but I have a feeling you're not likely going to do a lot of `where string_column =… ` or `where string_column like '…%'` where … is 192 characters in length.

If you need to do substring search or full text search, a traditional index won't help you anyways.

[+] kreetx|7 years ago|reply

Given that the article says that everyone should use utf8mb4, then why don't/haven't they just fixed it? Should be easy enough, and probably safe, too.

[+] morgo|7 years ago|reply

The desire for backwards compatibility/easier upgrades is why utf8mb4 could not just replace utf8:

The sorting order for the utf8 collations is not identical to utf8mb4 ones, and so an upgrade would require indexed columns to be rebuilt (remember, indexes are sorted).

I think with hindsight it would have been better to rip the bandaid off. 8.0 took a middle of the road approach by switching the default to utf8mb4.

[+] ofrzeta|7 years ago|reply

They have. It's the default in MySQL 8.

[+] johannes1234321|7 years ago|reply

This can't silently be changed as the setting has impact on disk format for data and indexes and so and will be confusing when using different server versions, i.e. in replication.

[+] spectrox|7 years ago|reply

It is worth mentioning that if you're stuck with InnoDB + COMPACT + utf8mb4 then you should be using VARCHAR(191) instead of VARCHAR(255) for unique columns.

https://dev.mysql.com/doc/refman/5.7/en/innodb-row-format-an...

[+] inyourtenement|7 years ago|reply

Currently leading a massive thorny project to convert all of our databases to utf8mb4. Definitely heed this warning.

[+] cdevnull|7 years ago|reply

I use utf8 for "normal" field like name, address, etc.. (don't want smiley as first name). And use utf8mb4 for comment for example.

But I need filter/validate each inputs (and refuse when there is 4 bytes characters).

[+] khaled|7 years ago|reply

There are more characters outside BMP than just emoji.

[+] Ryckes|7 years ago|reply

Sorry for not doing my research but, aren't there 4-byte regular characters from other languages? Is it all emojis and symbols?

[+] codedokode|7 years ago|reply

In case you want to convert your MySQL database tables from utf8 to utf8mb4 please remember that each character will now take 4, not 3 bytes. This means you can easily run out of limit for a row length (65536 bytes) if you have many string columns and your indexes will grow. Also, if a column is a part of a foreign key reference you might have to remove the foreign key first, upgrade columns on both sides and add the foreign key back.

So converting an existing database might be not as easy as it seems before.

[+] danso|7 years ago|reply

The article mentions it, but one of the massive frustrations with this issue is the huge amount of misinformation surrounding it (at least back in 2013), with many well-intended users on StackOverflow and the MySQL forums giving the exactly wrong advice on collation. Naming things, and not getting confused by bad names, is damned hard. This issue led me to quit MySQL for good, I can’t think of a more trivial issue that caused me such confusion and frustration as a software engineer.

[+] cyberferret|7 years ago|reply

Good reminder. I did read a well hidden article many years ago which highly recommended this. I believe I was hitting an obscure, arcane bug like the OP mentioned in the article, and after much digging managed to find this fix.

Since then, I always set every new MySQL database (and even double check the tables) to be 'utf8mb4' even long after the original lessons and pain have been forgotten.

160 comments