top | item 47011801

(no title)

nofriend | 16 days ago

Your explanation makes it sound like an incredibly stupid decision. I imagine what you're getting at is that 3 bytes were/are sufficient for the Basic Multilingual Plane, which is incidentally also what can be represented in a single utf-16 code unit (one 2-byte pair). So they imposed the same limitation on utf-8 that utf-16 had. This would have seemed logical in a world where utf-16 was the default and utf-8 was some annoying exception they had to get out of the way.
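To make the 3-byte / single-code-unit correspondence concrete, here is a small Python sketch. The helper name `encoded_sizes` and the sample characters are my own picks for illustration, not anything from MySQL:

```python
def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 code unit count) for one character."""
    utf8_bytes = len(ch.encode("utf-8"))
    # Each UTF-16 code unit is 2 bytes; "utf-16-le" avoids the 2-byte BOM.
    utf16_units = len(ch.encode("utf-16-le")) // 2
    return utf8_bytes, utf16_units

# Hypothetical sample points: ASCII, Latin-1, last BMP code point,
# first supplementary code point, and an emoji.
for cp in (0x41, 0xE9, 0xFFFF, 0x10000, 0x1F600):
    utf8_bytes, utf16_units = encoded_sizes(chr(cp))
    print(f"U+{cp:04X}: {utf8_bytes} UTF-8 byte(s), {utf16_units} UTF-16 unit(s)")
```

Everything up to U+FFFF fits in at most 3 UTF-8 bytes and one UTF-16 code unit; anything above needs 4 UTF-8 bytes and a surrogate pair.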

evanelias | 16 days ago

OK, but that makes perfect sense given utf-16 was actually quite widespread in 2003! For example, Windows APIs, MS SQL Server, JavaScript (off the top of my head)... these all still primarily use utf-16 even today. And MySQL also supports utf-16 among many other charsets.

There wasn't a clear winner in utf-8 at the time, especially given its 6-byte-max representation back then. Memory and storage were a lot more limited.

And yes, while 6 bytes was the maximum, a bunch of critical paths (e.g. sorting logic) in old MySQL required allocating a worst-case buffer size, so supporting the full range would have been prohibitively expensive.
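As rough arithmetic for the worst-case point, a sketch in Python. The 255-character column length and the function name are assumptions chosen for illustration, not actual MySQL internals:

```python
def worst_case_buffer_bytes(max_chars, max_bytes_per_char):
    """Bytes needed if every character takes the maximum encoded width."""
    return max_chars * max_bytes_per_char

# Hypothetical 255-character column under three maximum encoded widths.
for width in (3, 4, 6):
    print(f"{width}-byte max: {worst_case_buffer_bytes(255, width)} bytes")
```

Going from a 3-byte to a 6-byte maximum doubles every such allocation, whether or not the data ever contains a character that wide.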

booi | 3 days ago

This still makes no sense. The UTF-8 standard was really adopted in 1998-ish, and it was already variable-width, using 1 to 4 bytes. MySQL 4.1, which introduced the utf8 charset, was released in 2004.

Even if there were no codepoints in the 4-byte range yet, they could and should have implemented it anyway. It literally does not take any more storage because it is a variable width encoding.
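A quick Python sketch of the storage claim. `encode_with_limit` is a made-up helper that mimics an encoder capped at N bytes per character; it is not how MySQL implements its charsets:

```python
def encode_with_limit(text, max_bytes_per_char):
    """UTF-8 encode text, rejecting any character wider than the limit."""
    out = bytearray()
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > max_bytes_per_char:
            raise ValueError(f"U+{ord(ch):04X} needs {len(encoded)} bytes")
        out += encoded
    return bytes(out)

# Text containing only BMP characters (hypothetical sample).
bmp_text = "na\u00efve caf\u00e9 \u65e5\u672c\u8a9e"

# The stored bytes are identical under a 3-byte and a 4-byte cap.
same = encode_with_limit(bmp_text, 3) == encode_with_limit(bmp_text, 4)
print(same, len(encode_with_limit(bmp_text, 4)))
```

The cap only matters for characters outside the BMP, which the 3-byte version rejects instead of storing. Note this is about stored data on disk; it does not cover buffers sized for the worst case.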