
You can't just assume UTF-8

195 points | calpaterson | 1 year ago | csvbase.com

471 comments

[+] JonChesterfield|1 year ago|reply
How about assume utf-8, and if someone has some binary file they'd rather a program interpret as some other format, they turn it into utf-8 using a standalone program first. Instead of burning this guess-what-bytes-they-might-like nonsense into all the software.

We don't go "oh that input that's supposed to be json? It looks like a malformed csv file, let's silently have a go at fixing that up for you". Or at least we shouldn't, some software probably does.

[+] ezoe|1 year ago|reply
I doubt you can handle UTF-8 properly with that attitude.

The problem is that there is one very popular OS on which it is very hard to enforce UTF-8 everywhere: Microsoft Windows.

It's very hard to ensure that the whole software stack you depend on uses the Unicode version of the Win32 API. The native character encoding in Windows is actually UTF-16, so you can't just assume UTF-8. If you're writing low-level code, you have to convert UTF-8 to UTF-16 and back. Even if you don't, you have to ensure that all the low-level code you depend on does that for you.

Oh, and don't forget about Unicode normalization. There is no THE UTF-8. There are a bunch of UTF-8s with different Unicode normalizations. Apple's macOS uses NFD while most others use NFC.

These are just some examples. When people living in an ASCII world casually say "I just assume UTF-8", in reality they are still assuming ASCII.
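(The normalization point is easy to demonstrate; a minimal Python sketch using only the stdlib, with variable names of my choosing:)

```python
import unicodedata

s = "\u00e9"  # "é" as one precomposed code point
nfc = unicodedata.normalize("NFC", s)
nfd = unicodedata.normalize("NFD", s)  # "e" plus combining acute U+0301

# Both render identically, but their UTF-8 bytes differ:
assert nfc.encode("utf-8") == b"\xc3\xa9"   # 2 bytes
assert nfd.encode("utf-8") == b"e\xcc\x81"  # 3 bytes
assert nfc != nfd  # naive string comparison fails across normal forms
```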

[+] 998244353|1 year ago|reply
Non-technical users don't want to do that, and won't understand any of that. That's the unfortunate reality of developing software for people.

If Excel generates CSV files with some Windows-1234 encoding, then my "import data from CSV" function needs to handle that, in one way or another. A significant number of people generate CSV files from Excel and if these people upload these files into my application and I assume an incorrect encoding, they won't care that Microsoft is using obsolete or weird defaults. They will see it as a bug in my program and demand that I fix my software. Even if Excel offers them a choice in encoding, they won't understand any of that and more importantly they don't want to deal with that right now, they just want the thing to work.
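(One pragmatic way to handle such uploads, sketched in Python; this is an illustration, not anyone's actual import code. It tries strict UTF-8 first and falls back to Windows-1252:)

```python
def read_text_lenient(raw: bytes) -> str:
    """Decode uploaded bytes: prefer strict UTF-8, fall back to Windows-1252."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # cp1252 with errors="replace" never raises, though the result may
        # still be wrong for genuinely exotic encodings.
        return raw.decode("cp1252", errors="replace")

# The same text survives either way:
assert read_text_lenient("naïve".encode("utf-8")) == "naïve"
assert read_text_lenient("naïve".encode("cp1252")) == "naïve"
```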

[+] zarzavat|1 year ago|reply
Agreed. Continuing to support other encodings is like insisting that cars should continue to have cassette tape players.

It’s much easier to tell the people with old cassette tapes to rip them, rather than try to put a tape player in every car.

[+] ryandrake|1 year ago|reply
> We don't go "oh that input that's supposed to be json? It looks like a malformed csv file, let's silently have a go at fixing that up for you". Or at least we shouldn't, some software probably does.

Whatever happened to the Robustness Principle[1]? I think the entire comment section of this article has forgotten it. IMO the best software accepts many formats and "deals with it," or at least attempts to, rather than just exiting with "Hahah, Error 19923 Wrong Input Format. Try again, loser."

1: https://en.wikipedia.org/wiki/Robustness_principle

[+] fl7305|1 year ago|reply
> they turn it into utf-8 using a standalone program first

I took the article to be for people who would be writing that "standalone program"?

I have certainly been in a position where I was the person who had to deal with input text files with unknown encodings. There was no-one else to hand off the problem to.

[+] SuperNinKenDo|1 year ago|reply
Not every encoding can make a round trip through Unicode without you writing ad hoc handling code for every single one. There are a number of reasons some of these encodings are still in use, and Unicode destroying information is one of them.
[+] thaumasiotes|1 year ago|reply
> We don't go "oh that input that's supposed to be json? It looks like a malformed csv file, let's silently have a go at fixing that up for you". Or at least we shouldn't, some software probably does.

Browsers used to have a menu option to choose the encoding you wanted to use to decode the page.

In Firefox, that's been replaced by the magic option "Repair Text Encoding". There is no justification for this.

They seem to be in the process of disabling that option too:

> Note: On most modern pages, the Repair Text Encoding menu item will be greyed-out because character encoding changes are not supported.

( https://support.mozilla.org/en-US/kb/text-encoding-no-longer... )

This note is logical gibberish; encoding isn't something that has to be supported by the page. Decoding is a choice by the browser!

[+] kelnos|1 year ago|reply
Solutions that require lots of unrelated people to start doing something a different way are not really solutions.
[+] kstrauser|1 year ago|reply
If you give me a computer timestamp without a timezone, I can and will assume it's in UTC. It might not be, but if it's not and I process it as though it is, and the sender doesn't like the results, that's on them. I'm willing to spend approximately zero effort trying to guess what nonstandard thing they're trying to send me unless they're paying me or my company a whole lot of money, in which case I'll convert it to UTC upon import and continue on from there.

Same with UTF-8. Life's too short for bothering with anything else today. I'll deal with some weird janky encoding for the right price, but the first thing I'd do is convert it to UTF-8. Damned if I'm going to complicate the innards of my code with special case code paths for non-UTF-8.

If there were some inherent issue with UTF-8 that made it significantly worse than some other encoding for a given task, I'd be sympathetic to that explanation and wouldn't be such a pain in the neck about this. For instance, if it were the case that it did a bad job of encoding Mandarin or Urdu or Xhosa or Persian, and the people who use those languages strongly preferred to use something else, I'd understand. However, I've never heard a viable explanation for not using UTF-8 other than legacy software support, and if you want to continue to use something ancient and weird, it's on you to adapt it to the rest of the world because they're definitely not going to adapt the world to you.

[+] hnick|1 year ago|reply
It depends on the domain. If you are writing calendar software, it is legitimate to have "floating time" i.e. your medication reminder is at 7pm every day, regardless of time zone, travel, or anything else.

Unfortunately Google and many other companies have decided UTC is the only way, so this sometimes causes issues with ICS files that use that floating-time format when Gmail generates its helpful popups in the inbox.

[+] kccqzy|1 year ago|reply
> For instance, if it were the case that it did a bad job of encoding Mandarin

I don't know if you picked this example on purpose, but UTF-8-encoded Chinese is 50% larger than the old encoding (GB2312). I remember people caring about this twenty years ago. I don't know of anyone who still cares about this encoding inefficiency: any compression algorithm can remove it while using negligible CPU to decompress.
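(The size difference is easy to verify with a sample string of my choosing; the codecs used are in the Python stdlib:)

```python
s = "汉字编码"  # "Chinese character encoding", four common characters
utf8_bytes = s.encode("utf-8")
gb_bytes = s.encode("gb2312")

assert len(utf8_bytes) == 12  # 3 bytes per character in UTF-8
assert len(gb_bytes) == 8     # 2 bytes per character in GB2312
# 12 / 8 = 1.5: the UTF-8 output is 50% larger, as stated above
```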

[+] layer8|1 year ago|reply
> For instance, if it were the case that it did a bad job of encoding Mandarin

Please look up the issues caused by Han unification in Unicode. It’s an important reason why the Chinese and Japanese encodings are still used in their respective territories.

[+] LaffertyDev|1 year ago|reply
I can't help myself. The grandest of nitpicks is coming your way. I'm sorry.

> If you give me a computer timestamp without a timezone, I can and will assume it's in UTC.

Do you mean, give you an _offset_? In `2024-04-29T14:03:06.0000-08:00`, the `-08:00` is an offset. It only tells you when this stamp occurred relative to UTC. It does not tell you anything about the region or zone itself. While I have consumed APIs that give me the timezone context as part of the response, in none of them was it part of the timestamp itself.

The only time you should assume a timestamp is UTC is if it has the `Z` at the end (assuming ISO 8601) or is otherwise marked as UTC. Without that, you have absolutely no information about where or when the time occurred -- it is local time. And if your software assumes a local timestamp is UTC, then I argue it is not the sender of that timestamp's problem that your software is broken.

My desire to meet you at 4pm has no bearing on whether the DST switchover has happened, or whether my government decides to change the timezone rules, or {any other way the offset for a zone can change for future or past times}. My reminder to take my medicine at 7pm is not centered on UTC or my physical location on the planet. It's just at 7pm. Every day. If I go from New York to Paris then no, I do not want your software to tell me my medicine is actually supposed to be at midnight. It's 7pm.

But, assuming you aren't doing any future scheduling, calendar appointments, bookings, ticket sales, transportation departure, human-centric logs, or any of the other ways Local Time is incredibly useful -- ignore away.
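(The naive/offset/UTC distinction drawn above can be seen with Python's stdlib; the timestamp values are made up:)

```python
from datetime import datetime, timedelta

# Explicit UTC offset: an aware timestamp, pinned to an instant.
utc = datetime.fromisoformat("2024-04-29T14:03:06+00:00")
assert utc.utcoffset() == timedelta(0)

# An offset is still not a time zone: many zones share -08:00,
# and the offset alone says nothing about future DST rules.
offset = datetime.fromisoformat("2024-04-29T14:03:06-08:00")
assert offset.utcoffset() == timedelta(hours=-8)

# No offset at all: a naive "wall clock" time. Treating it as UTC
# is a guess, not information carried by the timestamp.
naive = datetime.fromisoformat("2024-04-29T14:03:06")
assert naive.tzinfo is None
```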

[+] logrot|1 year ago|reply
Fundamentally I agree, but sadly the world isn't that simple.

You usually end up having to deal with whatever eccentric sh!t ultimately comes from the same source as the payment for the job.

[+] mikhailfranco|1 year ago|reply
Developers should assume UTF-8 for text files going forward.

UTF-8 should have no BOM. It is the default. And there are no undefined Byte sequences that need an Order. Requiring a UTF-8 BOM just destroys the happy planned property that ASCII-is-UTF8. Why spoil that good work?

Other variants of Unicode have BOMs, e.g. UTF-16. We know CJK languages need UTF-16 for compression. The BOM is only a couple more bytes. No problem, so far so good.

But there are old files, that are in 'platform encoding'. Fine, let there be an OS 'locale', that has a default encoding. That default can be overridden with another OS 'encoding' variable. And that can be overridden by an application arg. And there may be a final default for a specific application, that is only ever used with one language encoding. Then individual files can override all of the above ...

Text file (serialization) formats should define handling of optional BOM followed by an ASCII header that defines the encoding of the body that follows. One can also imagine .3 filetypes that have a unique or default encoding, with values maintained by, say, IANA (like MIME types). XML got this right. JSON and CSV are woefully neglectful, almost to the point of criminal liability.

But in the absence of all of the above, the default-default-default-default-default is UTF-8.

We are talking about the future, not the past. Design for UTF-8 default, and BOMs for other Unicodes. Microsoft should have defined BOMs for Windows-Blah encodings, not for UTF-8!

When the whole world looks to the future, Microsoft will follow. Eventually. Reluctantly.

[+] jrochkind1|1 year ago|reply
The specific use case the OP author was focusing on was CSV. (A format which has no place to signal the encoding inline). They noted that, to this day, Windows Excel will output CSV in Win-1252. (And the user doing the CSV export has probably never even heard of encodings).

If you assume UTF-8, you will have corrupted text.

I agree that I'm mad about Excel outputting Win-1252 CSV by default.

[+] teknopaul|1 year ago|reply
IMHO, presume 8-bit, encapsulating 7-bit US-ASCII. That includes UTF-8 and many, many other encodings.

Don't interpret user-supplied strings at all. Define max lengths as byte lengths.

Remain agnostic of encoding. Especially in libraries.

It's easier than people think it is thanks to some very clever people's work a long time ago.

[+] PaulHoule|1 year ago|reply
Programming languages have lumbered slowly towards UTF-8 by default but from time to time you find an environment with a corrupted character encoding.

I worked at an AI company that was ahead of its time (actually I worked at a few of those) where the data scientists had a special talent for finding Docker images with strange configurations, so all the time I'd find out that one particular container was running with a Hungarian or some other wrong charset.

(And that's the problem with Docker... People give up on being in control of their environment and just throw in five different kitchen sinks and it works... Sometimes)

[+] calpaterson|1 year ago|reply
If csv files bring criminal liability then I am guilty.

Sidenote: this particular criminal conspiracy is open to potential new members. Please join the Committee To Keep Csv Evil: https://discord.gg/uqu4BkNP5G

Jokes aside, talking about the future is grand but the problem is that data was written in the past and we need to read it in the present. That means that you do have to detect encoding and you can't just decide that the world runs on UTF-8. Even today, mainland China does not use UTF-8 and is not, as far as I know, in the process of switching.

I understand UTF-8 is mostly fine even for East Asian languages though, and bytes are cheap.

[+] WorldMaker|1 year ago|reply
> We know CJK languages need UTF-16 for compression.

My understanding is that it is for the opposite of compression: it saves memory when uncompressed, versus UTF-8's longer multi-byte sequences for most CJK characters. My understanding is that those UTF-8 sequences compress pretty well, as they have common patterns that form dictionary "words" just as easily as anything else. UTF-8 seems to be winning in the long run even for CJK and astral-plane languages on disk, and the operating systems and applications that preferred UTF-16 in memory are mostly only doing so out of backwards compatibility, themselves often using UTF-8 buffers internally since those reflect the files at rest.

(.NET has a backwards compatibility based on using UTF-16 codepoint strings by default but has more and more UTF-8 only pathways and has some interesting compile time options now to use UTF-8 only today. Python 3 made the choice that UTF-8 was the only string format to support, even with input from CJK communities. UTF-8 really does seem to be slowly winning everything.)

> JSON and CSV are woefully neglectful,

As the article also points out, JSON probably got it right: UTF-8 only and BOM is an error (because UTF-8) (but parsers are allowed to gently ignore that error if they wish). https://www.rfc-editor.org/rfc/rfc8259#section-8.1

That seems to be the way forward for new text-based formats that only care about backward compatibility with low-byte ASCII: UTF-8 only, no BOM. UTF-8 (unlike UTF-16, which lacks the reserved surrogates it would need to extend itself further) is infinitely expandable if we ever do find a reason to go past the "astral planes".

(Anyone still working in CSV by choice is maybe guilty of criminal liability though. I still think the best thing Excel could do to help murder CSV is give us a file extension to force Excel to open a JSON file, like .XLJSON. Every time I've resorted to CSV has been because "the user needs to double click the file and open in Excel". Excel has great JSON support, it just won't let you double click a file for it, which is the only problem, because no business executive wants the training on "Data > From JSON" no matter how prominent in the ribbon that tool is.)

> When the whole world looks to the future, Microsoft will follow.

That ship is turning slowly. Windows backward compatibility guarantees likely mean that Windows will always have some UTF-16, but the terminals in Windows now correctly default to UTF-8 (since Windows 10) and even .NET with its compatibility decrees is more "UTF-8 native" than ever (especially when compiling for running on Linux, which is several layers of surprise for anyone that was around in the era where Microsoft picked UCS-2 as its one true format in the first place).

[+] Pet_Ant|1 year ago|reply
> Requiring a UTF-8 BOM just destroys the happy planned property that ASCII-is-UTF8. Why spoil that good work?

So that you know you are dealing with UTF-8. Assuming ASCII only works if you are only dealing with English texts and data.

[+] kazinator|1 year ago|reply
Indeed, you can't assume UTF-8.

What you do, rather, is drop support for non-UTF-8.

Work with tech-stacks whose text handling is based strictly around Unicode and UTF-8, and find enough opportunities that way that you don't have to care about anything else.

Let the customers who cling to data in weird encodings go to someone who makes a niche of supporting that.

[+] djha-skin|1 year ago|reply
Joel Spolsky spoke against this exact statistics-based approach when he wrote about Unicode[1]:

> What do web browsers do if they don’t find any Content-Type, either in the http headers or the meta tag? Internet Explorer actually does something quite interesting: it tries to guess, based on the frequency in which various bytes appear in typical text in typical encodings of various languages, what language and encoding was used. Because the various old 8 bit code pages tended to put their national letters in different ranges between 128 and 255, and because every human language has a different characteristic histogram of letter usage, this actually has a chance of working. It’s truly weird, but it does seem to work often enough that naïve web-page writers who never knew they needed a Content-Type header look at their page in a web browser and it looks ok, until one day, they write something that doesn’t exactly conform to the letter-frequency-distribution of their native language, and Internet Explorer decides it’s Korean and displays it thusly, proving, I think, the point that Postel’s Law about being “conservative in what you emit and liberal in what you accept” is quite frankly not a good engineering principle.

1: https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...

[+] bhaney|1 year ago|reply
I'm just gonna assume UTF-8
[+] duskwuff|1 year ago|reply
I'm disappointed that the article doesn't discuss this in more detail. Most byte sequences are not valid UTF-8. If you can decode a message as UTF-8 with no errors, that is almost certainly the correct encoding to use; it's extremely unlikely that some text in another encoding just happened to be perfectly valid as UTF-8. (The converse is not true; most 8-bit text encodings will happily decode UTF-8 sequences to nonsense strings like 🚩.)

If UTF-8 decoding fails, then it's time to pull out the fancy statistical tools to (unreliably) guess an encoding. But that should be a fallback, not the first thing you try.
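(That ordering is easy to express as a sketch; the fallback here is stubbed as Windows-1252 where a real program would call a statistical detector such as chardet:)

```python
def guess_encoding(raw: bytes) -> str:
    """Strict UTF-8 validation first; only guess if that fails."""
    try:
        raw.decode("utf-8")
        return "utf-8"  # valid UTF-8 is almost certainly really UTF-8
    except UnicodeDecodeError:
        # Fallback: hand off to a statistical detector here.
        # Stubbed as cp1252, a common real-world culprit.
        return "cp1252"

assert guess_encoding("héllo".encode("utf-8")) == "utf-8"
assert guess_encoding("héllo".encode("cp1252")) == "cp1252"
assert guess_encoding(b"plain ascii") == "utf-8"  # ASCII is valid UTF-8
```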

[+] calpaterson|1 year ago|reply
榥\ue0af侬펭懃䒥亷
[+] klysm|1 year ago|reply
You can’t just do that! /s
[+] hnick|1 year ago|reply
Based on my past role, you can't even assume UTF-8 when the file says it's UTF-8.

Clients would constantly send CSV or other files with an explicit BOM or other marking indicating UTF-8 but the parser would choke since they just output native Windows-1252 or similar into it. I think some programs just spit it out since it's standard.

[+] groestl|1 year ago|reply
I will assume it, I will enforce it where I can, and I will fight tooth and nail should push come to shove.

I got 99 problems, but charsets aint one of them.

[+] zadokshi|1 year ago|reply
Better to assume UTF8 and fail with a clear message/warning. Sure you can offer to guess to help the end user if it fails, but as other people have pointed out, it’s been standard for a long time now. Even python caved and accepted it as the default: https://peps.python.org/pep-0686/
[+] Veserv|1 year ago|reply
Off-topic, but the bit numbering convention is deliciously confusing.

Little-endian bytes (lowest byte is leftmost) and big-endian bits (bits contributing less numerical value are rightmost) are normal, but the bits are referenced/numbered little-endian (first bit is leftmost even though it contributes the most numerical value). When I first read the numbering convention I thought it was going to be a breath of fresh air of someone using the much more sane, but non-standard, little-endian bits with little-endian bytes, but it was actually another layered twist. Hopefully someday English can write numbers little-endian, which is objectively superior, and do away with this whole mess.

[+] kstrauser|1 year ago|reply
> Hopefully someday English can write numbers little-endian, which is objectively superior

Upon reading this, I threw my laptop out the window.

[+] o11c|1 year ago|reply
Defaulting to UTF-8 is better than the linked suggestion of using a heuristic, but failing catastrophically when old data is encountered is unacceptable. There must be a fallback.

(Note that the heuristic for "is this intended to be UTF-8" is pretty reliable, but most other encoding-detection heuristics are very bad quality)

[+] lifthrasiir|1 year ago|reply
You can't just assume UTF-8, but you can verify that it is almost surely encoded in UTF-8 unlike other legacy encodings. Which makes UTF-8 the first and foremost consideration.
[+] norir|1 year ago|reply
If it's turtles all the way down and at every level you use utf-8, it's hard to see how any input with a different encoding (for the same underlying text) will not be detected before any unintended side effects were invoked.

At this point, I don't see any sufficiently good reason to not use utf-8 exclusively in any new system. Conversions to and from other encodings would only be done at well defined boundaries when I'm calling into dependencies that require non utf-8 input for whatever reason.

[+] bandyaboot|1 year ago|reply
> In the most popular character encoding, UTF-8, character number 65 ("A") is written:

> 01000001

> Only the second and final bits are 1, or "on".

Isn’t it more accurate to say that the first and penultimate bits are 1, or “on”?

[+] fl7305|1 year ago|reply
It depends on whether your bit numbering is like x86 (your description), or PowerPC (left most bit is 0).
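(Both conventions applied to the article's example byte, as a quick Python check:)

```python
bits = format(ord("A"), "08b")
assert bits == "01000001"

# msb-0 (PowerPC-style): number bits left to right starting at 0.
msb0_on = [i for i, b in enumerate(bits) if b == "1"]
assert msb0_on == [1, 7]  # the article's "second and final" bits

# lsb-0 (x86-style): bit 0 is the least significant, on the right.
lsb0_on = sorted(7 - i for i in msb0_on)
assert lsb0_on == [0, 6]  # the "first and penultimate" reading
```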
[+] skerit|1 year ago|reply
This confused me too. Until reading this I didn't even think much about how I read binary numbers right-to-left by default.
[+] vitaut|1 year ago|reply
This is so spectacularly outdated. KOI-8 has been dead for ages.
[+] vkaku|1 year ago|reply
The probability of web content not being in UTF-8 keeps getting lower and lower.

Last I tracked, as of this month, 0.3% of surveyed web pages used Shift JIS, and it has been declining steadily. I really hope people move to UTF-8. While it is important to understand how the code pages and encodings helped, I think it's a good time to actually start moving a lot of applications to UTF-8. I am perfectly okay if people want to use UTF-16 (the OG Unicode) and its extensions as an alternative, especially for Asian applications.

Yes, historic data preservation requires a different strategy than designing stuff for the future. It is okay to however migrate to these encodings and keep giving old data and software new life.

[+] layer8|1 year ago|reply
Just to be pedantic, the OG Unicode is UCS-2, not UTF-16, the main difference being that surrogate characters didn’t exist originally.
[+] mihaaly|1 year ago|reply
Excellent article, good content, good length, enlightened subtexts and references, joy to read.
[+] lolc|1 year ago|reply
Just the most recent episode: A statistician is using PHP, on Windows, to analyze text for character frequency. He's rather confused by the UTF-16LE encoding and thinks the character "A" is numbered 4100 because that's what is shown in a hex-editor. I tried explaining about the little-endian part, and mb-string functions in PHP. And that PHP is not a good fit for his projects.

Then I realized that this is hilarious and I won't be able to kick him from his local minimum there. Everything he could learn about encodings would first complicate his work.
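(What the statistician saw in his hex editor, reproduced in Python:)

```python
# "A" is code point 65 (0x41); UTF-16LE stores the low byte first,
# so a hex editor shows 41 00 -- which reads as "4100".
assert "A".encode("utf-16-le").hex() == "4100"
assert "A".encode("utf-16-be").hex() == "0041"  # big-endian puts 0x00 first
assert ord("A") == 0x41 == 65  # the code point itself is 65, not 0x4100
```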

[+] flohofwoe|1 year ago|reply
The post seems to assume that only UTF-16 has a byte order mark, but as pointless as it sounds, UTF-8 has a BOM too (EF BB BF). It seems to be a Windows thing though; I haven't seen it in the wild anywhere else (and even on Windows only rarely, since text editors typically allow saving UTF-8 files with or without a BOM; I guess which is the default depends on the editor).
[+] LinAGKar|1 year ago|reply
That's not really a byte order mark though, it's just the UTF-8 encoding of U+FEFF, which corresponds to the byte order mark in UTF-16. Honestly, emitting that into UTF-8 was probably the result of a bug originally, caused by Windows Unicode APIs being designed for UTF-16.
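(The relationship is mechanical, as a quick Python check shows:)

```python
# The "UTF-8 BOM" is just U+FEFF run through the UTF-8 encoder:
assert "\ufeff".encode("utf-8") == b"\xef\xbb\xbf"

# In UTF-16 the same code point genuinely marks byte order:
assert "\ufeff".encode("utf-16-le") == b"\xff\xfe"
assert "\ufeff".encode("utf-16-be") == b"\xfe\xff"

# Python's "utf-8-sig" codec strips a leading BOM on decode:
assert b"\xef\xbb\xbfhello".decode("utf-8-sig") == "hello"
```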
[+] calpaterson|1 year ago|reply
Yes you're right, UTF-8 technically does as well. I've never seen them in real life either.

UTF-16 BOMs do have a useful function as I recall: they really help Excel detect your character encoding (Excel is awful at detecting character encoding).

[+] rob74|1 year ago|reply
30 years ago: "you can't just assume ASCII"

Today: "you can't just assume UTF-8"

The more things change, the more they stay the same...