mikelabatt | 5 months ago

Nice article, thank you. I love UTF-8, but I only advocate it when used with a BOM. Otherwise, an application may have no way of knowing that it is UTF-8, and that it needs to be saved as UTF-8.

Imagine selecting New/Text Document in an environment like File Explorer on Windows: if the initial (empty) file has a BOM, any app will know that it is supposed to be saved again as UTF-8 once you start working on it. But with no BOM, there is no such luck, and corruption may be just around the corner, even when the editor tries to auto-detect the encoding (auto-detection is never easy or 100% reliable, even for basic Latin text with "special" characters).

The same can happen to a plain ASCII file (without a BOM): once you edit it, and you add, say, some accented vowel, the chaos begins. You thought it was Italian, but your favorite text editor might conclude it's Vietnamese! I've even seen Notepad switch to a different default encoding after some Windows updates.

So, UTF-8 yes, but with a BOM. It should be the default in any app and operating system.

rmunn|5 months ago

The fact that you advocate using a BOM with UTF-8 tells me that you run Windows. Any long-term Unix user has probably seen this error message before (copied and pasted from an issue report I filed just 3 days ago):

    bash: line 1:  #!/bin/bash: No such file or directory

If you've got any experience with Linux, you probably suspect the problem already. If your only experience is with Windows, you might not realize the issue. There's an invisible U+FEFF lurking before the `#!`. So instead of that shell script starting with the `#!` character pair that tells the Linux kernel "The application after the `#!` is the application that should parse and run this file", it actually starts with `<FEFF>#!`, which has no meaning to the kernel. The way this script was invoked meant that Bash did end up running the script, with only one error message (because the line did not start with `#` and therefore it was not interpreted as a Bash comment) that didn't matter to the actual script logic.
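A minimal Python sketch of the failure described above. The helper `starts_with_shebang` is a hypothetical name, and it only mimics the "first two bytes must be `#!`" check rather than reproducing the kernel's actual code:

```python
import codecs


def starts_with_shebang(data: bytes) -> bool:
    """Hypothetical stand-in for the check on the first two bytes of a script."""
    return data[:2] == b"#!"


script = "#!/bin/bash\necho hello\n"

plain = script.encode("utf-8")         # no BOM: first bytes are 23 21 ('#!')
with_bom = script.encode("utf-8-sig")  # Python's BOM-writing codec: EF BB BF first

print(starts_with_shebang(plain))
print(starts_with_shebang(with_bom))
print(with_bom.startswith(codecs.BOM_UTF8))
```

The script text is identical in both cases; only the three invisible leading bytes differ, which is why the problem is so hard to spot in an editor.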

This is one of the more common problems caused by putting a BOM in UTF-8 files, but there are others. The issue is that adding a BOM, as can be seen here, *breaks the promise of UTF-8*: that a UTF-8 file that contains only codepoints at or below U+007F can be processed as-is, and legacy logic that assumes ASCII will parse it correctly. The Linux kernel is perfectly aware of UTF-8, of course, as is Bash. But the kernel logic that looks for `#!`, and the Bash logic that looks for a leading `#` as a comment indicator to ignore the line, do *not* assume a leading U+FEFF can be ignored, nor should they (for many reasons).

What should happen is that these days, every application should assume UTF-8 if it isn't informed of the format of the file, unless and until something happens to make it believe it's a different format (such as reading a UTF-16 BOM in the first two bytes of the file). If a file fails to parse as UTF-8 but there are clues that point to another encoding, reparsing it as something else (like Windows-1252) might be sensible.
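One way to read that strategy, as a rough Python sketch. The function name `decode_text` and the choice of `cp1252` as the only fallback are assumptions for illustration; a real application would need better clues than "UTF-8 failed", and this sketch does not try to tell a UTF-32 BOM apart from a UTF-16 one:

```python
import codecs
from typing import Tuple


def decode_text(data: bytes, fallback: str = "cp1252") -> Tuple[str, str]:
    """Return (text, encoding_used). Hypothetical helper, not a standard API."""
    # A UTF-16 BOM in the first two bytes is strong evidence of UTF-16.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data.decode("utf-16"), "utf-16"
    # Otherwise assume UTF-8 until the bytes prove otherwise.
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Invalid as UTF-8: reparse as a legacy 8-bit encoding (an assumption).
        return data.decode(fallback, errors="replace"), fallback


print(decode_text("perché".encode("utf-8")))
print(decode_text(b"perch\xe9"))
print(decode_text("perché".encode("utf-16")))
```

The fallback works as often as it does because most non-ASCII text in a legacy 8-bit encoding is not valid UTF-8, so a failed UTF-8 parse is itself a useful signal.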

But putting a BOM in UTF-8 causes more problems than it solves, because it *breaks* the fundamental promise of UTF-8: ASCII compatibility with Unicode-unaware logic.

mikelabatt|5 months ago

I like your answer, and the others too, but I suspect I have an even worse problem than running Windows: I am an Amiga user :D

The Amiga always used all 8 bits (ISO-8859-1 by default), so detecting UTF-8 without a BOM is not so easy, especially when you start with an empty file, or in a scenario like the other one I mentioned.

And it's not that Macs and PCs don't have 8-bit legacy or coexistence needs. What you seem to be saying is that compatibility with 7-bit ASCII is sacred, whereas compatibility with 8-bit text encodings is not important.

Since we now have UTF-8 files with BOMs that need to be handled anyway, would it not be better if all the "Unicode-unaware" apps at least supported the BOM (stripping it, in the simplest case)?
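For what it's worth, the "strip it, in the simplest case" option is small. A Python sketch, where `strip_utf8_bom` is a hypothetical helper name; the `utf-8-sig` codec is Python's built-in way of doing the same thing while decoding:

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM (EF BB BF) if present; otherwise return data unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = codecs.BOM_UTF8 + b"#!/bin/sh\n"
without_bom = b"#!/bin/sh\n"

print(strip_utf8_bom(with_bom) == without_bom)
print(strip_utf8_bom(without_bom) == without_bom)

# Decoding with "utf-8-sig" accepts input with or without a leading BOM.
print(with_bom.decode("utf-8-sig") == without_bom.decode("utf-8-sig"))
```

The catch is that every tool that reads the file has to do this, which is the part that legacy, byte-oriented programs do not do.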

3036e4|5 months ago

Also some XML parsers I used choked on UTF-8 BOMs. Not sure if valid XML is allowed to have anything other than clean ASCII in the first few characters before declaring what the encoding is?

taffer|5 months ago

I respectfully disagree. The BOM is a Windows-specific idiosyncrasy resulting from its early adoption of UTF-16. In the Unix world, a BOM is unexpected and causes problems with many programs, such as GCC, PHP and XML parsers. Don't use it!

The correct approach is to use and assume UTF-8 everywhere. 99% of websites use UTF-8. There is no reason to break software by adding a BOM.

Cloudef|5 months ago

BOM is awful as it breaks concatenation. In the modern world everything should just be assumed to be UTF-8 by default.
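A small Python sketch of the concatenation problem, assuming two files that were each saved with a UTF-8 BOM (the byte strings here stand in for file contents, and joining them is what `cat a b` would do):

```python
# Two "files", each written with a leading UTF-8 BOM.
part1 = "first\n".encode("utf-8-sig")
part2 = "second\n".encode("utf-8-sig")

# Byte-level concatenation keeps both BOMs.
combined = part1 + part2

# A BOM-aware decoder strips only the BOM at the very start of the data.
text = combined.decode("utf-8-sig")

# The second BOM survives as a stray U+FEFF in the middle of the text.
print("\ufeff" in text)
print(text.index("\ufeff"))
```

Without BOMs, the same concatenation of two UTF-8 files is simply a valid UTF-8 file.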

cryptonector|5 months ago

You do not need a BOM for UTF-8. Ever. Byte order issues are not a problem for UTF-8 because UTF-8 is manipulated as a string of _bytes_, not as a string of 16-bit or 32-bit code units.

You _do_ need a BOM for UTF-16 and UTF-32.
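A quick Python illustration of why byte order matters for UTF-16 but not for UTF-8, using `é` (U+00E9) as an arbitrary example character:

```python
ch = "\u00e9"  # é

utf8 = ch.encode("utf-8")          # one fixed byte sequence, no byte-order variants
utf16_le = ch.encode("utf-16-le")  # 16-bit code unit, low byte first
utf16_be = ch.encode("utf-16-be")  # same code unit, high byte first

print(utf8.hex())
print(utf16_le.hex())
print(utf16_be.hex())

# Reading little-endian bytes as big-endian yields a different code point.
misread = utf16_le.decode("utf-16-be")
print(hex(ord(misread)))
```

The UTF-16 code unit is the same number in both cases; only the order of its two bytes changes, and that is the ambiguity a BOM resolves.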

mikelabatt|5 months ago

In a pure UTF-8 world we would not need it, sure. I get that point. But what do you want to do with 40+ years' worth of text files that came after 7-bit ASCII, where they may coexist with UTF-8? If we want to preserve our past, the practical solution is that the OS or app has a default character set for 8-bit text encoding, in addition to supporting (and using as a default) UTF-8.

I also agree that "BOM" is the wrong name for a UTF-8... BOM. Byte order is not the issue. But still, it's a header that says that the file, even if empty, is UTF-8. Detecting an 8-bit legacy character set is much more difficult than recognizing (skipping) a BOM.