(no title)
mikelabatt | 5 months ago
Imagine selecting New/Text Document in an environment like File Explorer on Windows: if the initial (empty) file has a BOM, any app will know that it is supposed to be saved again as UTF-8 once you start working on it. But with no BOM, there is no such luck, and corruption may be just around the corner, even when the editor tries to auto-detect the encoding (auto-detection is never easy or 100% reliable, even for basic Latin text with "special" characters).
The same can happen to a plain ASCII file (without a BOM): once you edit it, and you add, say, some accented vowel, the chaos begins. You thought it was Italian, but your favorite text editor might conclude it's Vietnamese! I've even seen Notepad switch to a different default encoding after some Windows updates.
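To make that kind of corruption concrete, here is a minimal sketch of what happens when an accented vowel is saved as UTF-8 and then read back under a wrong guess. The sample word and the choice of Windows-1252 as the wrong guess are my own assumptions for illustration, not what any particular editor does:

```python
# A mostly-ASCII word with one accented vowel (sample text is an assumption).
text = "perché"

# Saved as UTF-8 without a BOM: "é" becomes the two bytes C3 A9.
raw = text.encode("utf-8")

# A reader that guesses an 8-bit legacy encoding turns each byte into
# its own character, which is the classic mojibake.
misread = raw.decode("cp1252")

print(misread)  # perchÃ©
```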
So, UTF-8 yes, but with a BOM. It should be the default in any app and operating system.
rmunn|5 months ago
This is one of the more common problems caused by putting a BOM in UTF-8 files, but there are others. The issue is that adding a BOM, as can be seen here, *breaks the promise of UTF-8*: that a UTF-8 file that contains only codepoints at or below U+007F can be processed as-is, and legacy logic that assumes ASCII will parse it correctly. The Linux kernel is perfectly aware of UTF-8, of course, as is Bash. But the kernel logic that looks for `#!`, and the Bash logic that looks for a leading `#` as a comment indicator to ignore the line, do *not* assume a leading U+FEFF can be ignored, nor should they (for many reasons).
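A rough illustration of why a byte-level check stops matching. `starts_with_shebang` is a hypothetical stand-in for that kind of ASCII-only logic, not the actual kernel or Bash code:

```python
import codecs

def starts_with_shebang(data: bytes) -> bool:
    # Hypothetical stand-in for legacy logic that only looks at raw bytes.
    return data.startswith(b"#!")

script = "#!/bin/sh\necho hello\n"

plain = script.encode("utf-8")        # pure ASCII text: same bytes as ASCII
with_bom = codecs.BOM_UTF8 + plain    # EF BB BF now precedes the "#!"

print(plain == script.encode("ascii"))   # True
print(starts_with_shebang(plain))        # True
print(starts_with_shebang(with_bom))     # False
```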
What should happen is that these days, every application should assume UTF-8 if it isn't informed of the format of the file, unless and until something happens to make it believe it's a different format (such as reading a UTF-16 BOM in the first two bytes of the file). If a file fails to parse as UTF-8 but there are clues that point to another encoding, reparsing it as that encoding (like Windows-1252) might be reasonable.
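As a sketch of that decision order, something like the following would do. The function name `decode_text` and the `fallback` parameter are made up for illustration, and a real application would want better clues than "UTF-8 decoding failed" before picking a legacy encoding:

```python
import codecs

def decode_text(data: bytes, fallback: str = "cp1252") -> str:
    """Assume UTF-8 unless the bytes say otherwise (illustrative sketch)."""
    # A UTF-16 BOM in the first two bytes overrides the UTF-8 assumption.
    # Caveat: a UTF-32 LE BOM starts with the same two bytes and is not
    # handled in this sketch.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data.decode("utf-16")  # the codec consumes the BOM
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: reparse as the assumed legacy encoding.
        return data.decode(fallback, errors="replace")

print(decode_text("è".encode("utf-8")))    # è
print(decode_text("è".encode("cp1252")))   # è
```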
But putting a BOM in UTF-8 causes more problems than it solves, because it *breaks* the fundamental promise of UTF-8: ASCII compatibility with Unicode-unaware logic.
mikelabatt|5 months ago
The Amiga always used all 8 bits (ISO-8859-1 by default), so detecting UTF-8 without a BOM is not so easy, especially when you start with an empty file, or in some scenario like the other one I mentioned.
And it's not that Macs and PCs don't have 8-bit legacy or coexistence needs. What you seem to be saying is that compatibility with 7-bit ASCII is sacred, whereas compatibility with 8-bit text encodings is not important.
Since we now have UTF-8 files with BOMs that need to be handled anyway, would it not be better if all the "Unicode-unaware" apps at least supported the BOM (stripping it, in the simplest case)?
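For the simplest case (stripping), a minimal sketch might look like this. `strip_utf8_bom` is a hypothetical helper name; Python's built-in `utf-8-sig` codec does the same thing at decode time:

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    # Drop a leading EF BB BF if present; leave everything else untouched.
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

with_bom = codecs.BOM_UTF8 + b"#!/bin/sh\n"

print(strip_utf8_bom(with_bom))        # b'#!/bin/sh\n'
print(with_bom.decode("utf-8-sig"))    # decodes and skips the BOM
```

Files without a BOM pass through unchanged, so the same code path handles both kinds of input.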
3036e4|5 months ago
taffer|5 months ago
The correct approach is to use and assume UTF-8 everywhere. 99% of websites use UTF-8. There is no reason to break software by adding a BOM.
Cloudef|5 months ago
cryptonector|5 months ago
You _do_ need a BOM for UTF-16 and UTF-32.
mikelabatt|5 months ago
I also agree that "BOM" is the wrong name for a UTF-8... BOM. Byte order is not the issue. But still, it's a header that says that the file, even if empty, is UTF-8. Detecting an 8-bit legacy character set is much more difficult than recognizing (skipping) a BOM.