My most common source of unintentional BOMs is PowerShell. By default, 'echo 1 > test' writes the raw bytes 'FF FE 31 00 0D 00 0A 00'. It's not too likely that ends up in a shell script, though.
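For what it's worth, those eight bytes decode exactly as advertised; a quick Python sketch (note that Windows PowerShell 5.x uses UTF-16LE for redirection, while PowerShell 7+ defaults to BOM-less UTF-8, I believe):

```python
# The eight bytes that 'echo 1 > test' writes in Windows PowerShell:
data = bytes([0xFF, 0xFE, 0x31, 0x00, 0x0D, 0x00, 0x0A, 0x00])

# FF FE is the UTF-16LE byte-order mark; Python's generic 'utf-16'
# codec consumes it and decodes the remaining bytes accordingly.
print(repr(data.decode('utf-16')))  # → '1\r\n'
```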
That's UTF-16, though, which is basically the only place where byte-order marks are actually useful. (It's arguably the thing they were designed for: telling whether you have little-endian or big-endian UTF-16. I've never personally encountered UTF-16BE, so I can't easily tell how widespread it is, but I suspect that while its use is rare, it's unfortunately not zero, so you still need to know what byte order you have when encountering UTF-16.)
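A small Python sketch of the BOM doing its one real job; the example bytes are my own:

```python
# The character 'A' in the two UTF-16 byte orders, each with its BOM:
be = bytes([0xFE, 0xFF, 0x00, 0x41])  # big-endian:    BOM is FE FF
le = bytes([0xFF, 0xFE, 0x41, 0x00])  # little-endian: BOM is FF FE

# The generic 'utf-16' codec reads the BOM to pick the byte order,
# so both inputs decode to the same text:
assert be.decode('utf-16') == le.decode('utf-16') == 'A'

# Without a BOM you have to guess, and a wrong guess silently
# produces garbage rather than an error:
print(bytes([0x00, 0x41]).decode('utf-16-le'))  # U+4100, not 'A'
```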
Though that would have been a far worse problem a decade ago. Thankfully, these days any random document you encounter on the Web is 99% likely to be UTF-8, so the correct heuristic is "if you don't know the encoding, try decoding with UTF-8 first. If that fails, try a different encoding." Decoding UTF-8 is practically guaranteed to fail on a document that wasn't actually UTF-8, so trying UTF-8 first is a very safe bet. (Though there's one failure mode: ASCII encoded as UTF-16, which is also valid UTF-8. So you might also try a heuristic that says "if the UTF-8 parses successfully but there's a null byte in every other position in the first 256 bytes, then try reparsing the document as UTF-16").
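That heuristic is only a few lines of Python. This is a sketch under my own assumptions — the function name, the 256-byte window, and the little-endian fallback for BOM-less input are all choices I made, not any standard API:

```python
def guess_decode(data: bytes) -> str:
    """Sketch of the 'try UTF-8 first' heuristic described above."""
    try:
        text = data.decode('utf-8')
    except UnicodeDecodeError:
        # Not valid UTF-8 -- fall back.  BOM-bearing UTF-16 lands
        # here, since FF and FE are never legal UTF-8 bytes.  (A real
        # implementation might try a charset sniffer instead.)
        return data.decode('utf-16')

    # The one failure mode: ASCII stored as BOM-less UTF-16 is *also*
    # valid UTF-8, just with a NUL in every other byte position.
    head = data[:256]
    if len(head) >= 2 and head.count(0) >= len(head) // 2:
        # Assumption: little-endian, the common case on Windows.
        return data.decode('utf-16-le')
    return text
```

With a BOM present, the first branch handles UTF-16 for free, because the BOM bytes themselves are what make the UTF-8 decode fail.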
I'm rambling and getting off my point. The point is, I don't find BOMs in UTF-16 documents to be a problem, because UTF-16 actually needs them. (Well, probably needs them). UTF-8 doesn't need them at all.
rmunn|3 months ago