> If we drop those markers (1110 and 10 in front of bytes) and keep the remaining bits we're left with 1111111011111111, which evaluates to 65279, which is in hexadecimal 0xfeff. Yes, you recognize it, it's a BOM. Because yes a BOM is just a ZERO WIDTH NO-BREAK SPACE, isn't it beautiful?
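The bit arithmetic in that quote is easy to check by hand. Here is a minimal Python sketch, assuming the three bytes being discussed are the UTF-8 BOM `EF BB BF`; the helper name `payload_bits` is made up for illustration and only handles the 3-byte form.

```python
import unicodedata

# The UTF-8 BOM as it appears on disk.
bom_bytes = b"\xef\xbb\xbf"


def payload_bits(data: bytes) -> str:
    """Drop the UTF-8 markers from a 3-byte sequence and return the payload bits.

    Hypothetical helper: only handles the layout 1110xxxx 10xxxxxx 10xxxxxx.
    """
    lead, cont1, cont2 = data
    assert lead >> 4 == 0b1110, "expected a 3-byte lead byte"
    assert cont1 >> 6 == 0b10 and cont2 >> 6 == 0b10, "expected continuation bytes"
    # Keep 4 bits from the lead byte and 6 bits from each continuation byte.
    return f"{lead & 0x0F:04b}{cont1 & 0x3F:06b}{cont2 & 0x3F:06b}"


bits = payload_bits(bom_bytes)
code_point = int(bits, 2)

print(bits)                              # 1111111011111111
print(code_point)                        # 65279
print(hex(code_point))                   # 0xfeff
print(unicodedata.name(chr(code_point))) # ZERO WIDTH NO-BREAK SPACE
```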
Byte Order Marks have stolen hours and days of my life. Anyone suffering the pain of developing on a Windows box can relate. Windows puts BOMs by default at the front of every file. Windows programs silently ignore the BOM, but then Linux machines run the program and choke on it. You have to specifically ask the editor whether the BOM is even there; it doesn't show up by default. I have specific lines in my .vimrc[1] that prevent BOMs from ruining my day/week, but they still pop up often. I often joke there will be a byte order mark on my tombstone, along with the Avahi daemon.
> Byte Order Marks have stolen hours and days of my life.
Me too, to some degree. I have discovered them in a Ruby code base at work, in the middle of a line of code (copy pasted), where the Ruby interpreter thinks they are undeclared identifiers. When the code runs, it throws an exception every time that complains of “Undeclared identifier `‘”.
The dad-joke of it is that “You gotta sweep for BOMs before they blow up your code.”
I have written a cross-platform, standalone CLI program to inspect a file for BOM8 and BOM16. It also detects whether a file uses CRLF or LF, and it checks for tab and NUL characters. Please see the Examples in my repo: https://github.com/jftuga/chars
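For anyone curious what those checks look like, here is a rough Python sketch. It is not how the tool above is implemented, just my own illustration of the same kinds of checks; the function name `inspect_bytes` and the report keys are invented. The line-ending counts are done on raw bytes, so they are only meaningful for ASCII or UTF-8 files.

```python
import codecs


def inspect_bytes(data: bytes) -> dict:
    """Report BOM, line-ending, tab and NUL counts for raw file contents.

    Hypothetical helper for illustration only.
    """
    if data.startswith(codecs.BOM_UTF8):
        bom = "UTF-8"
    elif data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # Caveat: a UTF-32 LE BOM also starts with FF FE, so this sketch
        # would report it as UTF-16.
        bom = "UTF-16"
    else:
        bom = None

    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf  # LF bytes that are not part of a CRLF pair

    return {
        "bom": bom,
        "crlf": crlf,
        "lf": lf,
        "tabs": data.count(b"\t"),
        "nul": data.count(b"\x00"),
    }


sample = codecs.BOM_UTF8 + b"a\tb\r\nc\n"
print(inspect_bytes(sample))
```

To run it on a real file, read the file in binary mode (for example `open(path, "rb").read()`) so that nothing gets decoded or translated before the checks.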
I've dealt with two elusive bugs which were ultimately caused by Windows stupidly using UTF-8 with BOM by default. Python requires you to take extra steps to decode that garbage, and some C++ libraries can't handle it at all.
I'm sure there were good reasons that BOM sounded like the right idea at Microsoft, but everyone else just used straight UTF-8 and it was fine.
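For reference, the "extra step" in Python is usually just picking the `utf-8-sig` codec instead of `utf-8`. A small sketch with made-up sample data:

```python
import codecs

# Made-up sample: a UTF-8 file body with a BOM in front.
raw = codecs.BOM_UTF8 + "name,value\n".encode("utf-8")

plain = raw.decode("utf-8")         # the BOM survives as U+FEFF at the start
stripped = raw.decode("utf-8-sig")  # the BOM is dropped if present

print(repr(plain))     # '\ufeffname,value\n'
print(repr(stripped))  # 'name,value\n'

# utf-8-sig is also safe on input that has no BOM at all.
no_bom = b"name,value\n".decode("utf-8-sig")
print(repr(no_bom))    # 'name,value\n'
```

The same codec name works when reading files, e.g. `open(path, encoding="utf-8-sig")`.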
My high school English classes would upload any papers students wrote to a site that would check for plagiarism. I figured out that if I inserted random zero-width no-break spaces in the middle of words, my plagiarism score would drop to zero.
Presumably the plagiarism system was just looking for exact matches of long substrings.
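If that guess is right, the effect is easy to reproduce. The Python sketch below assumes a naive checker that looks for shared fixed-length substrings; the real checker's algorithm is unknown, and `disguise` and `shares_substring` are hypothetical names.

```python
ZWNBSP = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE


def disguise(text: str) -> str:
    """Insert a ZWNBSP into the middle of every space-separated word."""
    words = []
    for word in text.split(" "):
        mid = len(word) // 2
        words.append(word[:mid] + ZWNBSP + word[mid:])
    return " ".join(words)


def shares_substring(a: str, b: str, length: int) -> bool:
    """True if any `length`-character window of `a` occurs verbatim in `b`."""
    return any(a[i:i + length] in b for i in range(len(a) - length + 1))


source = "the quick brown fox jumps over the lazy dog"
copied = disguise(source)

# Looks identical when rendered, but no 10-character run matches any more.
print(shares_substring(copied, source, 10))   # False

# Stripping the invisible character before comparing restores the match.
cleaned = copied.replace(ZWNBSP, "")
print(shares_substring(cleaned, source, 10))  # True
```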
I mean, it's the shortest _possible_ pull request (since I don't think you can make a git diff of zero bytes, barring some weird quirk), but also probably has the highest PR description : PR diff length ratio of any PR I've seen.
Given that a BOM is three bytes, I don’t really agree that it’s the shortest. How about replacing a CRLF by LF? That one is invisible in many contexts as well.
I love the title. I like to ask for pull requests with this exact description to get my colleagues to look at them faster when the change is something small like a single character. For example, when it's a two-character PR I say "hey, the second smallest PR in the world". Guess I was wrong!
You can easily check in your CI that your files are ASCII (which code probably should be) with file(1). There is probably an off-the-shelf tool that can validate that all characters are printable, ASCII or Unicode.
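If file(1) isn't available in the CI image, the ASCII part of that check is only a few lines of Python. This is a sketch, not an existing tool; `non_ascii_offsets` is a made-up name, and it only flags bytes outside 7-bit ASCII (which includes every byte of a UTF-8 BOM), not unprintable control characters.

```python
def non_ascii_offsets(data: bytes, limit: int = 5) -> list:
    """Return the offsets of up to `limit` bytes outside the 7-bit ASCII range."""
    offsets = []
    for index, byte in enumerate(data):
        if byte > 0x7F:
            offsets.append(index)
            if len(offsets) == limit:
                break
    return offsets


# A source file that starts with a UTF-8 BOM is flagged at its first three bytes.
with_bom = b"\xef\xbb\xbfprint('hi')\n"
print(non_ascii_offsets(with_bom))          # [0, 1, 2]

# A pure-ASCII file produces no findings.
print(non_ascii_offsets(b"print('hi')\n"))  # []
```

In a CI job you would call this on the bytes of each tracked file and fail the build if any list comes back non-empty.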
djha-skin|3 years ago
1: https://git.sr.ht/~djha-skin/dotfiles/tree/main/item/dot-con...
ddalcino|3 years ago
jftuga|3 years ago
https://github.com/jftuga/chars
TillE|3 years ago
AceJohnny2|3 years ago
Tell us more!
Am4TIfIsER0ppos|3 years ago
Sounds like that is a good choice for the option name
donatj|3 years ago
They always end up +0-0 - see:
https://github.com/ICanBoogie/Inflector/pull/38
christiangenco|3 years ago
benj111|3 years ago
I hope it returns the copied string.
String "" is plagiarised
donatj|3 years ago
frabjoused|3 years ago
mr337|3 years ago
verandaguy|3 years ago
AceJohnny2|3 years ago
dixie_land|3 years ago
silverwind|3 years ago
layer8|3 years ago
mjochim|3 years ago
k470|3 years ago
remram|3 years ago
pabs3|3 years ago
jftuga|3 years ago
It determines the end-of-line format and detects tabs, BOMs, and NUL characters:
https://github.com/jftuga/chars
tmaly|3 years ago