On a similar theme, I remember reading a story about a server that would crash mysteriously every couple of weeks. They eventually worked out that this happened whenever there was a new moon or a full moon, and the resulting high tide caused a battleship moored in a nearby harbour to rise just high enough that its powerful radar would interfere with the server.
On a much smaller scale, I once worked for a wireless ISP. We had a customer who called in late September saying her service had been out for a few weeks. I went to her house and discovered that she was in a wheelchair and couldn't reach the controls for her air conditioner, so she was turning it on and off using the on/off switch on a power strip that was sitting on a desk. Her router was plugged into the same power strip. So as soon as the weather got cool enough to not need the AC, she lost her internet.
I once dealt with a moored ship whose satellite Internet was extremely unreliable. It worked for two seconds, then stopped working for two seconds, over and over. The only time it worked reliably was when there was no wind at all. After checking satellite images against the antenna's pointing angle and the ship's position, we eventually figured out that the antenna was pointing directly at a windmill. When the rotor was turning, it intermittently blocked the signal between the satellite and the antenna. Luckily they were able to move the ship 50 meters forward, and magically the Internet started working again.
I recall in the early days of WiFi, the advice around ops circles was that if you were trying to bridge two buildings using wireless, you had to set it up in the summer, not the winter, because the water in the leaves of deciduous trees is enough to attenuate the signal. Otherwise, once the leaves come in, you've gone from "it's working" to "we have to start over," or worse, "yeah we can't actually do this."
There must be a site with all of these stories, but I could not find it right now. The story about the car that would break down if you bought vanilla ice cream is one of my favorites. There’s also the story about the switch that is not connected to anything but crashes the server every single time.
> "When did this start? A few days ago, you said, but did anything change in your systems at that time?"
> "Well, the consultant came in and patched our server and rebooted it.
> Having established that--unbelievably--the problem as reported was true, and repeatable
As far as problems go, this is an easy one to solve. Accurate description of the problem. Accurate reporting of what changed. Problem is consistently repeatable.
Compared to a problem that doesn’t reliably reproduce and for which the person reporting it claims nothing changed, this one is child’s play.
But it’s still amusing to read every time.
I had something similar occur in the early 2000s on a patch release of Solaris 2.6 (I think) where the sleep call was broken and would always return almost instantly. This caused all sorts of weird behaviors on the running system.
I also recall the first time I ran into an issue with MTU on a dedicated frame relay link we had to admin our web farm in the late nineties. One day a developer reported they could log in to our bastion but when they ran “ls -l” in a big enough directory their ssh connection would hang. It turned out the connection would hang whenever a packet was generated near the MTU and we eventually tracked it down to an issue with the frame relay connection. We played with MTUs until we found out what worked. It then took a while to convince our provider what was going on but they eventually replaced a line card on the far end of the connection which allowed us to re-raise the MTU to 1500.
Another fun problem I had was an email-to-SMS gateway I wrote for myself that worked by posting to a Verizon web form. I developed it on a Mac (probably 2002 or so) where it worked fine, but when I deployed it to my Linux box on the same home network, the script couldn’t connect to Verizon’s site. It turned out the Linux box was setting a flag on the TCP connection (ECN I think) that was tripping up Verizon’s web firewall. The work-around was disabling ECN on the Linux box.
> Compared to a problem that doesn’t reliably reproduce and for which the person reporting it claims nothing changed, this one is child’s play.
You forgot +inaccurate reporting of the problem.
We had an employee in IT at a client who would tell us "It's broken." This went on, for every report, for three years, with us asking the same follow-up questions every time. For who, in what way, when doing what, what changed, etc.
As far as I know, that individual still works there.
People like that are the best argument for why a basic income and removing some folks from the workforce would increase efficiency.
In the mid-90s I used to repair PCs. Customer brought PC in where left mouse button did not function.
Easy. Replace mouse. NOPE.
OK, software issue. Reinstall mouse driver. NOPE.
OK, deeper software issue. Replace HDD from working PC. NOPE.
OK, replace RAM? NOPE.
OK, replace motherboard and all add-in cards. NOPE.
At this point we have a different HDD, motherboard, CPU, RAM, video card and mouse. Still left mouse button doesn't work. Mouse moves fine. Right button works.
This reminds me of a problem I'm currently having. My iPhone sometimes freezes completely when I ride BART, requiring a hard reboot. I've noticed it happens when passing the Daly City station. It seems there's a tower nearby whose strong signal causes the problem. It's probably the signal strength level read by the hardware causing an out-of-bounds error somewhere and corrupting the phone's memory.
Do you have an IMSI Catcher [0] detection app on your phone?
I used to have the same issue (EU country) on the metro. One single stop was above ground and near an international conference centre. Every time I went through that station my phone would lock up and need a hard reboot (remove the battery), until I removed the IMSI catcher detection software. After that I kept the phone in flight mode when using that metro line.
Does this happen to anyone else? There aren't a ton of iPhone variants out there, so if it's a baseband-level defect, it would be happening a lot.
I'll also say, if it's just the signal strength being too high, it seems unlikely that would cause memory corruption. The signal strength is probably just an integer, and there aren't any operations defined on integers that involve using other bytes of memory. (If you have a uint8 and add 1 to 255, you just get 0; it doesn't upgrade the integer to a uint16 and overwrite adjacent memory.)
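For anyone who wants to see that wraparound for themselves, here is a minimal sketch using Python's `ctypes` fixed-width types. The variable names are purely illustrative, and this says nothing about how the phone's firmware actually stores signal strength.

```python
import ctypes

# A fixed-width unsigned 8-bit integer holding its maximum value.
signal = ctypes.c_uint8(255)

# Adding 1 wraps around modulo 256; ctypes performs no overflow checking.
wrapped = ctypes.c_uint8(signal.value + 1)

print(wrapped.value)           # 0
print(ctypes.sizeof(wrapped))  # 1 (still one byte; nothing was widened)
```

The value wraps and the storage stays one byte wide, which is why an oversized reading on its own would not spill into adjacent memory.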
At a very large bank here in Australia & NZ, all XML messages going through the main message bus had a trailing space character appended to the end, which broke XML validation on the receiving endpoint.
So the solution was for all endpoints to trim the very last character - not just if it was a space, but to chop off the last character. Apparently this had been the solution for years.
This worked really well until one day someone (probably a new grad) saw the character issue and figured they'd fix it.
A bank-wide P1 incident occurred because every single XML message was now unparsable due to the malformed closing '</xml ' tag. Every single application in the bank had to do an emergency update on its XML parser.
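A rough sketch of why the workaround was so fragile, assuming a made-up message and Python's standard `xml.etree.ElementTree` parser (the bank's real messages and parsers are unknown; `chop_last_char`, `strip_trailing_whitespace` and `parses` are hypothetical helpers for illustration):

```python
import xml.etree.ElementTree as ET

def chop_last_char(message: str) -> str:
    """The long-standing workaround: drop the final character, whatever it is."""
    return message[:-1]

def strip_trailing_whitespace(message: str) -> str:
    """A safer variant: remove trailing whitespace only when it is present."""
    return message.rstrip()

def parses(message: str) -> bool:
    """Report whether the message is well-formed XML."""
    try:
        ET.fromstring(message)
        return True
    except ET.ParseError:
        return False

with_space = "<msg><amount>10</amount></msg> "  # what the bus used to send
fixed = "<msg><amount>10</amount></msg>"        # after the upstream "fix"

print(parses(chop_last_char(with_space)))        # True
print(parses(chop_last_char(fixed)))             # False: the closing '>' is gone
print(parses(strip_trailing_whitespace(fixed)))  # True
```

The unconditional chop only works as long as the bug it compensates for stays in place. The conditional strip behaves the same before and after the upstream fix.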
A classic and wonderful piece of internet lore. If I ever have kids this is one of the ones I'll be telling around the campfire. The one about the internet going down because the delivery truck blocked LoS is a good one too.
There's no reason you couldn't, but distance is not the only source of latency, so you're unlikely to find an existing case of someone doing that intentionally.
Easy enough to whitelist geo-ip matches or large net block ranges for a similar result.
It happened to me, too.
Back in 1995 I was in charge of the Sparc server that handled email.
I got a call telling me that we couldn't send mail outside Spain.
Back then, we had a slow internet connection (128K if I remember well) and sometimes the academic network had issues speaking with the outside world.
Two days later we had more complaints. This time it couldn't be a network issue.
We had the same problem: one OS upgrade made sendmail use a default config, not ours.
Fortunately mail didn't bounce, and after the fix the server was above 20 load average for two days.
Less of a crazy bug than a funny one: I had a friend named Peter March. When his pay check fell on April 1st and was made out to Peter April he obviously thought it was an April Fools joke. It wasn't.
I worked for a company where a proxy server was used for all internet access. Every now and then a page would not load. Error logs pointed me to the usual culprit: DNS. When looking at DNS traffic in tcpdump everything looked normal, except some DNS replies came from RFC 1918 addresses instead of from the DNS server's public IP address. When I talked to the ISP, they blamed me (the proxy) for reusing UDP sockets, and said it was by design that their load balancer would only support one DNS request at a time per UDP socket. If there were two or more in flight, only the first response would be NATed properly. Luckily, I knew that the ISP used the same proxy product internally, so when I asked how they configured their proxy to avoid this issue, they fixed the load balancer within the hour.
We have a banking website which refuses to log me in when I connect over the 5G WiFi but lets me log in when I connect over the regular 802.11 WiFi (non-5G mode). How does the website's login know which WiFi mode I am connecting on?
Back in the day, I had a Nokia E71 smartphone that I used to keep next to my work-provided laptop.
My laptop would freeze for a couple of seconds right before each incoming call. Every single time.
It wasn't all that baffling to me, so I decided to test the thing by placing the phone on top of my huge desktop tower. My overclocked computer simply restarted itself instead of freezing. Props to Lenovo, I guess.
Has anyone experienced an old Visual Studio bug (was it still VC6?) where the compiler would flag the last line of the source file as an error when the file did not end with a CR/LF newline? All the code would look correct in the editor, yet it could not be built.
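If you suspect something like this, a quick check for files whose last line is unterminated can save some head-scratching. This is only a sketch: the helper names and the `*.cpp` pattern are my own assumptions, and it tests for a final line feed (which also covers CR/LF), not whatever that compiler actually required.

```python
from pathlib import Path
import tempfile

def ends_with_newline(data: bytes) -> bool:
    """True if the data is empty or its last byte is a line feed (LF or CR/LF)."""
    return len(data) == 0 or data.endswith(b"\n")

def files_missing_final_newline(root: Path, pattern: str = "*.cpp") -> list[Path]:
    """Return files under root matching pattern whose last line is unterminated."""
    return sorted(
        p for p in root.rglob(pattern) if not ends_with_newline(p.read_bytes())
    )

# Small demonstration using throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "good.cpp").write_bytes(b"int main() { return 0; }\r\n")
    (root / "bad.cpp").write_bytes(b"int main() { return 0; }")
    offenders = [p.name for p in files_missing_final_newline(root)]

print(offenders)  # ['bad.cpp']
```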
[+] [-] PhilRodgers|4 years ago|reply
[+] [-] gabriel_fishman|4 years ago|reply
[+] [-] Thlom|4 years ago|reply
[+] [-] hinkley|4 years ago|reply
[+] [-] moepstar|4 years ago|reply
Turns out, this always happens once a train with radioactive waste on it passes by - causing a few bits to flip in memory and a subsequent crash...
Can't seem to find the story online tho...
[+] [-] jolmg|4 years ago|reply
https://news.ycombinator.com/item?id=28688090
The first reply to that has a story that seems similar to the one you remember:
https://news.ycombinator.com/item?id=28689288
[+] [-] mrtksn|4 years ago|reply
[+] [-] qznc|4 years ago|reply
[+] [-] js2|4 years ago|reply
[+] [-] ethbr0|4 years ago|reply
[+] [-] kingcharles|4 years ago|reply
Easy. Replace mouse. NOPE.
OK, software issue. Reinstall mouse driver. NOPE.
OK, deeper software issue. Replace HDD from working PC. NOPE.
OK, replace RAM? NOPE.
OK, replace motherboard and all add-in cards. NOPE.
At this point we have a different HDD, motherboard, CPU, RAM, video card and mouse. Still left mouse button doesn't work. Mouse moves fine. Right button works.
Only thing left is the case and the PSU.
Replace PSU. Left mouse button works perfectly.
FML.
[+] [-] lostgame|4 years ago|reply
Solving programming problems is sometimes similar.
[+] [-] glitchc|4 years ago|reply
[+] [-] ww520|4 years ago|reply
[+] [-] not1ofU|4 years ago|reply
Edit: rooted / android / HTC phone. [0]: https://en.wikipedia.org/wiki/IMSI-catcher
[+] [-] jrockway|4 years ago|reply
[+] [-] 3np|4 years ago|reply
[+] [-] geoffmunn|4 years ago|reply
[+] [-] lqet|4 years ago|reply
[+] [-] potamic|4 years ago|reply
[+] [-] vishnugupta|4 years ago|reply
https://fs.blog/chestertons-fence/
[+] [-] dang|4 years ago|reply
We can't send email more than 500 miles (2002) - https://news.ycombinator.com/item?id=23775404 - July 2020 (135 comments) (<-- thanks ayewo for finding this!)
The case of the 500-mile email (2002) - https://news.ycombinator.com/item?id=14676835 - July 2017 (56 comments)
Every time we lift a pallet from the shipping room, the server times out (2006) - https://news.ycombinator.com/item?id=13347058 - Jan 2017 (82 comments)
The case of the 500-mile email - https://news.ycombinator.com/item?id=10305377 - Sept 2015 (1 comment)
The 500-mile email (2002) - https://news.ycombinator.com/item?id=9338708 - April 2015 (139 comments)
The case of the 500-mile email - https://news.ycombinator.com/item?id=1293652 - April 2010 (24 comments)
The case of the 500-mile email - https://news.ycombinator.com/item?id=385068 - Dec 2008 (28 comments)
The case of the 500-mile email - https://news.ycombinator.com/item?id=123489 - Feb 2008 (7 comments)
[+] [-] J-Kuhn|4 years ago|reply
Edit: Here is one: https://bugs.launchpad.net/ubuntu/+source/cupsys/+bug/255161...
[+] [-] zenexer|4 years ago|reply
[+] [-] ayewo|4 years ago|reply
We can't send email more than 500 miles (2002) - https://news.ycombinator.com/item?id=23775404 - July 2020 (136 comments)
[+] [-] abeppu|4 years ago|reply
[+] [-] trollied|4 years ago|reply
[+] [-] thot_experiment|4 years ago|reply
[+] [-] abalaji|4 years ago|reply
[+] [-] post-it|4 years ago|reply
[+] [-] jancsika|4 years ago|reply
Like an ssh setting that only allows incoming connections that can prove (well, suggest) their proximity by a series of latency tests?
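The physics behind the idea is simple enough to sketch: a measured round-trip time puts an upper bound on how far away the other end can be, because nothing travels faster than light. The function names below are hypothetical and this is not an existing ssh option. A low RTT only shows that whatever terminated the connection is nearby, and a high RTT proves nothing about distance.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # in vacuum; signals in fibre are slower

def max_distance_km(rtt_seconds: float) -> float:
    """Upper bound on the one-way distance implied by a round-trip time."""
    return SPEED_OF_LIGHT_KM_PER_S * rtt_seconds / 2

def proves_within(rtt_seconds: float, limit_km: float) -> bool:
    """True only if the RTT is low enough that the peer must be within limit_km."""
    return max_distance_km(rtt_seconds) <= limit_km

print(f"{max_distance_km(0.004):.1f}")  # 599.6
print(proves_within(0.004, 800))        # True
print(proves_within(0.050, 800))        # False: slow, but not necessarily far
```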
[+] [-] rtkwe|4 years ago|reply
[+] [-] netflixandkill|4 years ago|reply
[+] [-] betaby|4 years ago|reply
[+] [-] dataflow|4 years ago|reply
[+] [-] j0e1|4 years ago|reply
[+] [-] sodality2|4 years ago|reply
[+] [-] asicsp|4 years ago|reply
[+] [-] eb0la|4 years ago|reply
Good news was no spam came that week.
[+] [-] tempestn|4 years ago|reply
[+] [-] hoppla|4 years ago|reply
[+] [-] atsaloli|4 years ago|reply
http://verticalsysadmin.com/blog/sysadmin-war-story-the-netw...
[+] [-] darekkay|4 years ago|reply
[+] [-] ronzensci|4 years ago|reply
[+] [-] 3np|4 years ago|reply
[+] [-] hoppla|4 years ago|reply
[+] [-] beebeepka|4 years ago|reply
[+] [-] zoomablemind|4 years ago|reply
[+] [-] tardismechanic|4 years ago|reply
Sigh, I miss him so much...
[+] [-] IncRnd|4 years ago|reply