
My Hardest Bug Ever (2013)

289 points | whack | 3 years ago | gamedeveloper.com

128 comments

[+] lukeadamson|3 years ago|reply
One of my favorites:

I was working on a large project for a wafer fab company, and occasionally the compiler would crash during full builds with SIGILL (illegal instruction, for those who aren’t familiar with the signal). Compiler bugs are never fun, and this was particularly vexing because it was so inconsistent.

It took me a while, but eventually I got around to thinking: What could cause the compiler to execute an illegal instruction? What could cause an illegal instruction at all?

I removed the outer case from my computer, and sure enough, all of the fans had died. The CPU was overheating during intense, long-running builds. Replaced the fans and the “bug” went away!

*This is my first comment since I created my account in 2009. I hope I did it right! ;-)

[+] miki123211|3 years ago|reply
A friend of mine recently bought a computer with a really decent GPU (he needs to process significant amounts of non-English content through Whisper, which requires him to use the large model), and Whisper was running much, much slower than expected. It was a custom build, assembled by a small-ish company here in Poland. He opened the machine up, and it turned out there was some kind of foam inside that the company put there to secure all the components during transport. Thankfully, it was discovered and removed early enough not to cause any damage, but we were all very surprised.
[+] farhaven|3 years ago|reply
That reminds me of one of my childhood PCs.

It was a hand-me-down K6-II with (I think) 233 MHz clock rate. The thing had a tiny fan on top of the cooler that was roughly 4x4cm (if even that). The poor thing generally worked quite nicely, but had stuck bearings so it required a little nudge to spin every time the machine was turned on. I didn't have the side panel on because that was just too much of a hassle and usually would just reach in and start the thing blindly.

I forgot that one day and the machine had been running for about an hour (low load, so nothing too bad). I reached in and promptly gave myself quite a nasty burn blister because I touched the cooler instead of the fan.

[+] wingerlang|3 years ago|reply
Nice story. I've had something similar happen to myself. My computer generally worked well, the few times I tried to game though, it crashed after a while like clockwork. Turns out the fans on the PSU had died. Replaced it and never had those issues again.
[+] bmcooley|3 years ago|reply
Not a hardware bug, but in embedded I ran into a fun one early in my first job. I set up a CI pipeline that took a PR number and used it as the build number in a MAJOR.MINOR.BUILD scheme for our application code. CI pipeline done, everything worked hunky-dory for a while, and the project continued on.

A few months later, our regression tests started failing seemingly at random. One clue was that closing the PR and opening a new one with the exact same changes would cause tests to pass. I don’t remember exactly what paths I went down in the investigation, but the build number ended up being one of them. Taking the artifacts and testing them manually, build number 100 failed to boot and failed regression, and build 101 passed. Every time.

Our application was stored at (for example) flash address 0x8008000. The linker script placed the version information in the first few bytes so the bootloader could read the stored app version, then came the reset vector and some more static data before the executable code. Well, it turns out the bootloader wasn’t reading the reset vector: it was jumping to the first address of the application flash and executing the data there. The firmware version at the beginning of the app was being executed as instructions.

For many values of the firmware version, the instructions that data represented were just harmless garbage, ADD r0 to r1 or something, and the rest of the static data before the first executable code also didn’t happen to have any side effects. But SOMETIMES the build number would be read as an instruction that sent the micro off into la-la land: a hard fault or some other illegal operation. Fixed the bootloader to dereference the reset vector as a pointer to a function and moved on!
[+] jesse__|3 years ago|reply
Another 10/10 bug, thanks for sharing
[+] adave|3 years ago|reply
From CI pipeline to bootloader would make me about-turn and nope out of embedded so fast if that was my first job. That level of skill requirement is like a whole department in one person. Hopefully that company had some patient seniors.
[+] rramadass|3 years ago|reply
I would be surprised if you have a full head of hair after having dealt with that :-)
[+] PaulDavisThe1st|3 years ago|reply
Early 90s, doing the first implementation of scheduler activations in a real kernel on a real machine. There's an occasional bug that shows up; we think it's a race condition or something. After lots and lots of debugging and thinking, we end up in the debugger approaching a line where we think the bug manifests (not where it's caused, but where it manifests). It looked something like this:

   int g = 2;
   if (g) {
      printf ("yes\n");
   } else {
      printf ("no\n");
   }
Obviously most of the time we see "yes", but every once in a while we see "no". Even in the debugger, using stepi, we hit the conditional, we confirm with the debugger that g is indeed non-zero. Totally impossible for the conditional to ever print "no", right?

------------

Well, when you're writing a re-entrant kernel context switch (as scheduler activations requires), you'd better damn well remember to restore ALL the registers on the processor, in particular the one that stores the result of a recent compare instruction.

We had skimped on this tiny step (IIRC, one extra instruction in the context-switch code); the kernel is interrupted after the compare instruction but before the jump; scheduler activations dictates switching to a new thread; when we come back to the original thread, the apparent result of the comparison is reversed, and we print "no".

At least the paper got an award at Usenix that year :)

[+] Tyr42|3 years ago|reply
Oi, I had the same bug on an ARM chip. I ended up debugging by putting

  if (x) {
     if (!x) {
        panic();
  ...
I wasn't saving the comparison bits on context switch.
[+] jesse__|3 years ago|reply
Awesome bug. Thanks for sharing
[+] hakeberio|3 years ago|reply
gnarly bug! what's the name of the paper?
[+] lukeadamson|3 years ago|reply
Another favorite:

Once upon a time, we got a panicked email from a customer whose OmniOutliner file would no longer open. He’d written a novel in it and was understandably keen to not lose his work.

Sure enough, when we opened his file with the debugger attached, it crashed immediately. Curiously, the crash was deep inside Apple’s XML parsing code, which we used indirectly by saving the file in their XML-variant of a property list.

Looking at the file in a text editor, we eventually found a funny-looking character where there should’ve been an angle bracket (an opening or closing bracket of an XML element). Inspecting it in a hex editor revealed that the difference between the actual character and what it should’ve been was precisely one bit.

How on Earth could that happen?! A bit more sleuthing (haha) uncovered more of these aberrations, and it didn’t take long before we realized that they occurred at regular intervals.

We patched it up, emailed it back to the customer, and suggested he check his RAM. He soon replied, thanking us but then asking, “How did you know I had bad RAM from my novel?!”

[+] ekimekim|3 years ago|reply
I encountered a similar issue once. The first indication something was wrong was weird corruption issues across a variety of services in our Kubernetes cluster. In particular I focused in on a service that took gzipped messages from a queue, which was reporting that some messages could not be decompressed.

First I confirmed that I could pull the corrupt message from the queue and it was in fact corrupt - so the problem was not in the consumer (which was throwing the error) or (probably) the queue, but rather the producer which created the compressed message.

On a hunch, I took a corrupted message (about 64KB in total) and wrote a quick program that took each bit of the message and tried the decompress operation with that bit flipped. Sure enough, there was one bit at offset 13000 or so which, if flipped, made the message decompress and at least visually appear intact.

Anyway, it turned out to be a single node with a hardware issue of some kind - rather than diagnose it fully we ended up just replacing the node. Repairing all the corrupted stuff that services on that node sent out was a much bigger concern.

[+] zerd|3 years ago|reply
And this is why ECC is a good thing.
[+] nl|3 years ago|reply
This was my worst bug, in the JVM(!) back in 2002: https://www.artima.com/forums/flat.jsp?forum=121&thread=1011...

> under JDK1.4.1 once 2036 files are open any subsequent opens will delete the file that was supposed to be opened.

Obviously this is bad.

It was worse to debug. "Opening files" includes opening Java class files or JARs, so we'd see a system with some class files or jars missing and spent ages trying to work out why deployment was failing.

Then I saw class files disappear in front of me while I was using the system. That was one of the biggest WTF moments of my career. I assumed someone else was on the computer, then I assumed a virus, then hardware corruption.

It didn't occur to us to think the JVM would delete files instead of opening them for a long time.

Here's the reference in the Java bug database: https://bugs.java.com/bugdatabase/view_bug.do?bug_id=4779905

[+] chasd00|3 years ago|reply
God the JVM deleting files on its own must have been surreal to figure out. I would have taken the rest of the day off haha
[+] gumby|3 years ago|reply
My most memorable hardware bug was nowhere near as hard as this, but I'll never forget it.

Intel was trying to sell the 960s and sent us a dev board with that CPU. Nobody in the company could get it to boot up. It would power up but nothing would show up on the serial port. Eventually it was my turn to look, and for some reason I happened to notice a pullup capacitor on the UART VCC. I looked at the schematics and indeed it was there. A simple jumper to bypass it (back in those days we had big, manly components; none of that surface mount shit) and, what do you know: the serial console responded. It had booted up just fine, but was mute.

After that we could do development but it was immediately clear to me that the 960 was DoA. It's not like we were the first to get that board!

[+] YZF|3 years ago|reply
I was debugging a TI DSP based board that I designed. It would come up, execute some of my code, and die. It took 3 weeks and lots of back and forth with TI's tech support before I found out that some of the ground pins were left disconnected. The guy that laid out the PCB didn't connect them even though they were connected in the schematic. We went back to the PCB editor, zoomed way in, and lo and behold there was this tiny unconnected segment.

I don't remember the details any more (this was like 30+ years ago) but I think TI's support was on the right track, and I was convinced it couldn't be true because I had the connection in the schematic. If you probe this with a DVM or a scope it will look connected (because the pins are connected internally), so it's really hard to find.

This taught me a valuable lesson that anything can be debugged with enough time and persistence. Some things take longer and that's life.

[+] rramadass|3 years ago|reply
>a pullup capacitor on the UART VCC

Why was it there? What were its two ends connected to? I have heard of pull-up resistors and decoupling capacitors but not a pull-up capacitor.

[+] eulgro|3 years ago|reply
So you're saying the 960 failed because of the bad development board?
[+] cushychicken|3 years ago|reply
I found a silicon bug in a memory chip once.

The chip was supposed to read out a unique ID, but instead read out all zeros. Doubly weird, because it was a flash chip. You’d expect a blank flash chip to spit out all 0xff, not all 0x00.

I ran it past the lead EE, and the lead software engineer, and the chip co FAEs, and they all said I must have done something wrong.

But they all came back later having repro’ed my demo.

Two months of kicking it up the chip co later, I got a nice note from the CEO of that chip company saying “Thanks for the bugfix” - with a bottle of Dom Perignon.

That was a cool career highlight.

[+] winrid|3 years ago|reply
My recent weird "bug" was when I installed a new Linux distro, just last week, to get away from weird graphical issues with KDE (switched to PopOS for hardware support).

On boot, my mouse started moving really erratically. I would try to move it and it would just jump all around the screen, but only with my Razer mouse, not my Logitech one.

Great, I think, I traded display issues for mouse driver issues. But it was weird, because it was fine during the live USB.

I spent a bit of time debugging inputs etc, maybe it's a weird driver issue.

I suddenly remembered in my HS days when the school ordered new mousepads which had bright yellow lines on them from some logo, making them incompatible with the "new" laser mice.

It was some cat hair on the sensor :D

[+] chasd00|3 years ago|reply
I was working on an Rx drug pricing system right out of college. I couldn’t always get my price calculations to match what a major insurance carrier came up with, even though the contract clearly stated the formula. Turned out the big carrier had a bug in their calculations that surfaced only under a specific set of circumstances. I felt very proud of myself for figuring out their bug, and did a detailed write-up and submitted it to the carrier. Their response was “yeah, we know; we’re not going to fix it though”. That floored me, but I was right out of college and pretty naive hah.
[+] toolslive|3 years ago|reply
I once (about 10 years ago) experienced hardware that got tired. A customer replaced the usual hard disks with shiny new Seagate SMR drives, because they had more storage capacity. Funny thing is that they could not handle the sustained 100MB/s we were feeding them. So after about 20 minutes they started slowing down, after half an hour they stopped working for about 20 minutes, and then they were fine again. Obviously the customer complained about our storage product and forgot to mention this small fact. Once we figured it out we had a good laugh.
[+] _a_a_a_|3 years ago|reply
That's interesting. My old server about 10 years ago had a Seagate black which died. I replaced it with a Seagate green. I noticed things started slowing down more and more when the disk writes got heavy. It could freeze up for minutes at a time, then recover without any errors. It took me weeks to realise what was happening because… Because I don't actually know why. In hindsight it was obvious. Maybe the Seagate green was an SMR drive. Either way, it was nasty and caused a lot of frustration.

A quick check just now and it seems that the Seagate green were SMR. Fuckers never put that on the box did they. Bastards.

[+] mnw21cam|3 years ago|reply
I bought one of the first versions of those shiny Seagate SMR drives, specifically to store my (encrypted) backups. It failed after a couple of months, so I returned it and got a replacement. Which failed after a couple of months. So I joined the large chorus of "I'll never buy Seagate again".
[+] eyelidlessness|3 years ago|reply
This story is fascinating in a lot of ways, but one which jumps out at me is: I don’t think the particular pre-“aha!” wondering about timing would ever occur to me in the domains I’ve worked. I guess maybe I’d discover it in the repro isolation process because that elimination is often very illuminating (it’s basically how I taught myself to program!), but it wouldn’t ever come to mind unless I was staring at it while debugging.

Say what you want about the ills of high level abstractions, but not having to think about the implementation details of clock sync all the way down to the metal is a pretty nice convenience when you can afford it.

[+] rramadass|3 years ago|reply
One of mine:

I was implementing a TCP split proxy (using Adam Dunkels' lwIP stack) on a custom SoC with a 16-way multi-core (ARM+MIPS ISA mishmash) for the data plane. Memory was divided into different regions, each with a specific set of policies. I had gotten my single-core proxy working and then added a mutex to the TCP control block to parallelize my code across all the cores. Testing resulted in a fatal crash. After rolling back the check-ins one by one, I narrowed the problem down to the load-link/store-conditional (LL/SC) instructions used to implement the mutex. Now I was stuck with no clue as to why executing these instructions resulted in a crash.

Cue me cursing everything about the chip in my cubicle. One of the senior engineers who had been there during the design of the SoC, and hence knew its quirks, heard my lamentation, came over, took a look, and promptly solved the problem. Remember the different policies for the different memory regions I mentioned earlier? It turns out I had placed my TCP control block, and hence the mutex in it, in a region of memory where LL/SC instructions were inadmissible, thus causing the crash. Moving that data structure to a different region of memory solved the problem.

Lesson learned: when working on a custom SoC, take nothing for granted, not even hardware instructions.

[+] andreareina|3 years ago|reply
A bit of a nit, but I think cross-talk is well modeled by classical signal theory; there's no need to invoke quantum mechanics, no?
[+] theposey|3 years ago|reply
technically everything is caused by quantum mechanics, not a great claim lol
[+] glonq|3 years ago|reply
Having spent the better part of 30 years working on/with/around embedded systems, I can't even count how many bugs I've bumped into that were hiding in between software and hardware. Or between software and compiler/tools/OS. Or between hardware and spooky RF black magic.
[+] hnthrowaway0315|3 years ago|reply
I'd love to hear all of those. What RF black magic?
[+] Saigonautica|3 years ago|reply
My favorite bug this month was while setting up a development environment with the AVR-ICE.

I tried to save some company money by not buying the (optional) case and programming cable assembly -- figured I could just use another, non-$80, SWD cable (I also 3D-printed a case and a SOT-23-6 programming adapter).

After much cursing and hair pulling, I noticed that the header for the SWD cable was installed upside-down on the PCB. So the red wire on the ribbon cable was pin 10 instead of pin 1. In their defense, they did correctly indicate this on the solder mask, I just didn't see it through the (opaque) case.

My best guess as to why the cable assembly costs $80 is that they reverse the pin order again on the cable, to silently fix the bug on the PCB instead of just shipping a standard cable.

It turned out to be worth the engineering time to deal with the bug, but not by as much as I hoped. It's a pretty neat product despite this bug, definitely more modern than the venerable STK500 that I used previously (which itself had been converted to a USB device after the level converter failed).

[+] grujicd|3 years ago|reply
Worked on a chart-generating service in Java some 20 years ago. At that time IBM released their JVM. In first tests it worked perfectly, and significantly faster than Sun's JVM. After testing it further, making tens of thousands of charts, we deployed it to production. However, in production it would stop mysteriously after some time. We added a lot of logging; there were no issues in our code. After a while I realized it failed somewhere after 65536 charts were made! That was pretty suspicious. There was nothing in our code that would overflow a 16-bit counter, it worked under the other JVM, and the crash was not a Java exception. If I remember correctly it was not even a crash at all; the entire process would freeze.

It turned out to be a problem with that specific IBM JVM. We created a new thread for each chart, and that JVM froze after 65536 created threads! Moral of the story: if you already test with tens of thousands of requests, make sure it's at least 64k of them.

[+] 8fingerlouie|3 years ago|reply
A decade ago I worked with sortation devices for mail-order companies, and one of our clients reported that they sometimes had issues with items that were sorted wrong, but were unable to reproduce it. They used trays for sorting, and each tray had a barcode with a unique id.

I spent a LONG time looking at logs until I ended up enabling debug logging, and because the site was on a 1200 baud modem I had the client burn the logs to a DVD and ship them to us.

I ended up writing a piece of Perl code to parse the logs and insert them into a MySQL database, where I could then trace the individual sorter trays by id, and by some obscure miracle of sleep deprivation and too much coffee, I managed to find a correlation.

Turns out the bug only showed up when a tray had been used for sorting inbound items, then reused for sorting outbound items, and was then used for sorting inbound items again (not outbound, which would have reset it).

The fix was traced to a single line in an if/else statement.

Time to fix : around 1 hour including tests.

Time to find bug : around 300 hours.

On something more relevant to the article: I used to write operating systems for mobile phones, and we spent A LONG time debugging an issue where our brand new display driver was acting up.

After attaching a Lauterbach debugger we finally managed to track it down to the compiler.

Turns out :

    int i = 1+2+3;
would mean i = 3 in the generated code, as the compiler only considered the first two operands of the expression.

Another fun feature of that compiler was that when you incremented a heap pointer past a memory page boundary, it would forget to increment the page pointer, so the address simply wrapped to 0 and the memory you referenced was nowhere near what you'd expect :)