
My Hardest Bug to Debug (2018)

136 points| jsnell | 6 years ago |programminginsteeltoecaps.com | reply

71 comments

[+] AnimalMuppet|6 years ago|reply
Here's my worst:

We had a function that looked like this:

  void f() {
    bool flag = true;
    while (flag)
      g();
  }
This function exited sometimes, which should not be possible.

You say, "Duh, Mr. AnimalMuppet, clearly g is smashing the stack!" But the return address wasn't getting trashed - just the value of flag.

This bug also would disappear if you did something like, for example, print out the address of flag so you could watch it in a debugger.

I chased this bug for a month, off and on. Finally I got desperate enough to print out the assembly produced by the compiler, and things got clearer.

flag was a register variable. (This was gcc compiling for an ARM CPU, by the way.) It lived in R11 (or maybe R12, it's been a long time). When f called g, it just pushed the return address on the stack. But g was going to have its own value to put in R11, so it pushed R11 onto the stack just before allocating space for its own variables. So f's local variable wound up in g's stack frame...

... and g was smashing the stack. Duh.

In particular, g was using queues (msgsnd and msgrcv) to exchange messages with another CPU. The API here is misleading. These functions expect to send a packet of the form

  struct msgbuf {
    long mtype;
    char mtext[1];
  };
and they take a size parameter, among others, because you actually pass in a different structure, with an mtext array big enough to carry your actual information. But the size you pass to the function is expected to be the size of the actual mtext array, not the size of the actual structure (that is, the size doesn't include the size of the mtype field).

The contractors who wrote the code didn't know that. They assumed that they could just pass sizeof(type_similar_to_msgbuf) into msgsnd/msgrcv, and it would work. In fact, they were sending and receiving 4 bytes too many. Which kind of worked, in that the sender and receiver never got out of sync. But when the receiver got four bytes too many, it trashed what was just past it on the stack, which turned out to be the stored value of R11.

The net result was that the flag would get cleared if, on a different CPU, four bytes of unrelated memory were zero.
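
A minimal sketch of the size rule described above, assuming a hypothetical struct status_msg with a 64-byte payload (the struct, the macro and both helper names are made up for illustration, not taken from the original code). The size handed to msgsnd/msgrcv is the size of mtext only; on a 32-bit ARM, passing the size of the whole structure would be exactly sizeof(long) = 4 bytes too many.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

/* Hypothetical message type: shaped like struct msgbuf, with a real payload. */
struct status_msg {
    long mtype;       /* message type, must be > 0 */
    char mtext[64];   /* the payload */
};

/* msgsnd/msgrcv count only the payload, not the mtype field. */
#define STATUS_PAYLOAD_SIZE (sizeof(((struct status_msg *)0)->mtext))

int send_status(int qid, const char *text)
{
    struct status_msg m;

    memset(&m, 0, sizeof m);
    m.mtype = 1;
    strncpy(m.mtext, text, sizeof m.mtext - 1);   /* stays NUL-terminated */

    /* Passing sizeof m here would send sizeof(long) bytes too many. */
    return msgsnd(qid, &m, STATUS_PAYLOAD_SIZE, 0);
}

ssize_t recv_status(int qid, struct status_msg *out)
{
    /* Passing sizeof *out here would let the kernel write past the end of *out. */
    return msgrcv(qid, out, STATUS_PAYLOAD_SIZE, 0, 0);
}
```

With the receive size capped at the payload size, an oversized message fails with E2BIG instead of overwriting whatever sits next to the buffer on the stack.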

[+] StillBored|6 years ago|reply
Well, you didn't mention valgrind, but this is the kind of thing it should have caught. In fact, if you're programming in C/C++, running the application under tools like valgrind, glibc MALLOC_CHECK_, boundschecker, purify, etc. should be done on a regular basis as part of your integration testing.

For sure boundschecker saved me and various coworkers days of debugging a couple of companies back. It's one of those tools that pays for itself the first time it finds a truly evil bug.

Also, POSIX is full of gotchas (personally I think it's worse than W32) just waiting to catch the unwary. FD_SETSIZE with select() bit me hard a few years back by causing a bug that only happened when the file descriptors eventually got >= FD_SETSIZE, but that was hardly the worst bug I've found.
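
For the select() case, a small guard along these lines would at least turn the silent corruption into a visible error. This is only a sketch; fd_set_checked is a hypothetical helper name, not a POSIX function.

```c
#include <assert.h>
#include <sys/select.h>

/* Add fd to the set only if it fits.
 * Returns 0 on success, -1 if fd is outside the range an fd_set can hold. */
int fd_set_checked(int fd, fd_set *set)
{
    if (fd < 0 || fd >= FD_SETSIZE)
        return -1;   /* FD_SET on this fd would be undefined behavior */
    FD_SET(fd, set);
    return 0;
}
```

Switching to poll() sidesteps the limit entirely, since it takes an array of descriptors rather than a fixed-size bitmap.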

[+] saagarjha|6 years ago|reply
Psst…I'd mark that flag volatile or atomic if I were you; a smart C11 compiler might mark your loop as terminating ;)
[+] Smithalicious|6 years ago|reply
Ah yes, our old friends of impossible bugs: compiler optimizations, disappearing when observed, and triggered far, far away. Yikes, man.
[+] shdon|6 years ago|reply
I think my hardest thing to debug wasn't really a bug at all. It was having to clean up some legacy code I'd inherited and one function which was a bottleneck had a few hundred lines of horribly complicated code with lots of calculations which didn't make sense to me. It took a lot of time to figure out that the calculations were wholly unnecessary. The results of the calculations either got discarded or unconditionally overwritten. In the end, the refactoring consisted of a single press of the "delete" key, but building up the confidence to press it was what cost a lot of time and effort.
[+] bballer|6 years ago|reply
Maybe your case was more complicated but most IDEs now will tell you about unused variables, methods/functions etc. So if variables storing the calculations that were being built up never ended up being used in the end context of the function you may have gotten a nice gray/red line. It becomes harder to trace if those values are being updated into some table which then never gets used.

It is amusing that sometimes huge swaths of complicated code exist simply because everyone inheriting them is too scared to touch them even though they are basically orphaned and have a net 0 effect on outcome.

[+] voidmain|6 years ago|reply
It sounds like the camera was sending match results using a blocking send or write call, and no one was reading them, so once the total size of match results exceeds the total size of the socket buffers, the camera software blocks and stops working.
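
The effect is easy to reproduce on a desktop. The sketch below is an assumption-laden stand-in, not the camera's code: it uses a local AF_UNIX socket pair instead of TCP, and a non-blocking writer so that "would block forever" shows up as EAGAIN instead of a hang. The function name is made up.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Write to one end of a socket pair that nobody ever reads from.
 * Returns the number of bytes accepted before the kernel buffers were full,
 * or -1 on an unexpected error. */
long bytes_until_writer_would_block(void)
{
    int sv[2];
    char chunk[4096] = {0};
    long total = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;

    /* Non-blocking, so a full buffer is reported as EAGAIN. */
    int flags = fcntl(sv[0], F_GETFL, 0);
    if (flags == -1 || fcntl(sv[0], F_SETFL, flags | O_NONBLOCK) == -1) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    for (;;) {
        ssize_t n = write(sv[0], chunk, sizeof chunk);
        if (n > 0) {
            total += n;
            continue;
        }
        if (n == -1 && errno == EINTR)
            continue;
        if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
            break;       /* a blocking writer would stall right here */
        total = -1;      /* anything else is unexpected */
        break;
    }

    close(sv[0]);
    close(sv[1]);
    return total;
}
```

The returned count is whatever the kernel was willing to buffer. After that point a blocking writer simply never returns until the other side reads.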
[+] msandford|6 years ago|reply
That was what I thought too! I've been bitten by a similar issue where I thought I was dodging it by using a size unlimited Python Queue but the underlying buffer filled up anyhow.
[+] mokus|6 years ago|reply
Yup, that’s exactly my thought too. I’ve seen this in USB virtual serial port debugging connections too. The software is writing debug messages, nobody reads them, and a day or two later everything just stops because the debug connection’s buffer is full.
[+] plasma|6 years ago|reply
Definitely seen this too.

There may be an OS buffer that becomes full (holding data from network but app has not read from OS buffer) so now it’s blocked.

[+] mc3|6 years ago|reply
I hate questions like "What's the hardest bug you've ever debugged?". It means as I work I need to keep a scrapbook of "will what I did today make a great interview answer in 5 years' time". I'd rather just fix the bug and forget about it.
[+] locusofself|6 years ago|reply
I kind of agree. My memory doesn't work like this, I could spend a week troubleshooting, following various twists and turns and eventually finding a solution with a big aha! moment, and not remember much of anything about it a week later, at least in terms of spontaneous voluntary recollection.
[+] lostcolony|6 years ago|reply
So my first thought was the same thing, and so I started thinking "What would I say"? And while I started with "Hmm, I don't really remember hard bugs; I remember times I spent days, and the lessons I took away, such as the importance sometimes of putting a problem down and sleeping on it...I guess this one time when I had to figure out why something was so slow and eventually tracked it down to an O(n^2) implementation when it could have been linear was cool...oh, but wait, we actually had something here where due to mutable state we were growing a list without end and it was being sent out across the network, and we hadn't tested it in staging because it was just for metrics, and that ended up taking out a downstream system in prod" and...well, I found some stuff. I think that, if asked, something interesting, at least, will come to mind before too long.
[+] pflenker|6 years ago|reply
As someone frequently interviewing people, I ask these kinds of questions because I am interested in what people perceived as a significant challenge in their career, how they tackled it and how they remember it.

That said, I would never judge the answer to this question without the context of other questions and answers.

[+] sqldba|6 years ago|reply
Ditto!!!

I’ve deep dived, investigated, and fixed tonnes of hard problems.

But my memory doesn’t have an index on them. I can’t think of any on the spot and sort them hardest to easiest.

I can think of achievements (and they end up on the resume of course) but those aren’t necessarily debugging a hardest bug, which is a point in time versus months/years of invested effort.

[+] saagarjha|6 years ago|reply
You don’t have to be truthful; I’m sure they’d settle for any complex or interesting issue you’ve had.
[+] mobilemidget|6 years ago|reply
true, so I just know a recent annoying one: gdb always stops on SIGPIPE. Quite annoying when you add a signal handler to ignore SIGPIPEs, yet it keeps exiting... only to realise an hour later that gdb decides that for you by default. Though a clear example of RTFM, PEBCaK.
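
For reference, a sketch of what "ignoring SIGPIPE" looks like outside the debugger (the function name is made up for illustration). With the signal ignored, a write to a pipe with no reader fails with EPIPE instead of killing the process. Under gdb the signal is still intercepted unless you tell it otherwise, e.g. with "handle SIGPIPE nostop noprint pass".

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* Ignore SIGPIPE, write to a pipe whose read end is already closed,
 * and return the errno reported by write() (0 if the write succeeded,
 * -1 if the pipe could not be created). */
int errno_after_write_to_closed_pipe(void)
{
    int fds[2];
    int err = 0;

    signal(SIGPIPE, SIG_IGN);

    if (pipe(fds) != 0)
        return -1;

    close(fds[0]);                      /* no reader left */
    if (write(fds[1], "x", 1) == -1)
        err = errno;                    /* EPIPE instead of a fatal signal */
    close(fds[1]);
    return err;
}
```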
[+] blantonl|6 years ago|reply
I'm not going to lie, but this article left me absolutely hanging off a huge cliff with no one to rescue me.

The actual bug was not fixed!!

[+] lisper|6 years ago|reply
Yes, it was.

> We looked at how we communicate with the camera. We opened a TCP connection the first time it was needed, and left it open for changes to camera settings via the application. We modified this to close the connection once we had sent the required information and open it again if we needed it. We tested this thoroughly over the next few days and it looked solid.

Furthermore...

> I still don't understand how this caused the camera to lock up. We were receiving the TCP results via Telnet but we weren't reading the stream. Did it just build up in some buffer? How did this cause the camera to lock up?

My guess: the buffer filled up and the result was that the computer's OS stopped sending ACK packets to the camera. The camera then locked up waiting for ACKs that never came.

Kinda straightforward actually.

[+] hax|6 years ago|reply
Indeed. I really feel a more accurate title would've been "My Hardest Bug to workaround without really understanding what was happening".
[+] makach|6 years ago|reply
Weird. He did not understand what caused the bug. To me, it is always about understanding the issue so that we can make sure that it doesn't exist in a similar condition elsewhere in the code; just removing the bug is half the job of debugging, the other half is understanding it. This is also where I have my biggest a-ha moments.
[+] nighthawk648|6 years ago|reply
It sounds like because they were not processing the data there was some queue of files that was building up until the buffer was full. Once the buffer was full, the request would be denied by their transport protocols, maybe a security feature run awry?

It seems once you unplug, either via Ethernet or hard reset, the pointer to the MAC address / IP would get refreshed and thus the full load dumped. Which would be consistent with why the bug would recur after 2.5 hours.

I agree it’s funny the author never found out what caused the bug.

[+] pbadenski|6 years ago|reply
Broken printing in an enterprise Java app - it was relatively serious as customer's business process was relying on collaborating using prints. There was no message, no error & it was only failing for one specific customer.

It took us 2 weeks to find the problem. From what I remember some entry in the middle of the Windows PATH variable had a problem. While loading DLLs this caused a silent failure, and caused printing DLLs not to be loaded. It was all in a Citrix environment, which definitely did not help.

[+] axilmar|6 years ago|reply
My hardest bug ever was the following:

A weapon's camera tracked the simulated target perfectly, but there were occasional hiccups, where the camera jumped randomly.

The camera rotation data were confirmed, through Wireshark, to be valid, except for very specific values.

It turned out to be the following problem: a function that converted the endianness of float32 values returned the endianness-switched value as a result, rather than changing the destination buffer directly.

Somewhere in the process, the x86 hardware expanded the switched float value to 80 bits, stored it in the internal FPU stack, and then this value was retrieved from the stack as a float32 and sent over the network.

Only in some cases did this conversion alter the value, which is why the camera jumps were infrequent.

The code that did this conversion was deep inside a library that had been given to us and that was widely used in the contractor's company for all Simulation needs.

The problem went away when I replaced the PDU handling with custom code which switched the endianness of floats in an appropriate manner (i.e. directly to the destination buffer).

The problem most probably was created because the 'standard' functions for integer endianness swap have the form

    uint32_t ntohl(uint32_t v);
    uint16_t ntohs(uint16_t v);
and the author of the library thought that the following:

    float ntohf(float f);
was going to be the same as the above.
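
A sketch of the "directly to the destination buffer" approach, with made-up function names: the swapped bit pattern only ever lives in an integer or a byte buffer, never in a float, so the FPU gets no chance to touch it. (One plausible way the x87 could alter such a value, though the original library was never analysed here, is a swapped pattern that happens to be a signaling NaN being quieted on load.)

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reverse the byte order of a 32-bit integer. */
static uint32_t bswap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Store f into dst with its bytes reversed. */
void put_float_swapped(unsigned char dst[4], float f)
{
    uint32_t bits;

    memcpy(&bits, &f, sizeof bits);    /* reinterpret, no numeric conversion */
    bits = bswap32(bits);
    memcpy(dst, &bits, sizeof bits);   /* swapped bits go straight to the buffer */
}

/* Read a byte-reversed float back out of src. */
float get_float_swapped(const unsigned char src[4])
{
    uint32_t bits;
    float f;

    memcpy(&bits, src, sizeof bits);
    bits = bswap32(bits);              /* back to host order before it becomes a float */
    memcpy(&f, &bits, sizeof f);
    return f;
}
```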
[+] seanhunter|6 years ago|reply
My hardest ever bug to debug was a core dump deep in the guts of some nasty c++ code.

What's so hard about that? Just load the core dump into a debugger! Well loading the core dump into the debugger caused the debugger to dump core...

[+] chopin|6 years ago|reply
I can relate. An older version of the Eclipse debugger would run into a stack overflow when the debugged process had a stack overflow and you paused the affected thread. It took me a while to figure that one...
[+] TeMPOraL|6 years ago|reply
Not the hardest bug I've had, but working in Common Lisp I've run into macros that crash the runtime (SBCL) when you attempt to see what they expand into. Makes it difficult to debug them.
[+] jiveturkey|6 years ago|reply
similar here. bug in jvm ref counting. not only did the java have to be debugged, the jvm itself then had to be debugged. second hardest bug i ever had to deal with.

first hardest was a linux kernel crypto bug.

[+] taneq|6 years ago|reply
My hardest in terms of pure awfulness wasn't a single bug, but was a constellation of crashes in a commercial MMO game engine (which shall remain nameless). From the architecture of the engine it seems to have started life as a model viewer, then they'd stuck that code in a for() loop to display multiple models, bolted on SpeedTree, mashed the whole resultant mess through Umbra (we can occlusion cull, YAY!), and called it a AAA game engine.

And then it got really ugly, because MMO engines need to be able to stream content from disk.

So they took the model loading code and put it in its own thread, and added a few bLoaded flags to various structs to let the renderer know which of the models, textures etc. had finished loading. And called it done.

It kind of worked.

Along with a couple of other coders I was handed this and a bug tracker full of "Was doing <random thing X> and the game crashed" bugs. To make it worse, there was some obscure build issue that meant switching between Debug and Release builds required a full rebuild, which took about 25 minutes. I spent about six months trying to add sufficient synchronization to make it stable. By the time the company folded it was playable, but there were still race conditions and it was still a bit flaky, and to this day I still don't like threads unless they're really necessary.

--

I think the trickiest bug I ever had, in the spirit of the article, was when I was commissioning a piece of mining equipment. It'd been shipped with the controls hardware in a half-finished state and I had to rebuild chunks of it onsite, which meant over a kilometer underground. Eventually I had it all running as intended, it worked perfectly all day, and I thought we were done. Then just as we were packing up for the day, the machine started shutting down. No alarms, no faults, no visible cause, it'd just... stop. I spent an hour pulling my hair out before we had to leave for the day due to scheduled blasting. On the drive back to the surface, I was going over code to figure out what could possibly cause the shutdowns.

Halfway up, it hit me. The only way that the system could stop without a fault was if the 'run' signal from the radio control system was turning off. And the only way that would happen is if the radio transmitter turned off or lost comms. And that's when I realised I'd never fitted the antenna to the radio receiver. So when the battery in the radio transmitter started wearing down after a day of use, the signal strength started getting just low enough for the connection to drop and the machine to shut down.

[+] ggambetta|6 years ago|reply
Oh, I have a couple of fun ones.

"The most impossible one" was during my first "professional" job, while still in college. I was writing part of the GUI of an app - a screen that showed a grid with controls and data. This was ~20 years ago, so C++ and MFC; the control was, IIRC, a regular listbox with some custom drawing.

We got a bug report from a user that was something like "if I do this and that, the empty cells become '3'". Some of my very senior colleagues had drilled into me that CUSTOMERS ALWAYS LIE, so my first reaction was to be sceptical. This was, after all, an impossible bug - most of these cells were absolutely empty (as in, it was impossible for them to have content), and the action the user was describing was absolutely unrelated to this control anyway.

So I tried to reproduce the bug, and to my utter disbelief, the blank cells now showed a "3".

After much debugging, I found the root cause. I was using some API call to lock the char* buffer of an entirely unrelated CString, for some direct manipulation. IIRC the second argument was the length of the string, but there was the bug: I incorrectly thought this arg was the offset, not the length, so I was passing 0 there.

Somehow MFC must have been doing something like interning strings, because by locking this CString of size 0, MFC was giving me a pointer to the char buffer _of the application-wide empty string_ - and I was writing a "3" there for unrelated reasons. So every empty string in the program (or at least for some controls) now was "3".

The fix was trivial once I identified the bug, but I always remember this as the most impossible "can't happen" bug that did, in fact, happen.

-----

"The funniest bug title" was "HANDBAG TURNS INTO A BANANA". By then I was making games (http://www.mysterystudio.com). This was one of the Sherlock games, IIRC, of the "hidden object" genre - there's a list of objects you have to find in a very cluttered scene. When you click on the object, it flies to your inventory. In particular, if the object was partially obscured by the scenery or another object, it would appear unobscured (think "in the foreground") while it flew to the inventory.

To do this, we had a static background with a bunch of "filler" objects, but no "target" objects. Then, for each "target" object, we had a partial (P) and a full (F) image. We'd show the P image of every object, then when the player clicked it, we'd show the F image and animate it to the inventory.

What the tester was reporting was that there was a handbag that, when clicked, turned into a banana.

The root cause was again trivial. For some reason, the artist had made a naming mistake while exporting assets, so we had a P asset of the handbag, but its matching F asset was a banana.

-----

"The most difficult to debug" was at Improbable. In the very early days, part of the world simulation code used Scala and Akka, with a pubsub system (this has been rewritten using more sensible technologies and architectures). During one of our nightly stress tests, we had a bunch of entities essentially walking the extents of the world, looping endlessly for hours. And after hours of running, some entities (but not all) would get stuck, as in "stopped moving", at the boundary of two contiguous simulation regions (still very likely in the same machine).

What followed was a three-week investigation that truly tested my sanity. This was a massively distributed system, and the bug was very hard to reproduce. I spent countless hours staring at logs, diffing logs from "good" runs and "bad" runs, trying to reduce the scenario to isolate the bug. Everything was suspect at some point or other, from our pubsub implementation, to the implementation of hash maps in the Scala standard library.

In the end it turned out to be an extremely subtle race condition in some cache somewhere in the pubsub subscription and desubscription logic. It took 3 weeks of suffering, doubting my own sanity, and being tempted to abandon it (at that point, we were considering just rewriting a big chunk of the code, just to nuke this bug from orbit). I'm glad I persevered and fixed it - it would have haunted me to this day if I hadn't. I still remember its name, and I probably always will. F* you, ENG-168.

[+] dimman|6 years ago|reply
Long story short: Unexpected OpenSSL hard crashes in our application. Turns out our HW was reporting support for unaligned access, whereas it was actually disabled in the CPU due to buggy HW (ARM platform).
[+] saagarjha|6 years ago|reply
> Most people like to regale war stories of a particular missing semi-colon

I’m curious if anyone has any real stories of a missing semicolon causing a problem in their running software.

[+] OskarS|6 years ago|reply
I've had a compiler error that was very sneaky and only happened in certain build configurations. Because it was a compiler error, it was caught before heading into production, but it happened just before the final build, so it led to a stressful hour or two for me ("Hey, the thing we're supposed to ship very soon fails compilation on the platform we're shipping it on!"). Here is a much simplified version of the issue:

    doSomething(),
    DEBUG_MACRO("Print some text");
I had mistakenly put a comma instead of a semicolon at the end of the first row. I didn't notice the typo, because it compiled and ran just fine: the thing on the left of the comma and the thing on the right were both expressions, which is valid with the C "comma operator". But, on some build settings, "DEBUG_MACRO" became an if statement (it expanded to something like "if (debuggingIsOn) ..."), which is not an expression and thus not valid with the comma operator.

Lesson learned: always build and test all build configurations continuously during development. Don't leave it to the last minute.
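
One common guard against this class of problem, sketched under assumptions (ENABLE_DEBUG_LOG and run_once are hypothetical names, and this is not necessarily how the real macro was fixed): give the macro the same statement-like shape in every configuration by wrapping it in do { ... } while (0). Then the stray comma is rejected by every build, not just the unlucky one.

```c
#include <assert.h>
#include <stdio.h>

/* ENABLE_DEBUG_LOG is a hypothetical build switch. Both variants expand to a
 * statement, never an expression, so all configurations accept and reject
 * exactly the same call sites. */
#ifdef ENABLE_DEBUG_LOG
#define DEBUG_MACRO(msg) do { fprintf(stderr, "%s\n", (msg)); } while (0)
#else
#define DEBUG_MACRO(msg) do { (void)(msg); } while (0)
#endif

static int calls = 0;

static void doSomething(void) { calls++; }

int run_once(void)
{
    doSomething();                     /* a ',' here is now a compile error in all builds */
    DEBUG_MACRO("Print some text");
    return calls;
}
```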

[+] dcminter|6 years ago|reply
At university (a long time ago) on my first excursion with the Modula II compiler on the Vax, I accidentally inserted a # as the first character in a long source file. The compiler dutifully reported every line in the file as being in error.

I was inexperienced at reading compilerese (and source control and diffing were not commonplace at that time) and it took me a good day to extricate myself from my bewilderment.

I'm still mildly amused that such a small change to a more-or-less working codebase produced such generous output...

[+] reptation|6 years ago|reply
Not a semi-colon but -

Setting up a GPO to push Microsoft updates to clients in an AD domain, I spent ~8 hours debugging why the clients couldn't connect to the WSUS server. It turned out that there was trailing whitespace in the server URL text field that was getting parsed.

As a side note, this type of story IMO doesn't actually play that well in interview situations.

[+] StillBored|6 years ago|reply
Yes, although I group it as "misplaced/missing" semi-colon. The common case of mixing a control statement/block open/close with the misplaced semi-colon generally seems to be a pretty easy bug to find unless it's buried in firmware or some other hard-to-debug place. OTOH, most decent static analysis tools will warn on constructs with unusual semi-colon indent/block open sequences these days.

AKA

  if ((unusual condition) || (something other condition)); {
     do_something(); 
  }
will buy a nice fat warning the same as using "=" in a control statement. Bonus if the semi-colon comes from a macro so it's not actually visible in the code sequence.
[+] bleuarff|6 years ago|reply
Not a real story, but in theory a missing semicolon in javascript can lead to unexpected results. The semi-colon is optional but omitting it will cause issues if the following line starts with an opening bracket (and maybe other chars too). For instance:

    const foo = [['bar'], ['baz']];
    [0, 1].forEach(x => x*x);
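foo will be the array, as expected. But forget the semicolon and the [0, 1] on the second line is parsed as an index into the array literal on the first line, so foo is assigned the return value of forEach, a.k.a. undefined. The syntax is valid, but is not evaluated to what you would naively expect.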
[+] flukus|6 years ago|reply
I don't think I've had it happen in running software but back when I was learning I wasted many hours trying to find where my error was with super unhelpful c++ errors, usually to find the problem was the crappy books I was learning from. In running software I've come across extra semi-colons causing issues, like when c and bash are mixed up:

  if (foo);
    fooBar();
Fortunately I've only wasted a couple of hours on that one.
[+] kabdib|6 years ago|reply
I once spent two weeks tracking down a scheduler bug (it would stop scheduling threads and stall forever) to two assembly-language instructions that were in the wrong order, and would cause a timer interrupt to never fire if there was an interrupt between them. Swapping the two instructions fixed the bug.

So, not a missing semicolon, but pretty close . . .

[+] goshx|6 years ago|reply
Debugging often takes more time than necessary because of flawed assumptions. When you assume something can’t be related and you don’t test the theory, you may end up in a path that will lead you nowhere. I learned over the years to behave in a naive way and always check the assumptions first. That led me to more successes in solving hard bugs than coworkers who I consider smarter than me.
[+] andrey_utkin|6 years ago|reply
Has anybody here solved their actual bugs with record & replay technology, aka reverse debugging, like Undo, rr, MS TTD etc? Would love to hear the stories.
[+] zedware|6 years ago|reply
Old gcc/g++ doesn't assure the thread safety of static variables. And I have a core dump triggered by it.
[+] smabie|6 years ago|reply
If an interviewer asked you what your hardest bug is and you mention a missing semicolon, he should probably just stop the interview right there.
[+] ken|6 years ago|reply
A missing semicolon has an entirely different connotation for C programmers and Lisp programmers.
[+] jiveturkey|6 years ago|reply
and hire you? hard to understand your intent. a missing semicolon can indeed be a stupid hard bug to find.