Ask HN: What was the worst bug you've ever solved ?

[+] mooders|16 years ago|reply

Back when text messaging capability was a rarity on mobile phones, which were themselves rare, I was testing an SMS-based weather forecast service that I had written on behalf of one of the mobile network operators.

The testing worked well on the emulator so I decided to test it over the public network to an actual handset. Only I forgot to advance a recordset through which I was looping, so the code never hit the end of recordset condition. It took me some time to notice there was a problem...

The fact that I crippled a national SMS network for a few hours was bad.

The fact that my company had to pay for each SMS, wiping out out profit for that month was worse.

The fact the handset was mine and on my first date with a girl later that evening (whom I later married) my handset kept beeping with incoming text messages (about 96,000 if I remember) was the ultimate.

The handset didn't have a silent-no vibrate function (either it beeped or it vibrated or it did both) and the SMS inbox filled up after 200 or so messages meant it took days for the inbox to fill up, me to clear it message by message, then fill up again ad nauseam.

Still, I laugh about it now...

[+] pgebhard|16 years ago|reply

You couldn't just turn it off for the duration of the date?

[+] ajju|16 years ago|reply

One of my colleagues did this with our automated notification system. After his phone received about $60 worth of text messages, he panicked and shut down the server!

Then again $60 is only about 1200 messages.

[+] gdp|16 years ago|reply

I inherited a giant hideous stock-management system. It did a certain amount of automated ordering without manual intervention.

Long story short: a nasty race condition meant that it was over-ordering duplicate products from the suppliers to the tune of tens of thousands of dollars per day.

On the general theme, my most frightening software experience was when I met a guy who was the star programmer for a company doing controllers for elevators. I got talking to him, and he showed me some code. It took me about 15 seconds to identify an edge case that would engage the motor while the elevator was at the top floor (thereby attempting to pull the elevator into the roof?). It took me a further hour to explain what the edge case was. The bug wasn't that scary, as I'm sure there are hardware failsafes, but the general dimness of the guy writing software to control lifts was scary.

I started taking the stairs after that.

[+] cmos|16 years ago|reply

In high school I got a job for my local Department of Public works in the Power division. I lived in a small New England town that did their own power, much like most towns do their own water and sewer.

My job was as an assistant to the inventory guy, a 70 year old feisty man with one hand named Al.

I was often bored when Al would tell me to 'go hide somewhere' so I wrote some software to help him manage the inventory system. The power engineers in charge saw this and after a few small programming assignments had me work on updating the newly installed SCADA control system. This was a specialized programming environment that controlled all the power in the town.

We were setting it up to buy power from the local college during the yearly 'peaks' in August, thus reducing our yearly electrical bill by potentially millions of dollars.

After a month of working on it and incrementally adding my changes I screwed up. I knew this when I submitted a change and all the alarms went off at the substation.

Half the towns power was out. I got it back on after an hour, and nobody called with ventilator issues, so I think there was no real harm done.

The engineer in charge of the department laughed it off when he saw my apprehension about the situation. He said in the grand scheme of things they have made far bigger mistakes than that, probably referring to the blown transformer a couple months earlier.

[+] imp|16 years ago|reply

That must have been one scary hour.

[+] edw519|16 years ago|reply

A batch run of only a few thousand items was running all night long, rarely finishing and causing all kinds of problems when people logged in in the morning. The users had been complaining about this for years.

I was given the ticket and found a "SLEEP 10" (causing a 10 second pause for each item) in the 10 year old BASIC code, put in by the original programmer for debugging purposes, and never removed before it was promoted.

I removed the "SLEEP 10" and run time went from 12 hours to 23 seconds.

The users loved me, but my boss was not pleased. He said, "You should have changed it to a SLEEP 5 so we had something else to give them the next time they complained."

[+] bendtheblock|16 years ago|reply

Perhaps it was intentional on the original programmer's part... but he never got round to reducing the sleep time - http://thedailywtf.com/Articles/The-Speedup-Loop.aspx

[+] astangl|16 years ago|reply

Reminds me of back in the old days writing integer assembly language (and later) C apps for microcomputers. As FPUs started to become available, users complained they spent all this money on an FPU and got no speed improvement.

I half-seriously suggested adding FPU detection and intentional delay loops if no FPU was present.

[+] ilitirit|16 years ago|reply

Not so much a bug, but this incident cut my lifespan by about 5 years or so:

I was fixing an account balance on a customer's master database. Any change that happens on the master gets replicated to 30 branches, usually at 10 minute intervals.

I wrote the UPDATE statement, highlighted it, and pressed "Execute". Unfortunately, I didn't select the WHERE clause, so I basically gave all their customers (85000) a 50c credit balance. The IDE I was using also had a bug that caused it to ignore the auto-commit setting (which was turned off), so it basically committed the transaction. First I tried a ROLLBACK, which failed obviously. Realizing I had to act quickly, I disconnected the network cable (I was working on the DBMS server) to stop replication. I extracted the transaction log from the current database into a textfile (a few hundred MB) and restored the database from the most recent backup. Then I basically ran the extracted transaction log as SQL scripts against the database, hoping that it wouldn't fall over. I didn't want to do a normal restore because I was afraid of it showing up in the logs.

Within a very stressful 60 minutes I had everything back to normal. I never told a soul IRL about it.

[+] jacquesm|16 years ago|reply

Which is one of my pet gripes with SQL, that the default is to affect all records on update, that should have been:

UPDATE tablename SET xxx='newvalue' WHERE ALLRECORDS;

Or something to that effect.

[+] cake|16 years ago|reply

I had one of these too. I had to UPDATE a line of a few customers order's logs on a production server. The database was huge (several million lines).

After a few seconds I thought it was taking too much time only to realise I forgot one AND in the WHERE... I struggled with the backup but managed to get the data back just after the first call.

[+] Calamitous|16 years ago|reply

I've done almost this exact thing. The "run only the SQL that's highlighted" feautre is handy, but it can cause serious bad juju on a production box.

Unfortunately, in my case, I ended up having to rebuild all the data from week-old backups and transaction logs.

[+] Kirby|16 years ago|reply

For a different definition of worst:

I started a job recently at an ecommerce company. There was a long-standing bug with the cart display in the upper right of the page always saying that the cart was empty. People would report it all the time, and the quite smart lead programmer said it was something really complicated that he hadn't had time to investigate.

But eventually, after I sort of knew my way around the code, and when I finished up all the tasks on my to-do list, he handed me that as a why-not-investigate sort of thing. He didn't really have any idea, just that the previous long-gone coder had said it was complicated and in the depths of the way the front end code interacted with the order system.

So, I reproduce it, and look at the template code. These two lines, right next to each other:

[% cart_summ = ourdb.cart_summary %] [% IF cart_cumm.qty > 0 %]

Note that the two variables don't match. And this was broken on the site for _FOUR YEARS_. And nobody looked because someone said he had and it was hard, and nobody had time for a hard problem.

facepalm

[+] tetha|16 years ago|reply

hehe. reasons to develop YOUR_LANGUAGE-lint?

[+] nomurrcy|16 years ago|reply

Some users of a (shipped, fairly heavily used) web app we had deployed were getting kicked back to the login screen at random. Sometimes, very frequently.

Looking in the logs we could see that these users were somehow losing their authentication cookie and the application was correctly bouncing them to login. So how were they losing their cookie? Assuming it was a bug in the code we searched and searched to no avail.

Finally I discovered that the hardware load-balancer our CTO/'IT' guy' had insisted on was the culprit. The load balancer would buffer fragmented requests and re-assemble them before sending them on to the server. Unfortunatly the load balancer had a huge bug in its firmare.

If a user was using firefox, on windows, and their request was fragmented such that the first packet contained data up to the end of a header line including the \r but not the \n, so the next packet would start with a \n and then a header name, the load balancer would insert a \n\r between the two packets, thus effectively truncating the HTTP headers, usually before the cookie lines.

When I found this bug I couldn't believe that this was actually happening, I thought I was taking crazy pills, but you could run a sniff on the front and back side of the load balancer and see the request go in well formed and come out busted. We ditched the hardware load balancer and all was well.

[+] tjr|16 years ago|reply

Was converting an avionics subsystem from Ada to C. It was a client application that had to talk to an Ada server, sending and receiving rather huge chunks of data, large, deeply nested, intricate structure types. The C structure type had to match the Ada type exactly, or else it wouldn't work.

I got it working fine on our desktop simulation, but running on the actual hardware it was consistently off. After extensive testing, I realized that it was a bug in the compiler for the target hardware, such that a very particular type of structure (something like, {int, char, float}) was being packed incorrectly, resulting in a 2-byte pad that shouldn't be there. If I reordered the structure elements, it was fine, but that particular grouping and order refused to work correctly.

It was GCC, so we could fix the compiler ourselves, right? Not really, as, for avionics systems the compiler has to be thoroughly qualified for avionics use, and changes equal requalification. I "fixed" it by storing the float as an array of characters, converting it to and from a real float type as we needed to use the data value.

Trivial, perhaps, but I was very excited to resolve the problem, after spending days barking up wrong trees. One usually expects that the problem is not in the compiler... :-)

[+] vicaya|16 years ago|reply

Were you using GCC's __attribute__((__packed__))?

Anyway, the standard way (to handle protocols) is to parse the thing not making any assumption of the struct layout.

[+] f00|16 years ago|reply

Bug with the most spectacular results:

As a (former) hardware engineer, I've worked on many projects where bugs have physical effects. This can range anywhere from amusing to seriously dangerous.

One such bug involved a mistake in the assembly diagram and silkscreen for a circuit board. The result was that a tantalum capacitor was installed backwards on a 12V supply rail.

Tantalum capacitors are polarized, and they fail in a spectacular way when reverse-biased. In this case, the supply rail could source upwards of 20A, so the fireworks were loud and impressive. Luckily the cap was easily replaced and the only permanent damage was cosmetic.

Hardest-to-troubleshoot bug:

In my subsequent return to the world of software, I worked on device drivers for network interfaces (among other things).

NICs frequently operate through a circularly-linked list of packet descriptors, which contain pointers to buffers in RAM where the NIC can DMA packet data. The hardware fills the DMA buffers and marks the descriptor as "used," and the driver chases the NIC around the ring, processing the packet data and marking the descriptors as free.

In testing, we discovered that under long periods (hours, usually) of heavy load, the system would occasionally freak out and stop processing packets. Sometime later, various software modules would crash.

Working backwards through the post-mortem data, I saw that the NIC would get "lost" and dump packet data all over system memory. I dumped the descriptor ring (tens-of-thousands of entries) and wrote some scripts to check it for consistency.

To make a very long story short, when the NIC was stormed with lots of 64b packets with no gaps, it would eventually screw up a DMA transfer and corrupt the "next" pointer in the descriptor ring. On the subsequent trip through the ring, the NIC would chase an errant pointer off into system memory and corrupt other system data structures.

Since hardware can DMA anywhere in RAM, the OS is powerless to stop it. The resulting errors can be ridiculously hard to track down and fix.

[+] tedshroyer|16 years ago|reply

Had an obscure picking id wrap around because a table wasn't getting cleared for debugging purposes which resulted in excessive amounts of beer being delivered to unsuspecting customers at an automated gas station.

Here's a video of part of the result: http://www.youtube.com/watch?v=RUhLDtPnSuQ

[+] ajju|16 years ago|reply

A bug that gives out free beer. Talk about a bug with a silver lining :)

[+] mrduncan|16 years ago|reply

If you're a customer, that's a feature not a bug!

[+] tlb|16 years ago|reply

Runaway robots at Anybots have caused:

  - 2 holes in drywall
  - 1 bent bookshelf
  - 1 dent in concrete floor
  - 1 frightened Jessica
  - http://www.youtube.com/watch?v=qkenIInV9rI

The last one was fun because I have logs showing packets from the PC/104 computer stack (running FreeBSD) connected to the robot while it was in midair.

[+] jacquesm|16 years ago|reply

How do you dent a concrete floor with a robot ?

[+] __david__|16 years ago|reply

The most memorable bugs are the ones that cause physical damage. This was mine:

http://www.youtube.com/watch?v=b7i2KkYYulI

Damage: Blown tire, dented rim, looking like fools in front of our peers.

[+] elblanco|16 years ago|reply

Anytime the effects of code escape into the real world, the result is far more interesting.

[+] plaes|16 years ago|reply

Um.. what was that? Some autonomous vehicle navigation system?

[+] shelfoo|16 years ago|reply

Two immediately jump to mind. One that had a massively bad impact to the company, another that might have..

First, using perl a (later-fired) co-worker added a hardcoded check like the following:

if ($client_id = "specific_id") { #email reports }

Needless to say, we emailed reports for all of our clients to a specific client, didn't go over too well considering that many of them were competitors.. It was particularly bad because he had previously been talked to about flipping the constants to avoid the = vs == bug.

Second, possibly abused but not known for sure, was found a few years after initially being put out. Our webapp created a session ID for each user, MD5 hash.

Except it started like: StringBuffer md5HashedBuffer = new StringBuffer(userId);

Which, because the userId was an int, simply creates a string buffer of size userId, not a string buffer initially populated with userId.

The rest of the hash was added afterwards, then the one-time created, with the result that everybody's session id was the same. Changing your user id in the GET or POST would allow you to be logged in as a different user.

[+] mleonhard|16 years ago|reply

Why weren't both of these bugs caught during code reviews?

[+] Mark_B|16 years ago|reply

A while back, I developed a program to generate invoices for about a dozen busy warehouses. During testing, for convenience sake, I hard-coded in my local printer.

Unfortunately, I forgot to return the printer name to a variable when promoting into production. Hilarity ensued.

[+] jacquesm|16 years ago|reply

Hehe, that one had me laughing here. Ouch. Hope you put enough paper in it ;)

[+] bkz|16 years ago|reply

After having launched our product I was spending some time reviewing commits together with the senior tech lead. Still to this day I can recall the commit number, the filename and write down the code from memory responsible for what turned out to be the source of a bug completely wiping out our users computers. Someone had mixed uncommenting a piece of code together with fixing a bug which hid the fact that some horrible code was active in the product. It took us 5 minutes to produce a fix and push it out to the update servers. Did we end up wiping someones computer? Yup, about a dozen known cases including a couple one in-house. I don't even want to think about how many actual cases there were, considering that we had about 2 million downloads of our product before the bug was fixed.

[+] btilly|16 years ago|reply

How do you define worst?

How about most widespread? Once while trying to debug a CPAN module I figured out that if $condition was false then Perl had a bug causing

my $foo = $bar if $condition;

to leave $foo with whatever value it had on the previous function call. (The exact behavior is more complex than that, but that's a good first approximation.) I then made the mistake of reporting this in a way that made it clear that

my $foo if 0;

was a static variable. Cue years of people like me trying to get the bug fixed and other people trying to keep it around. In the meantime EVERY large Perl codebase that I've looked in has had this idiom somewhere and it has caused hard to notice and reproduce bugs.

How about worst damage to a system? Due to a typo I once caused my employer to send the Bloomberg's ftp system every large file that it had ever sent. Since it sent a large file every day, this crashed their ftp server, meaning that a number of feed providers for Bloomberg didn't have their feeds update that day. I implemented no less than 3 fixes that day, any one of which would keep the same mistake from causing a problem in the future.

How about most (initially) bizarre? Debugging a webpage that a co-worker produced where, depending on the browser window size, you couldn't type into the form. The bug turned out to be a div that had been made invisible but left on top of the page. At the wrong window size it was in front of the form, so you couldn't click on the form elements behind it. (I figured this out by first recreating it with static HTML, then using binary search to figure out what parts of the page I could whack and still reproduce it until I had a minimal example. Then it was trivial to solve.)

[+] jacquesm|16 years ago|reply

That second one reminds me of this: An ISP called planet internet changed their homepage, only to find that they reliably crashed Explorer (3 at the time).

Took a while before the phone rang if I wanted to have a look.

It turned out they had a little animated gif in there with the inter-frame interval set to 0, causing a divide by 0 in Explorer.

That gif was pretty much the last suspect on the list.

Divide & conquer until you are simply staring at the solution and still you don't see it...

[+] synnik|16 years ago|reply

The store locator function on a national pizza chain's web site would completely hang their web server whenever an international search was done. Many, many hours and days of testing and debugging led us to conclude, and build a proof that it was a reproducible bug within IBM's Domino platform, only on AIX boxes, only when: 1) A script using LotusScript, their proprietary language was kicked off, and... 2) A Java agent was then kicked off before the original script completed.

At the time, Java was a new feature within that platform, so there weren't many apps that mized both languages.

After getting to this point, IBM joined in the fix effort, and we had daily conference calls, on which we always had IBM execs lurking because their 6.0 release of the platform was imminent, and this bug had the potential to wreak major havoc if not fixed before launch.

So I cannot personally claim to have done the actual bugfix - the IBM programmer did that. But it was a great learning experience to work together with IBM to find and fix it.

[+] jacquesm|16 years ago|reply

Dominos Pizza :)

[+] masterponomo|16 years ago|reply

The worst bug I encountered was due to IBM MVS (or COBOL--I was never sure which was at fault) losing addressability of part of a variable length record. Now you see it, now you don't. The solution at our shop was to move the whole record to itself before attempting to look into the record. I was a newbie. If the old guard hadn't told me that workaround, I NEVER would have thought of it. This problem eventually went away, but 25 years later we still occasionally ask each other "did you try moving it to itself" when dealing with new problems. We chuckle, while today's newbies shake their heads at our Old Fart humor.

The worst one I ever caused was when Visa started carrying two amount fields in their credit card records. One was the amount in original currency, the other was converted to the receiving system's local currency. I used the original amount. Our hand-made test data used the same currency for both amounts, so no problem in test. Imagine my surprise when we went live and our system started posting original currency amounts to cardholder accounts, which at the time only supported US dollars. Luckily, we caught it early and senior management and cardholders were all good sports about it. I think those credit card statements with massive amounts became collector's items.

[+] dandrews|16 years ago|reply

Since you mention COBOL records were variable-length I'd guess that they probably contained ODOs (Occurs-Depending-On, variable length arrays for those of you who aren't COBOL literate). In order for a group or record MOVE to work properly you had to move the subordinate ODO values first, otherwise the runtime system would miscalculate the target record length, possibly truncating the MOVE.

This also meant you couldn't use READ INTO for variable length records (which is equivalent to a READ followed by a MOVE) without taking some care.

As you say: "newbies shake their heads..."

[+] MrMatt|16 years ago|reply

I worked on a taxi booking and dispatch system, written in c, and running on dos with custom networking via RS232. This system was installed at around 300 locations around the UK, and on one fateful day, every installation crashed.

It came down to me to find and fix the problem, and it was subtle. The clue lay in the fact that all of the sites that crashed did so within about a minute of one another.

Turns out that some of the old, old sections of the software had been written by the MD, who, despite referring to himself as 'the emperor of c', was in fact an atrocious programmer.

The actual trigger was the comms system looking at a byte that determined as to whether a message had been received. This byte was set to the character 'A' if a message was received. It just so happened that the first byte of the current value of the number of seconds since 1970 evaluated to 'A', and had been written into that memory location a negative index into an array that hadn't been initialised.

This negative index into an array that shouldn't have been empty caused a section of memory to be overwritten that made the comms system think that it had received a packet. This snowballed quickly, and took down the system within about five seconds of boot.

Took the best part of two days to track down, and, of course, it was everyone else's fault but the emperors.

[+] duskwuff|16 years ago|reply

Let me guess: The crash occurred slightly before 6 PM on July 22, 2004?

[+] humbledrone|16 years ago|reply

I was working on a C++ daemon process that communicated over a TCP socket. At the time, we were using the Poco library's facilities to do the standard daemon startup stuff (get rid of the controlling pty, point standard fds to /dev/null, etc). Anyway, one of our field installations wasn't working, so I took a look. It turned out that the communications over the TCP socket weren't working -- where the client process expected a few header bytes containing the message length, it was getting wacky values. I tried a bunch of stuff, and in the end, I displayed the header as ASCII, and it showed up as "SQL: INS". This blew me away; this looked like some debugging output that normally goes to standard output when the process wasn't running in daemon mode.

As it turns out, the Poco library didn't read Steven's UNIX book all that closely, and they closed all of the file descriptors when turning a process into a daemon, instead of reopening them to point to /dev/null. So, standard output was closed, and its file descriptor was reused for the TCP socket. Of course, things like "cout" always assume that standard output is at a particular descriptor, so all the standard output from the program was getting written to the TCP socket.

Boy, that was confusing.

125 comments