I've only cried literal tears once in the last ten years, over business. Due to inattention while coding during an apartment move, I pushed a change to Appointment Reminder which was poorly considered. It didn't cause any immediate problems and passed my test suites, but the upshot is it was a time bomb that would inevitably bring down the site's queue worker processes and keep them down.
Lesson #1: Don't code when you're distracted.
Some hours later, the problem manifested. The queue workers came down, and AR (which is totally dependent on them for its core functionality) immediately stopped doing the thing customers pay me money to do. My monitoring system picked up on this and attempted to call me -- which would have worked great, except my cell phone was in a box that wasn't unpacked yet.
Lesson #2a: If you're running something mission critical, and your only way to recover from failure means you have to wake up when the phone rings, make sure that phone stays on and by you.
Later that evening I felt a vague unease about my earlier change and checked my email from my iPad. My inbox was full of furious customers who were observing, correctly, that I was 8 hours into an outage. Oh dear. I ssh'ed in from the iPad, reverted my last commit, and restarted the queue workers. Queues quickly went down to zero. Problem solved, right?
Lesson #3: If at all possible, avoid having to resolve problems when exhausted/distracted. If you absolutely must do it, spend ten extra minutes to make sure you actually understand what went wrong, what your recovery plan is, and how that recovery plan will interact with what went wrong first.
AR didn't use idempotent queues (Lesson #4: Always use idempotent queues), so during the outage, every 5 minutes on a cron job every person who was supposed to be contacted that day got one reminder added to the queue. Fortuitously, AR didn't have all that many customers at the time, so only 15 or so people were affected. Less than fortuitously, those 15 folks had 10 to 100 messages queued, each. As soon as I pressed queues.restart() AR delivered all of those phone calls, text messages, and emails. At once.
Very few residential phone systems or cell phones respond in a customer-pleasing manner to 40 simultaneous telephone calls. It was a total DDOS on my customers' customers.
I got that news at 3 AM Japan time, at my new apartment, which didn't have Internet sufficient to run my laptop and development environment to see e.g. whose phones I had just blown up. Ogaki has neither Internet cafes nor taxis available at 3 AM. As a result, I had to put my laptop in a bag and walk across town, in the freezing rain, to get back to my old apartment, which still had a working Internet connection.
By the time I had completed the walk of shame I was drenched, miserable, and had magnified the likely impact that this had on customers' customers in my own mind. Then I got to my old apartment and checked email. The first one was, as you might expect, rather irate. And I just lost it. Broke down in tears. Cried for a good ten minutes. Called my father to explain what had happened, because I knew that I had to start making apology calls and wasn't sure prior to talking to him that I'd be able to do it without my voice breaking.
The end result? Lost two customers, regained one because he was impressed by my apology. The end users were mostly satisfied with my apologies. (It took me about two hours on the phone, as many of them had turned off their phones when they blew up.)
You'd need a magnifying glass to detect it ever happened, looking on any chart of interest to me. The software got modestly better after I spent a solid two weeks on improved fault tolerance and monitoring.
Lesson the last: It's just a job/business. The bad days are usually a lot less important in hindsight than they seem in the moment.
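On Lesson #4, here is a minimal sketch of what "idempotent" could mean for a queue like this. It is not how Appointment Reminder works; `IdempotentQueue`, `reminder_key`, the key layout, and the sample date are all hypothetical. The idea is that each reminder gets a deterministic key, so a cron job that re-enqueues the same reminder every 5 minutes adds it at most once.

```python
from collections import deque


class IdempotentQueue:
    """In-memory queue that ignores jobs whose key it has already accepted."""

    def __init__(self):
        self._jobs = deque()
        self._seen = set()  # keys of every job ever accepted

    def enqueue(self, key, payload):
        """Add the job unless the same key was enqueued before.

        Returns True if the job was added, False if it was a duplicate.
        """
        if key in self._seen:
            return False
        self._seen.add(key)
        self._jobs.append((key, payload))
        return True

    def drain(self):
        """Yield queued payloads in FIFO order, emptying the queue."""
        while self._jobs:
            _, payload = self._jobs.popleft()
            yield payload


def reminder_key(appointment_id, channel, date):
    """Hypothetical key: same appointment + channel + day gives the same key."""
    return f"{appointment_id}:{channel}:{date}"


queue = IdempotentQueue()
# Simulate an 8-hour outage with the cron job firing every 5 minutes (96 runs).
for _ in range(96):
    queue.enqueue(reminder_key(42, "phone", "2010-12-01"), "Call about appointment 42")

# When the workers come back, the backlog holds one call, not 96.
delivered = list(queue.drain())
```

In a real system the set of seen keys would presumably have to live somewhere durable (a unique constraint on the key column is one common choice), since an in-memory set disappears with the process.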
> I've only cried literal tears once in the last ten years, over business.
> Don't code when you're distracted.
Same story here. I can't remember the exact scenario, but I was concurrently acting under all three of my titles (Developer, Architect, Escalation Engineer/Critical Debugger). The customer (who was 7 hours apart from us) had been battling the issue for 3 months and we were getting nowhere, all thanks to a since-fixed bug in WinDBG which essentially came down to broken stack traces in certain scenarios. For those 3 months I had been working 20-hour shifts (development by day, support by night).
Under that strain I eventually made a screw-up with the dev and it cost QA time. The MD of my region had a sit-down with me and I
> cried literal tears
Needless to say, they were impressed that someone cared so much and sent me home to sleep. The next day I came in and decided to go through the 800 kLOC codebase line by line to see what could be causing the issue. I found it in a few hours.
> Fortuitously, AR didn't have all that many customers at the time, so only 15 or so people were affected. Less than fortuitously, those 15 folks had 10 to 100 messages queued, each.
Excuse me for caviling at your post, but "fortuitously" is a synonym for "accidentally", not "fortunately".
oh man. the idempotent queue thing reminds me of the time i was working for a very-early-stage startup, and we'd managed to persuade an exec from a pretty big company to sign up for a trial. our ceo got a very angry call the next morning; the guy had woken up to 300 email messages because our queue had hiccuped and, yeah, not idempotent.
Not the worst at all, but probably the one I found most amusing. One of my jobs included some sys admin tasks (this wasn't the position, but we all did dev ops), among my other responsibilities. I spent half a day going through everything with the person responsible for most of the admin tasks at the time. She was an extremely diligent and competent admin, did absolutely everything through configuration management and kept very thorough personal logs and documentation on the entire network. One of my first tasks was to change backup frequency (or some other single change) and, going by how I usually did things at the time, I just sudo'd a vi session, changed the frequency and restarted the service.
She found out about it pretty quickly due to having syslog be a constant presence in one of her gnu screen windows and gave me a look. She quickly reverted what I did, updated our config management tool, tested it, then deployed it, while explaining why this was the right way to do things. I slowly came around to doing things the right way and hadn't thought much about the initial incident until we found the personal logs that she had archived and left on our public network share for future reference.
In the entries for the day that I started, we saw the following two lines:
[*] 2007/09/09 09:58 - yan started. gave sudo privs and initial hire forms.
[*] 2007/09/09 10:45 - revoked yan's sudo privs.
> She found out about it pretty quickly due to having syslog be a constant presence in one of her gnu screen windows
I'm amazed that this is possible. How would I set something like that up? A realtime log of only the most significant events of a remote system?
In fact, I'd like to take this opportunity of ignorance-admitting to ask the community for general linux/bsd sysadmin resources. What books should I read, or what topics should I study? I want to become an expert at modern sysops. Modern deployment, hardening, backup, managing dozens of boxen, etc.
I've been thinking of going through any MIT OCW on the subject, but it seems like hard-earned experience might not necessarily translate well to an academic setting. What would you recommend I do?
In late 2008 when I was in the Marines and deployed to Iraq I was following too closely behind the vehicle in front while crossing a wadi and we hit an IED (the first of 3 that day).
Nobody was killed, but we had a few injured. Thankfully the brunt of it hit the MRAP in front of us. If it had hit my vehicle (HMMWV, flat bottom) instead, I probably wouldn't be here.
That was the first major operation on my first deployment, too. Hello, world!
My takeaway? Shit just got real.
We ended up stranded that night after the 3rd IED strike (our "rescuers" said it was too dangerous to get us). It was the scariest day of my life, but in similar future situations it was different. I still felt fear and the reality of the existential threat, but I accepted it. It was almost liberating. Strange.
I deployed for another year after that (to Afghanistan that time). After Afghanistan I left the Corps and started my company. Because if it fails, what's the worst that can happen? Lulz.
This really puts some of the boneheaded moves I've made in my career in perspective. One thing that's always kept me pretty even-keeled after a blowup is to take a breath and tell myself that no matter how badly I've screwed up, I'm still here, still breathing, and there (most likely) is some way out of the hole I've dug, no matter how painful.
Depending on the industry, that might not be the case though. Thanks for your service.
One summer in college, I got an internship at a company that made health information systems. After fixing bugs in PHP scripts for a couple weeks, I was granted access to their production DB. (Hey, they were short on talent.) This database stored all kinds of stuff, including the operating room schedules for various hospitals. It included who was being operated on, when, what operation they were scheduled for, and important information such as patient allergies, malignant hyperthermia, etc.
I was a little sleepy one morning and accidentally connected to prod instead of testing. I thought, "That's weird, this UPDATE shouldn't have taken so long... oh shit." I'd managed to clear all allergy and malignant hyperthermia fields. For all I knew, some anesthesiologist would kill a patient because of my mistake. I was shaking. I immediately found the technical lead, pulled him from a meeting, and told him what happened. He'd been smart enough to set up hourly DB snapshots and query logs. It only took five minutes to restore from a snapshot and replay all the logs, not including my UPDATE.
Afterwards, my access to prod was not revoked. We both agreed I'd learned a valuable lesson, and that I was unlikely to repeat that mistake. The tech lead explained the incident to the higher-ups, who decided to avoid mentioning anything to the affected hospitals.
If it's any consolation, the company is no longer in business.
Just remember when you screw things up: Your mistake probably won't get anyone killed, so don't panic too much.
You didn't screw up here. The entire infrastructure, org chart, and policies that allowed you to accidentally modify a production database containing critical medical information screwed up.
Blaming yourself here is like blaming yourself for being hurt after being told to drive a car with no seatbelt or brakes.
Uh, has anyone on this thread heard of HIPAA? I'm pretty sure having a summer intern get full access to actual patient data shouldn't be possible under a properly implemented set of HIPAA processes, and the same goes for the accidental UPDATE.
The story reminds me of the day I was introduced to "BEGIN TRANS", "COMMIT" and "ROLLBACK" when someone upgraded the Sybase console and helpfully changed the default setting so we didn't need those pesky semi-colons to finish a query any more. The result was:
DELETE * FROM TABLE x
131054 rows deleted
WHERE a = "foo"
>> Malformed query <<
Phone starts to ring a few seconds later as all the users saw their morning's work disappear.
This stuff is way too easy for us noobs. Thank goodness that with modern technology we've found ways to make sure it doesn't happen any more... :-)
Did a similar thing, but in a less critical domain (warehouse management). Updated the status of all packages to "NEW", which would have meant that everyone who ever ordered something from that company would have gotten another delivery for free, provided the articles were in stock.
We were able to restore the data pretty quickly, but we had to interrupt warehouse workflow for a few minutes. They were surprisingly accommodating, almost amused by my mistake.
A local Subway franchise was the very first company that hired me. I was extremely young, shy, and intensely socially awkward, yet excited to join the workforce (as I had my eyes set on a Pentium processor).
When I worked at Subway, the bread dough came frozen, but you would put loaves in a proofer, proof it for a certain amount of time, and then bake it. My first shift, however, got busy and I left several trays in the proofer for a very, very long time. Consequently, they rose to roughly the size of loaves of bread, as opposed to the usual buns.
It was my very first shift alone at any job in my life, so I did the most logical thing I could think of and put the massive buns in the oven. They cooked up nicely enough and I thought I was saved. Until I tried to cut into one.
Back in that day, Subway used to cut those silly u-shaped gouges out of their buns. In retrospect, I think this was most likely a bizarre HR technique designed to weed out the real dummies, but at the time I was oblivious (likely because I was one of the dummies they should have weeded out). When I ran out of the normal bread, I grabbed one of my monstrosities, tried to cut into it, and discovered that it was not only rock hard, but the loaf broke apart as I tried to cut it.
That night, my severe shyness and social awkwardness had their first run-in with the beasts known as angry customers. I was scared I would get fired, so I promptly made new buns, but spent the rest of my shift trying to get rid of my blunder. I discovered some really interesting things about people that night. First, you'd be surprised how incredibly nice customers are if you are straight up with them. Some customers I had never met before treated the big, crumbly buns as an adventure and, in doing so, helped me sell all the ruined buns.
In the end, I came clean (and didn't get fired). That horrible night was a huge event in the dismantling of my shell. It taught me an awful lot about ethics. And frankly, that brief experience in food service forever changed how I deal with staff in similar types of jobs.
This reminds me of reject analysis week as a radiography student. People would be hiding their crap films (film and chemistry, people!) up their tops, behind shelves, basically anywhere. Nowadays the clever ones know how to dick with the server. I have never deleted films for this reason, but I have deleted films to keep incidents quiet. (Boss must not know I got a chunk of steel in my hand prior to a shift in MRI, etc.)
I was testing disaster recovery for the database cluster I was managing. Spun up new instances on AWS, pulled down production data, created various disasters, tested recovery.
Surprisingly it all seemed to work well. These disaster recovery steps weren't heavily tested before. Brilliant! I went to shut down the AWS instances. Kill DB group. Wait. Wait... The DB group? Wasn't it DB-test group...
I'd just killed all the production databases. And the streaming replicas. And... everything... All at the busiest time of day for our site.
Panic arose in my chest. Eyes glazed over. It's one thing to test disaster recovery when it doesn't matter, but when it suddenly does matter... I turned to the disaster recovery code I'd just been testing. I was reasonably sure it all worked... Reasonably...
Less than five minutes later, I'd spun up a brand new database cluster. The only loss was a minute or two of user transactions, which for our site wasn't too problematic.
My friends joked later that at least we now knew for sure that disaster recovery worked in production...
Lesson: When testing disaster recovery, ensure you're not actually creating a disaster in production.
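One way to make that lesson mechanical is a guard in front of any destructive command. A minimal sketch, assuming a naming convention where test groups end in "-test"; the function name `may_destroy` and the retype-to-confirm rule are hypothetical, not anything from the tooling in the story.

```python
def may_destroy(target, retyped_name, safe_suffix="-test"):
    """Return True only if it is acceptable to destroy `target`.

    Targets ending in `safe_suffix` pass automatically. Anything else
    requires the operator to have retyped the exact target name, which
    forces a second look at what is about to be killed.
    """
    if target.endswith(safe_suffix):
        return True
    return retyped_name == target


# Group names from the story: the intent was "DB-test", the selection was "DB".
allowed_test = may_destroy("DB-test", retyped_name="")
blocked_prod = may_destroy("DB", retyped_name="DB-test")
allowed_prod = may_destroy("DB", retyped_name="DB")
```

The check does not prevent deliberate production teardown; it only makes the accidental case fail, because the name the operator believes they are killing does not match the name actually selected.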
I wrote a piece of code controlling an assembly line machine. These machines require manual operation, and would come with a light curtain, which detects when someone places their hand near the moving parts, and should temporarily stop the machine.
A relatively minor bug in the software that I wrote caused the safety curtain to stop triggering when a certain condition was met. We discovered this bug after an operator was injured by one of these machines. Her hand needed something like 14 stitches.
Lessons learnt:
1. Event-driven code is hard.
2. There's no difference between a 'relatively minor' bug and a major one. The damage is still the same.
Another lesson your company should have learned is that a safety-critical system like this should not be left to software. Sure, monitor the curtain by software and send errors, but hardware should immediately stop the machine when the light curtain is broken.
Classic: forgetting the full WHERE part of a manual UPDATE query on a production system. The worst part is you know you fucked up the nanosecond you hit enter, but it's already too late. Lesson learned? Avoid doing things manually, even if a non-technical co-worker insists something needs to be changed right away. And if you do: wrap it in a transaction so you can roll back, and leave in a syntax error that you only remove once you're done typing the query.
I was hired by my college to build a grade management system in my second-to-last year there. I was in a hurry due to a lunch meeting with other IT staff at the University, forgot to add the where clause, and suddenly every single student was a Computational-Science major (mine).
Funny part of the story was that the moment it happened I uttered "oh shit." My boss, who sat beside me, said "what'd you do?", and about 15 IT staff from other departments walked into the office to go out for lunch. I'm sure I was an interesting shade of red.
I had to explain what I did in front of all these people. My boss laughed out loud, brought the system offline, and simply said: "well, after lunch we get to test our backup process." We went for lunch.
Two valuable lessons I learned...
People make mistakes, that isn't a problem, it's how they respond that's important.
Don't try and solve hard problems when emotions are running high. If shit is going down in production, the most important thing to do is to breathe, and get a glass of water. That little bit of time helps a lot.
This is why, while I hate Oracle and everything they represent as a company, I kind of like their database because of the flashback feature. You can do
SELECT * FROM table AS OF TIMESTAMP some_timestamp;
and that is pretty practical. It works online, no restore, no nothing, and while it only works as long as the old data are in logs, on a production system, you should have the spare space to have some history. There's also FLASHBACK TABLE tab TO BEFORE DROP, but that shouldn't happen, right?
Of course, you should probably do every update of production data in a transaction, check the result and then commit, and if you want to be sure, you can do UPDATE ... RETURNING to check what's changing. Autocommit on manual access to production is pretty crazy. But still, flashback is useful.
Reading all of these makes me think, the admin tool for your database of choice should probably put you inside a transaction by default, and require you to explicitly commit changes. For the madmen, it could still have an auto-commit mode, but should be opt-in rather than the default.
I've done similar and now I almost always write a select first and then only after I've verified I'm getting the rows that I expect do I update my query to an update/delete.
In this case though, wouldn't you have to COMMIT before the actual update happens? Usually in production, it is not a good idea to have auto COMMIT on.
Been there, done that. I usually work inside a transaction, and carefully examine the results before typing that all-important 'commit'. But a "simple" change at 4:55 and me in a hurry to get home....
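The habit described in these replies (SELECT first, run the change inside a transaction, commit only if the result looks right) can be sketched in Python with the standard `sqlite3` module. The `guarded_update` helper, the `students` table, and its rows are made up for illustration; a production database would use its own driver, but the shape is the same.

```python
import sqlite3


def guarded_update(conn, sql, params, expected_rows):
    """Run one UPDATE/DELETE and commit only if it touched `expected_rows` rows.

    Returns the number of rows the statement touched; rolls back on mismatch.
    """
    cur = conn.execute(sql, params)  # sqlite3 opens a transaction implicitly for DML
    if cur.rowcount != expected_rows:
        conn.rollback()
    else:
        conn.commit()
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, major TEXT)")
conn.executemany(
    "INSERT INTO students (major) VALUES (?)",
    [("History",), ("Physics",), ("Biology",)],
)
conn.commit()

# Step 1: SELECT first to see how many rows the WHERE clause matches.
expected = conn.execute(
    "SELECT COUNT(*) FROM students WHERE id = ?", (2,)
).fetchone()[0]

# Step 2: the intended update touches exactly the rows just counted, so it commits.
ok_rows = guarded_update(
    conn,
    "UPDATE students SET major = ? WHERE id = ?",
    ("Computational Science", 2),
    expected,
)

# Step 3: the classic mistake (no WHERE clause) touches every row and is rolled back.
bad_rows = guarded_update(
    conn,
    "UPDATE students SET major = ?",
    ("Computational Science",),
    expected,
)

majors = [row[0] for row in conn.execute("SELECT major FROM students ORDER BY id")]
```

The row-count comparison is only a cheap tripwire. It catches the forgotten WHERE clause, but not an UPDATE that hits the right number of wrong rows, so eyeballing the SELECT output still matters.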
This is why you have SET SQL_SAFE_UPDATES=1; (or equivalent) in your DB shell startup. It only takes one UPDATE users SET password='foo'; to learn why...
I did this in a production database (thought it was a QA environment) and brought trading on the mortgage desk of an investment bank to a grinding halt on September 14th, 2008.
The DBAs saved my 23 year old ass that day. I make it a point to send them beer on 9/14 every year.
I run Correlated.org, which is the basis for the upcoming book "Correlated: Surprising Connections Between Seemingly Unrelated Things" (July 2014, Perigee).
I had had some test tables sitting around in the database for a while and decided to clean them up. I stupidly forgot to check the status of my backups; because of an earlier error, they were not being correctly saved.
So, I had a bunch of tables with similar names:
users_1024
users_1025
users_1026
I decided to delete them all in one big swoop.
Guess what got deleted along with them? The actual users table (which I've since renamed to something that does not even contain "users" in it).
So, how do you recover a users table when you've just deleted it and your backup has failed?
Well, I happened to have all of my users' email addresses stored in a separate mailing list table, but that table did not store their associated user IDs.
So I sent them all an email, prompting them to visit a password reset page.
When they visited the page, if their user ID was stored in a cookie -- and for most of them, it was -- I was able to re-associate their user ID with their email address, prompt them to select a new password, and essentially restore their account activity.
There was a small subset of users who did not have their user IDs stored in a cookie, though.
Here's how I tackled that problem:
Because the bulk of a user's activity on the site involves answering poll questions, I prompted them to select some poll questions that they had answered previously, and that they were certain they could answer again in the same way. I was then able to compare their answers to the list of previous responses and narrow down the possibilities. Once I had narrowed it down to a single user, I prompted them to answer a few more "challenge" questions from that user's history, to make sure that the match was correct. (Of course, that type of strategy would not work for a website where you have to be 100% sure, rather than, say, 98% sure, that you've matched the correct person to the account.)
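On the cleanup step that started all this: the story doesn't say exactly how the tables were selected, but a loose name pattern is one way the real users table could get swept up alongside users_1024 and friends. A minimal sketch of a stricter selection, assuming the test tables always look like `users_<digits>`; the `polls` table name, the `PROTECTED` set, and `tables_to_drop` are hypothetical.

```python
import re

TEST_TABLE = re.compile(r"users_\d+")  # assumed naming scheme for the test tables
PROTECTED = {"users"}                  # tables that must never be dropped


def tables_to_drop(all_tables):
    """Return only tables whose full name matches the test-table pattern."""
    return [
        name
        for name in all_tables
        if TEST_TABLE.fullmatch(name) and name not in PROTECTED
    ]


existing = ["users", "users_1024", "users_1025", "users_1026", "polls"]
candidates = tables_to_drop(existing)
# Review `candidates` by eye before generating any DROP TABLE statements.
```

Using `fullmatch` rather than a prefix test is the point: a prefix such as "users" matches the real table too, while the full pattern requires the underscore and digits.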
Not the worst, but certainly most infamous thing I've done: I was testing a condition in a frontend template which, if met, left a <!-- leo loves you --> comment in the header HTML of all the sites we served. Unfortunately the condition was always met and I pushed the change without thinking. This was back in the day when bandwidth was precious and extraneous HTML was seriously frowned upon. We didn't realize it was in production for a week, at which point several engineers actually decided to leave it in as a joke. Then someone higher up found out and browbeat me into removing it, citing bandwidth and disk space costs.
Now, if you go to a CNET site and view source, there's a <!-- Chewie loves you --> comment. I like to think of that as an homage to my original fuckup.
I once worked for a company that schedules advertising before films. This wasn't in the US and the company had a monopoly over all of the ads shown across the country. It was my first programming job and done during university holidays, so I was there for a couple of months and then back to university. Toward the end of the following year I get a phone call: something was wrong with the system, it was allowing agents to overbook advertising slots. I diagnosed the problem over the phone and they put a fix in but management decided it was too late for the company to go back and cancel all of the ads that were already booked. This was not surprising as it was the most money they'd ever made. Conveniently, the parent company owned the cinemas so they did a deal where they just showed all of the ads that were booked.
Because of me, one December, everyone in the country who went to the cinema got to watch anywhere between 30 and 45 minutes of ads before the main presentation started.
Lesson learned: write more tests, monitor everything.
I bet > 66% of these are something to do with databases. :-)
My story (though I wasn't directly responsible): we were delivering our software to an obscure government agency. Based on our recommendation, they had ordered a couple of SGI boxes. I wrote the installation script, which copied stuff off the CD, etc. Being a tcsh aficionado, I decided to write it in tcsh with the shebang line
#!/usr/local/bin/tcsh
Anyways: we send them the CD. Some dude on the other side logs in as root, mounts the CD, and tries to run "installme.csh". "command not found" comes the response.
So he peeks at the script, and sees that it's a shell script. He knows enough of unix that "shell == bash". So he runs "bash installme.csh" . A few minutes go by, and lots of errors. So he reboots; now the system won't come up.
The genius that he is, he decides to try the CD on the second SGI box. Same results.
In the script, the first few lines were something like:
set HOME = "/some/location"
/bin/rm -rf $HOME/*
Hint: IRIX didn't ship with /usr/local/bin/tcsh. And guess what's the value of "HOME" in bash?
We were storing payment details sent from a PHP system into a Ruby system, and I was responsible for the sending and receiving endpoints. Everything was heavily tested on the Ruby end, but the PHP end was a legacy system with no testing framework. Since the details were encrypted on the Ruby end, I didn't do a full end-to-end test AND decrypt the stored results.
Turns out for two months we were storing the string '[Array]' as people's payment details.
Takeaway: If you're doing an end to end test, make sure you go all the way to the end.
~ 2007, working in a large bioinformatics group with our own very powerful cluster, mainly used for protein folding. Example job: fold every protein from a predicted coding region in a given genome. I was mostly doing graph analysis on metabolic and genetic networks though, and writing everything in Perl.
I had a research deadline coming up in a month, but I was also about to go on a hunting trip and be incommunicado for two weeks. I had to kick off a large job (about 75,000 total tasks) but I figured spread over our 8,000 node cluster it would be okay (GPFS storage, set up for us by IBM). I kicked off the jobs as I walked out the door for the woods.
Except I had been doing all my testing of those jobs locally, and my Perl environment was configured slightly differently on the cluster, so while I was running through billions of iterations on each node I was writing the same warning to STDOUT, over and over. It filled up the disks everywhere and caused an epic I/O traffic jam that crashed every single long-running protein folding job. The disk space issues caused some interesting edge cases and it was basically a few days before the cluster would function properly and not lose data or crash jobs. The best part was that I was totally unreachable and thus no one could vent their ire, causing me to return happy and well-rested to an overworked office brimming with fermented ill-will. And I didn't get my own calculations done either, causing me to miss a deadline.
Lessons learned:
1) PRODUCTION != DEVELOPMENT ever ever ever ever
2) Big jobs should be preceded by small but qualitatively identical test jobs
3) Don't launch any multi-day builds on a Friday
4) Know what your resource consumption will mean for your colleagues in the best and worst cases
5) Make sure any bad code you've written has been aired out before you go on vacation
6) Don't use Perl when what you really needed was Hadoop
Second web-related job, at an insurance company; I was 20 years old at the time. We were heavy into online advertising, mostly banners (this was right around when AdWords started to get big). The company had just bought out the entire MSN finance section for the day -- it was a pretty big campaign ($100,000). We drove all the traffic to a landing page I had created with a short form to "Get a quote".
IT had given me permissions to push things live for quick fixes and such. I made a last-minute design tweak and, you guessed it, broke something. I was checking click traffic and inbound leads and realized traffic was through the roof but leads were non-existent. This was about 45 minutes after the campaign was turned on. I jumped on the page and tested it out and got an error on submit. FUCK. I literally started to perspire INSTANTLY.
Jumped into my form and quickly found the bug, can't recall what it was but something small and stupid, then pushed it live without telling a soul. Tested, worked, re-tested, worked. Ran some quick numbers to get a ballpark estimate on the damage I caused... several thousand.
Stood up and walked over to the two IT guys, mentioned I borked things and that I had fixed it... what should I do? I can still see the look on their faces. Shock, then smiles. Walked back to my desk and about 10 minutes later my two bosses show up (I worked for both dev & marketing managers).
They said thanks for catching the problem, not to worry. I did good for finding it myself, fixing it, and pushing it live. I was still sweating and shaking. They walk off and later that day marketing manager informs me MSN will refund us for the 45 minutes of clicks.
It took about a month before I felt competent enough to touch our forms again.
I was once in charge of running an A/B test at my work. Part of the test involved driving people to a new site using AdWords.
After the test was complete, I forgot to turn off the AdWords campaigns. (Such a silly mistake...) Nobody noticed until our bill arrived from Google, and it was substantially higher than normal. When my coworker came to ask me about it ("are these your campaigns?!?"), I just sank in my chair.
I think it cost the company $30k. I suppose it's not that much money in the grand scheme of things, but I felt very bad.
When I worked at ClearChannel back in 2010, we rebuilt Rush Limbaugh's site. When migrating over the billing system, I realized a flaw that granted at least 20,000 people free access to the audio archive ($7.95/month). The billing provider processed the subscriptions, but their system would only sync with our authentication database once a week with a diff of accounts added or removed in the past 7 days. You got the first 7 days free for this reason. If this process failed (e.g. due to a connectivity issue, timeout, or SQL error), all accounts after the error would not be updated. Anyone with a free trial or people who cancelled during a week with an error would get a permanent free trial. I rewrote the code to handle errors and retry on failure so that errors wouldn't happen in the future, but my downfall was running a script that updated all accounts to the correct status. Imagine angry Rush Limbaugh fans used to getting something for free now getting cut off (even though it shouldn't have been free). Management quickly made the decision to give them free access anyway, so I rolled back the change.
During a server migration for our web based file sharing system our lead engineer (at the time) forgot to ensure that all cron jobs (for cleaning up files and sending out automated emails) had been turned back on.
Cue me 7 months later reviewing the system and realizing that critical jobs were no longer running and that our users were all essentially receiving 100% free hosting for however much storage they wanted. SOOOO I turned the jobs back on.
The lead engineer before me left no documentation of what the jobs did other than that they should be run. In my stupor I did not review the code. The jobs sent out a blast of emails warning that files would be deleted if not cleaned up or maintained. Then, seconds later, they deleted said files...
We nuked around 70GB worth of files before we realized what happened. WELL GET THE TAPES! Turns out our lead engineer ALSO forgot to follow up w/ system engineers and the backups were pointed at the wrong storage.
No jobs lost, thankfully the manager at the time was a word smith of the highest degree and can play political baseball like a GOD.
[+] [-] patio11|12 years ago|reply
Lesson #1: Don't code when you're distracted.
Some hours later, the problem manifested. The queue workers came down, and AR (which is totally dependent on them for its core functionality) immediately stopped doing the thing customers pay me money to do. My monitoring system picked up on this and attempted to call me -- which would have worked great, except my cell phone was in a box that wasn't unpacked yet.
Lesson #2a: If you're running something mission critical, and your only way to recover from failure means you have to wake up when the phone rings, make sure that phone stays on and by you.
Later that evening I felt a feeling of vague unease about my change earlier and checked my email from my iPad. My inbox was full of furious customers who were observing, correctly, that I was 8 hours into an outage. Oh dear. I ssh'ed in from the iPad, reverted my last commit, and restarted the queue workers. Queues quickly went down to zero. Problem solved right?
Lesson #3: If at all possible, avoid having to resolve problems when exhausted/distracted. If you absolutely must do it, spend ten extra minutes to make sure you actually understand what went wrong, what your recovery plan is, and how that recovery plan will interact with what went wrong first.
AR didn't use idempotent queues (Lesson #4: Always use idempotent queues), so during the outage, every 5 minutes on a cron job every person who was supposed to be contacted that day got one reminder added to the queue. Fortuitously, AR didn't have all that many customers at the time, so only 15 or so people were affected. Less than fortuitously, those 15 folks had 10 to 100 messages queued, each. As soon as I pressed queues.restart() AR delivered all of those phone calls, text messages, and emails. At once.
Very few residential phone systems or cell phones respond in a customer-pleasing manner to 40 simultaneous telephone calls. It was a total DDOS on my customers' customers.
I got that news at 3 AM Japan time, at my new apartment, which didn't have Internet sufficient to run my laptop and development environment to see e.g. whose phones I had just blown up. Ogaki has neither Internet cafes nor taxis available at 3 AM. As a result, I had to put my laptop in a bag and walk across town, in the freezing rain, to get back to my old apartment, which still had a working Internet connection.
By the time I had completed the walk of shame I was drenched, miserable, and had magnified the likely impact that this had on customers' customers in my own mind. Then I got to my old apartment and checked email. The first one was, as you might expect, rather irate. And I just lost it. Broke down in tears. Cried for a good ten minutes. Called my father to explain what had happened, because I knew that I had to start making apology calls and wasn't sure prior to talking to him that I'd be able to do it without my voice breaking.
The end result? Lost two customers, regained one because he was impressed by my apology. The end users were mostly satisfied with my apologies. (It took me about two hours on the phone, as many of them had turned off their phones when they blew up.)
You'd need a magnifying glass to detect it ever happened, looking on any chart of interest to me. The software got modestly better after I spent a solid two weeks on improved fault tolerance and monitoring.
Lesson the last: It's just a job/business. The bad days are usually a lot less important in hindsight than they seem in the moment.
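As an illustration of Lesson #4 above, here is a minimal Python sketch of an idempotent enqueue. It is not Appointment Reminder's actual code: `ReminderQueue`, `cron_tick`, and the `(appointment_id, channel, day)` key are hypothetical names, and the queue lives in memory only. The idea is that each reminder gets a deterministic key, so a cron job that fires every 5 minutes during an outage still queues each reminder once.

```python
from collections import deque


class ReminderQueue:
    """In-memory stand-in for a job queue that ignores duplicate jobs."""

    def __init__(self):
        self._jobs = deque()
        self._seen = set()  # keys that were queued, whether or not delivered yet

    def enqueue(self, appointment_id, channel, day):
        """Queue a reminder once per (appointment, channel, day).

        Returns True if the job was added, False if it was a duplicate.
        """
        key = (appointment_id, channel, day)
        if key in self._seen:
            return False  # an earlier cron run already queued this reminder
        self._seen.add(key)
        self._jobs.append(key)
        return True

    def drain(self):
        """Hand every pending job to the workers and empty the queue."""
        jobs = list(self._jobs)
        self._jobs.clear()
        return jobs


def cron_tick(queue, due_today, day):
    """What a periodic cron run would do: try to queue every due reminder."""
    for appointment_id, channel in due_today:
        queue.enqueue(appointment_id, channel, day)
```

With something like this in place, restarting the workers after hours of repeated `cron_tick` calls would deliver one message per reminder rather than one per cron run. A real system would keep the seen keys somewhere durable, such as a unique index in the database, instead of a Python set.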
[+] [-] d0m|12 years ago|reply
Lesson #1: Don't push code on Friday afternoon.
Lesson #2: Beer, Code and Commit is totally fine. Just don't push! Wait until the next day to review and push/deploy it.
[+] [-] zamalek|12 years ago|reply
> Don't code when you're distracted.
Same story here. I can't remember the exact scenario, but I was concurrently acting under all three of my titles (Developer, Architect, Escalation Engineer/Critical Debugger). The customer (whose time zone was 7 hours off from ours) had been battling for 3 months and we were getting nowhere, all thanks to a since-fixed bug in WinDBG that essentially came down to broken stack traces in certain scenarios. For those 3 months I had been working 20-hour shifts (development by day, support by night).
Under that strain I eventually made a screw up with the dev and it cost QA time. The MD of my region had a sit down with me and I
> cried literal tears
Needless to say they were impressed that someone cared so much and sent me home to sleep. The next day I came in and decided to go through the 800kloc codebase line-by-line and see what could be causing the issue - I found it in a few hours.
Lesson #4: Get sleep.
[+] [-] bedhead|12 years ago|reply
[+] [-] BU_student|12 years ago|reply
Excuse me for caviling at your post, but "fortuitously" is a synonym for "accidentally", not "fortunately".
[+] [-] bradbatt|12 years ago|reply
[+] [-] zem|12 years ago|reply
[+] [-] yan|12 years ago|reply
She found out about it pretty quickly due to having syslog be a constant presence in one of her gnu screen windows and gave me a look. She quickly reverted what I did, updated our config management tool, tested it, then deployed it, while explaining why this was the right way to do things. I slowly came around to doing things the right way and haven't thought much about the initial incident until we found her personal logs that she archived and left on our public network share for future reference.
In the entries for the day that I started, we saw the following two lines:
[+] [-] sillysaurus2|12 years ago|reply
I'm amazed that this is possible. How would I set something like that up? A realtime log of only the most significant events of a remote system?
In fact, I'd like to take this opportunity of ignorance-admitting to ask the community for general linux/bsd sysadmin resources. What books should I read, or what topics should I study? I want to become an expert at modern sysops. Modern deployment, hardening, backup, managing dozens of boxen, etc.
I've been thinking of going through any MIT OCW on the subject, but it seems like hard-earned experience might not necessarily translate well to an academic setting. What would you recommend I do?
[+] [-] cunac|12 years ago|reply
[+] [-] gmays|12 years ago|reply
Nobody was killed, but we had a few injured. Thankfully the brunt of it hit the MRAP in front of us. If it hit my vehicle (HMMWV, flat bottom) instead I probably wouldn't be here.
That was the first major operation on my first deployment, too. Hello, world!
My takeaway? Shit just got real.
We ended up stranded that night after the 3rd IED strike (our "rescuers" said it was too dangerous to get us). It was the scariest day of my life, but in similar future situations it was different. I still felt fear and the reality of the existential threat, but I accepted it. It was almost liberating. Strange.
I deployed for another year after that (to Afghanistan that time). After Afghanistan I left the Corps and started my company. Because if it fails, what's the worst that can happen? Lulz.
[+] [-] kadabra9|12 years ago|reply
Depending on the industry, that might not be the case though. Thanks for your service.
[+] [-] ggreer|12 years ago|reply
I was a little sleepy one morning and accidentally connected to prod instead of testing. I thought, "That's weird, this UPDATE shouldn't have taken so long... oh shit." I'd managed to clear all allergy and malignant hyperthermia fields. For all I knew, some anesthesiologist would kill a patient because of my mistake. I was shaking. I immediately found the technical lead, pulled him from a meeting, and told him what happened. He'd been smart enough to set up hourly DB snapshots and query logs. It only took five minutes to restore from a snapshot and replay all the logs, not including my UPDATE.
Afterwards, my access to prod was not revoked. We both agreed I'd learned a valuable lesson, and that I was unlikely to repeat that mistake. The tech lead explained the incident to the higher-ups, who decided to avoid mentioning anything to the affected hospitals.
If it's any consolation, the company is no longer in business.
Just remember when you screw things up: Your mistake probably won't get anyone killed, so don't panic too much.
[+] [-] munificent|12 years ago|reply
Blaming yourself here is like blaming yourself for being hurt after being told to drive a car with no seatbelt or brakes.
[+] [-] matellis|12 years ago|reply
The story reminds me of the day I was introduced to "BEGIN TRANS", "COMMIT" and "ROLLBACK" when someone upgraded the Sybase console and helpfully changed the default setting so we didn't need those pesky semi-colons to finish a query any more. The result was:
Phone starts to ring a few seconds later as all the users saw their morning's work disappear. This stuff is way too easy for us noobs. Thank goodness that with modern technology we've found ways to make sure it doesn't happen any more... :-)
[+] [-] bradbatt|12 years ago|reply
That right there is one of the worst feelings in the world. I imagine that everyone on HN just felt it with you as they were reading it!
Totally agree that the higher-ups were responsible for not putting better roadblocks in place to prevent this type of thing from happening.
[+] [-] smartician|12 years ago|reply
We were able to restore the data pretty quickly, but we had to interrupt warehouse workflow for a few minutes. They were surprisingly accommodating, almost amused by my mistake.
[+] [-] hluska|12 years ago|reply
When I worked at Subway, the bread dough came frozen, but you would put loaves in a proofer, proof it for a certain amount of time, and then bake it. My first shift, however, got busy and I left several trays in the proofer for a very, very long time. Consequently, they rose to roughly the size of loaves of bread, as opposed to the usual buns.
It was my very first shift alone at any job in my life, so I did the most logical thing I could think of and put the massive buns in the oven. They cooked up nicely enough and I thought I was saved. Until I tried to cut into one.
Back in that day, Subway used to cut those silly u-shaped gouges out of their buns. In retrospect, I think this was most likely a bizarre HR technique designed to weed out the real dummies, but at the time I was oblivious (likely because I was one of the dummies they should have weeded out). When I ran out of the normal bread, I grabbed one of my monstrosities, tried to cut into it, and discovered that it was not only rock hard, but the loaf broke apart as I tried to cut it.
That night, my severe shyness and social awkwardness had their first run-in with beasts known as angry customers. I was scared I would get fired, so I promptly made new buns, but spent the rest of my shift trying to get rid of my blunder. I discovered some really interesting things about people that night. First, you'd be surprised how incredibly nice customers are if you are straight up with them. Some customers I had never met before treated the big, crumbly buns as an adventure and, in doing so, helped me sell all the ruined buns.
In the end, I came clean (and didn't get fired). That horrible night was a huge event in the dismantling of my shell. It taught me an awful lot about ethics. And frankly, that brief experience in food service forever changed how I deal with staff in similar types of jobs.
[+] [-] lostlogin|12 years ago|reply
[+] [-] canadev|12 years ago|reply
[+] [-] ericcope|12 years ago|reply
[+] [-] benaiah|12 years ago|reply
[+] [-] Smerity|12 years ago|reply
Surprisingly it all seemed to work well. These disaster recovery steps weren't heavily tested before. Brilliant! I went to shut down the AWS instances. Kill DB group. Wait. Wait... The DB group? Wasn't it DB-test group...
I'd just killed all the production databases. And the streaming replicas. And... everything... All at the busiest time of day for our site.
Panic arose in my chest. Eyes glazed over. It's one thing to test disaster recovery when it doesn't matter, but when it suddenly does matter... I turned to the disaster recovery code I'd just been testing. I was reasonably sure it all worked... Reasonably...
Less than five minutes later, I'd spun up a brand new database cluster. The only loss was a minute or two of user transactions, which for our site wasn't too problematic.
My friends joked later that at least we now knew for sure that disaster recovery worked in production...
Lesson: When testing disaster recovery, ensure you're not actually creating a disaster in production.
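One way to act on that lesson is a guard in front of every destructive command. The Python sketch below is only an assumption about how such a guard could look: `confirm_destroy`, `kill_group`, and the `-test` suffix convention are hypothetical and not taken from any AWS tooling. Anything whose name does not end in `-test` is treated as production and needs its exact name typed back before it is destroyed.

```python
def confirm_destroy(target, typed_confirmation=None, test_suffix="-test"):
    """Return True only if it is safe to destroy `target`.

    Targets ending in `test_suffix` pass. Anything else is treated as
    production and requires the operator to retype the exact target name.
    """
    if target.endswith(test_suffix):
        return True
    return typed_confirmation == target


def kill_group(target, terminate, typed_confirmation=None):
    """Call the hypothetical `terminate(target)` only after the guard passes."""
    if not confirm_destroy(target, typed_confirmation):
        raise PermissionError(
            f"refusing to destroy {target!r} without typed confirmation"
        )
    terminate(target)
```

Here `kill_group("DB-test", terminate)` goes straight through, while `kill_group("DB", terminate)` raises unless the caller also passes `typed_confirmation="DB"`. Retyping the name is a small speed bump, but it forces a second look at exactly the moment the story above went wrong.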
[+] [-] grecy|12 years ago|reply
'hyper-committed disaster recovery testing'
[+] [-] yen223|12 years ago|reply
A relatively minor bug in the software that I wrote caused the safety curtain to stop triggering when a certain condition was met. We discovered this bug after an operator was injured by one of these machines. Her hand needed something like 14 stitches.
Lessons learnt:
1. Event-driven code is hard.
2. There's no difference between a 'relatively minor' bug and a major one. The damage is still the same.
[+] [-] BuildTheRobots|12 years ago|reply
[+] [-] HeyLaughingBoy|12 years ago|reply
[+] [-] michh|12 years ago|reply
[+] [-] dboyd|12 years ago|reply
I was hired by my college to build a grade management system in my second-to-last year there. I was in a hurry due to a lunch meeting with other IT staff at the University, forgot to add the where clause, and suddenly every single student was a Computational-Science major (mine).
Funny part of the story was that the moment it happened I uttered "oh shit." My boss, who sat beside me, said "what'd you do?", and about 15 IT staff from other departments walked into the office to go out for lunch. I'm sure I was an interesting shade of red.
I had to explain what I did in front of all these people. My boss laughed out loud, brought the system offline, and simply said: "well, after lunch we get to test our backup process." We went for lunch.
Two valuable lessons I learned...
People make mistakes, that isn't a problem, it's how they respond that's important.
Don't try and solve hard problems when emotions are running high. If shit is going down in production, the most important thing to do is to breathe, and get a glass of water. That little bit of time helps a lot.
[+] [-] glogla|12 years ago|reply
Of course, you should probably do every update of production data in a transaction, check the result and then commit, and if you want to be sure, you can do UPDATE ... RETURNING to check what's changing. Autocommit on manual access to production is pretty crazy. But still, flashback is useful.
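The "transaction, check the result, then commit" routine can be sketched in a few lines of Python. This uses the standard-library `sqlite3` module purely as a stand-in for whatever production database is involved; `guarded_update` and its `expected_rows` parameter are hypothetical names for illustration. The statement is committed only if it touched the number of rows the operator expected, and rolled back otherwise.

```python
import sqlite3


def guarded_update(conn, sql, params, expected_rows):
    """Run one UPDATE in a transaction; commit only if the row count matches.

    With sqlite3's default settings a transaction is opened implicitly
    before the UPDATE, so rollback() undoes it completely.
    """
    cur = conn.execute(sql, params)
    if cur.rowcount != expected_rows:
        conn.rollback()
        raise RuntimeError(
            f"expected {expected_rows} row(s), statement touched "
            f"{cur.rowcount}; rolled back"
        )
    conn.commit()
    return cur.rowcount
```

A forgotten WHERE clause then shows up as a row-count mismatch: an update meant for one student that reports every row in the table is rolled back instead of committed. On databases that support `UPDATE ... RETURNING`, the returned rows could be inspected in the same place before deciding to commit.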
[+] [-] Volundr|12 years ago|reply
[+] [-] Davertron|12 years ago|reply
[+] [-] codegeek|12 years ago|reply
[+] [-] marvvelous|12 years ago|reply
> UPDATE
> UPDATE SET url='new_url' WHERE source_id IN (etc)
> UPDATE sources SET url='new_url' WHERE source_id IN (etc)
[+] [-] gargarplex|12 years ago|reply
Why doesn't MySQL have version control baked in? Even if it preserves just the last n hours of state...
[+] [-] Volundr|12 years ago|reply
[+] [-] jpatokal|12 years ago|reply
[+] [-] kchoudhu|12 years ago|reply
I did this in a production database (thought it was a QA environment) and brought trading on the mortgage desk of an investment bank to a grinding halt on September 14th, 2008.
The DBAs saved my 23 year old ass that day. I make it a point to send them beer on 9/14 every year.
[+] [-] jawns|12 years ago|reply
I had had some test tables sitting around in the database for a while and decided to clean them up. I stupidly forgot to check the status of my backups; because of an earlier error, they were not being correctly saved.
So, I had a bunch of tables with similar names:
I decided to delete them all in one big swoop. Guess what got deleted along with them? The actual users table (which I've since renamed to something that does not even contain "users" in it).
So, how do you recover a users table when you've just deleted it and your backup has failed?
Well, I happened to have all of my users' email addresses stored in a separate mailing list table, but that table did not store their associated user IDs.
So I sent them all an email, prompting them to visit a password reset page.
When they visited the page, if their user ID was stored in a cookie -- and for most of them, it was -- I was able to re-associate their user ID with their email address, prompt them to select a new password, and essentially restore their account activity.
There was a small subset of users who did not have their user IDs stored in a cookie, though.
Here's how I tackled that problem:
Because the bulk of a user's activity on the site involves answering poll questions, I prompted them to select some poll questions that they had answered previously, and that they were certain they could answer again in the same way. I was then able to compare their answers to the list of previous responses and narrow down the possibilities. Once I had narrowed it down to a single user, I prompted them to answer a few more "challenge" questions from that user's history, to make sure that the match was correct. (Of course, that type of strategy would not work for a website where you have to be 100% sure, rather than, say, 98% sure, that you've matched the correct person to the account.)
[+] [-] leothekim|12 years ago|reply
Now, if you go to a CNET site and view source, there's a <!-- Chewie loves you --> comment. I like to think of that as an homage to my original fuckup.
[+] [-] itwasme|12 years ago|reply
Because of me, one December, everyone in the country who went to the cinema got to watch anywhere between 30 and 45 minutes of ads before the main presentation started.
Lesson learned: write more tests, monitor everything.
[+] [-] discardorama|12 years ago|reply
My story (though I wasn't directly responsible): we were delivering our software to an obscure government agency. Based on our recommendation, they had ordered a couple of SGI boxes. I wrote the installation script, which copied stuff off the CD, etc. Being a tcsh aficionado, I decided to write it in tcsh with the shebang line
Anyways: we send them the CD. Some dude on the other side logs in as root, mounts the CD, and tries to run "installme.csh". "command not found" comes the response. So he peeks at the script, and sees that it's a shell script. He knows enough of unix that "shell == bash". So he runs "bash installme.csh". A few minutes go by, and lots of errors. So he reboots; now the system won't come up. The genius that he is, he decides to try the CD on the second SGI box. Same results.
In the script, the first few lines were something like:
Hint: IRIX didn't ship with /usr/local/bin/tcsh. And guess what's the value of "HOME" in bash?
[+] [-] snikch|12 years ago|reply
We were storing payment details sent from a PHP system into a Ruby system, and I was responsible for the sending and receiving endpoints. Everything was heavily tested on the Ruby end, but the PHP end was a legacy system with no testing framework. Since the details were encrypted on the Ruby end, I didn't do a full test from end to end AND decrypt the stored results.
Turns out for two months we were storing the string '[Array]' as people's payment details.
Takeaway: If you're doing an end to end test, make sure you go all the way to the end.
[+] [-] jboggan|12 years ago|reply
~ 2007, working in a large bioinformatics group with our own very powerful cluster, mainly used for protein folding. Example job: fold every protein from a predicted coding region in a given genome. I was mostly doing graph analysis on metabolic and genetic networks though, and writing everything in Perl.
I had a research deadline coming up in a month, but I was also about to go on a hunting trip and be incommunicado for two weeks. I had to kick off a large job (about 75,000 total tasks) but I figured spread over our 8,000 node cluster it would be okay (GPFS storage, set up for us by IBM). I kicked off the jobs as I walked out the door for the woods.
Except I had been doing all my testing of those jobs locally, and my Perl environment was configured slightly differently on the cluster, so while I was running through billions of iterations on each node I was writing the same warning to STDOUT, over and over. It filled up the disks everywhere and caused an epic I/O traffic jam that crashed every single long-running protein folding job. The disk space issues caused some interesting edge cases and it was basically a few days before the cluster would function properly and not lose data or crash jobs. The best part was that I was totally unreachable and thus no one could vent their ire, causing me to return happy and well-rested to an overworked office brimming with fermented ill-will. And I didn't get my own calculations done either, causing me to miss a deadline.
Lessons learned:
1) PRODUCTION != DEVELOPMENT ever ever ever ever
2) Big jobs should be preceded by small but qualitatively identical test jobs
3) Don't launch any multi-day builds on a Friday
4) Know what your resource consumption will mean for your colleagues in the best and worst cases
5) Make sure any bad code you've written has been aired out before you go on vacation
6) Don't use Perl when what you really needed was Hadoop
[+] [-] admiraltbags|12 years ago|reply
Second web related job at an insurance company, I was 20 years old at the time. We were heavy into online advertising, mostly banners at the time (this was right around when adwords started to get big). The company just bought out all of the MSN finance section of their site for the day-- it was a pretty big campaign ($100,000). We drove all the traffic to a landing page I had created with a short form to "Get a quote".
IT had given me permissions to push things live for quick fixes and such. I made a last-minute design tweak and, you guessed it, broke something. I was checking click traffic and inbound leads and realized traffic was through the roof but leads were non-existent. This was about 45 minutes after the campaign was turned on. I jumped on the page and tested it out and got an error on submit. FUCK. I literally started to perspire INSTANTLY.
Jumped into my form and quickly found the bug, can't recall what it was but something small and stupid, then pushed it live without telling a soul. Tested, worked, re-tested, worked. Ran some quick numbers to get a ballpark estimate on the damage I caused... several thousand.
Stood up and walked over to the two IT guys, mentioned I borked things and that I had fixed it... what should I do? I can still see the look on their faces. Shock, then smiles. Walked back to my desk and about 10 minutes later my two bosses show up (I worked for both dev & marketing managers).
They said thanks for catching the problem, not to worry. I did good for finding it myself, fixing it, and pushing it live. I was still sweating and shaking. They walk off and later that day marketing manager informs me MSN will refund us for the 45 minutes of clicks.
It took about a month before I felt competent enough to touch our forms again.
[+] [-] nostromo|12 years ago|reply
[+] [-] byoung2|12 years ago|reply
[+] [-] tptacek|12 years ago|reply
https://www.google.com/search?q=ptacek+kaminsky+leak
[+] [-] killertypo|12 years ago|reply