top | item 5292591

How I Fired Myself

620 points| mkrecny | 13 years ago |edu.mkrecny.com | reply

411 comments

order
[+] bguthrie|13 years ago|reply
More than anything else, this describes an appalling failure at every level of the company's technical infrastructure to ensure even a basic degree of engineering rigor and fault tolerance. It's noble of the author to quit, but it's not his fault. I cannot believe they would have the gall to point the blame at a junior developer. You should expect humans to fail: humans are fallible. That's why you automate.
[+] potatolicious|13 years ago|reply
More than that, it's telling that the company threw him under the bus when it happened. I've been through major fuckups before, and in all cases the team presents a united front - the company fucked up, not an individual.

Which is, if you think about it, true, given that the series of events leading up to the disaster (the lack of a testing environment, working with prod databases, lack of safeties in the tools used to connect to database, etc...).

The correct way to respond to disasters like this is "we fucked up", not "someone fucked up".

[+] downandout|13 years ago|reply
Agreed. How could a company with "millions in revenue" not backup critical databases? Not only were they exposed to the threat of human error, but hardware failures, hackers, etc. When he submitted his resignation, the company should have encouraged him to stay. Instead, anyone at the company that had anything to do with the failure to implement regular database backups and the use of redundant databases should have been fired.

I've had more than my fair share of failures in the start up world. It always drives me crazy to see internet companies that have absolutely no technical ability succeed, while competing services that are far superior technically never get any traction.

[+] pavel_lishin|13 years ago|reply
> It's noble of the author to quit

It's not just noble. Considering how everyone treated him, and the company's attitude in general, he had no future there, and neither should anyone else.

[+] liquidise|13 years ago|reply
Exactly what you said. Rigor is truly the right word to use here. Cancelling your db backups is basically asking for a disaster. I'm not sure i have ever been at job that didnt require a db backup for some reason at some time.
[+] x3ord|13 years ago|reply
Exactly. I'm ashamed that my first reaction reading this was to blame OP. But in the 2 min it took to read the post I had come full circle to wondering what kind of terribly run company would allow this to happen--I guess the type that hires philosophy majors straight out of college without vetting their engineering skills.
[+] j_baker|13 years ago|reply
"Nobody cares about technical infrastructure. Our customers don't pay us for engineering rigor. We need to just ship!"

Of course the person saying that is likely to care about technical infrastructure when it costs them money and/or customers due to being hacked-together.

[+] gtirloni|13 years ago|reply
When I was a junior analyst, I once deleted the main table that contained all 70k+ users, passwords, etc. The problem was fixed in 15min after the DBA was engaged to copy all data back from the QA environment that was synchronized every X minutes. Or we could have restored a back from a few hours ago.

The whole company fucked this one up pretty badly. NO excuses.

[+] jcromartie|13 years ago|reply
Indeed. If it had been a sporadic hardware failure, they would have been exactly as screwed. The fact that they gave an overworked junior dev direct read/write access to the production database is astounding.
[+] SteveGerencser|13 years ago|reply
I worked for a startup a few years back where the CEO deleted the entire 1+TB database/website when the server was hacked and being used as a spam server because he 1. didn't know who to disable the site short of deleting and 2. could reach anyone that did know how.

The next morning he told us to restore the site from backups and fix the security hole. That's when we reminded him, again, that he had refused to pay for backup services for a site of that size.

We all ended up looking fro new jobs withing a couple days.

[+] readme|13 years ago|reply
I wish I could upvote this more. Production database? Seriously, they were one copy away from avoiding this whole outcome.

CEO sounds incompetent as hell.

[+] fitandfunction|13 years ago|reply
Exactly. Problems with production databases are inevitable. It's just a matter of time.

The guy who should be falling on the sword, if anyone, is the person in charge of backups.

Better yet, the CEO or CTO should have made this a learning opportunity and taken the blame for the oversight + praised the team for banding together and coming up with a solution + a private chat with the OP.

[+] benjaminwootton|13 years ago|reply
I came here to type that almost exactly word for word.

This is a truly amazing story if this system was really supporting milions in revenue.

[+] mncolinlee|13 years ago|reply
I cannot even begin to fathom how they functioned without a working development environment for testing, let alone let their backups lapse.

The kind of table manipulations he mentions would be unspeakable in most companies. Someone changing the wrong table would be inevitable. If I were an auditor, I would rake them all over the coals.

[+] jliptzin|13 years ago|reply
Unbelievable. Even at my 3 person startup, back in 2010, with thousands in revenue, not millions, we had development environments with test databases and automated daily database snapshotting. Sure I've accidentally truncated a few tables in my time, but luckily I wasn't dumb enough to be developing on a production server.
[+] sybhn|13 years ago|reply
Absolutely. Your immediate technical management sucked, and you were made the scape goat for your management's failure. Welcome to the real world. Don't get me wrong, you should feel bad, very bad, bad enough so that never again you do that. But you shouldn't feel guilty nor rethink your career.
[+] coopdog|13 years ago|reply
That's the Agile way, people not processes.

When the people screw up, there's no processes to hold them back, they can really, really screw it up.

Once you hit millions in revenue it's probably wise to put a couple of fall back processes in place, reliability becomes as important as agility.

[+] richo|13 years ago|reply
You're forgetting the guy who didn't speak up to say "Hey, maybe we shouldn't do this in prod?".

I feel for him, but at the same time there's a point at which you have to if testing guns by shooting them near (but not specifically at) your coworkers is actually a good idea.

[+] gawker|13 years ago|reply
Agree with the point that humans are fallible. We should always have back ups. A company with 1000s of paying customers should at least take steps to protect itself from this sort of catastrophe.
[+] cybernomad99|13 years ago|reply
To the credit of the management, they did not fire him. He resigned. But the coworkers felt he was responsible personally. That makes a uneasy work environment.
[+] columbo|13 years ago|reply
News flash,

If you are a CEO you should be asking this question: "How many people in this company can unilaterally destroy our entire business model?"

If you are a CTO you should be asking this question: "How quickly can we recover from a perfect storm?"

They didn't ask those questions, they couldn't take responsibility, they blamed the junior developer. I think I know who the real fuckups are.

As an aside: Way back in time I caused about ten thousand companies to have to refile some pretty important government documents because I was doubling xml decoding (& became &). My boss actually laughed and was like "we should have caught this a long time ago"... by we he actually meant himself and support.

[+] hga|13 years ago|reply
If you are a CEO you should be asking this question: "How many people in this company can unilaterally destroy our entire business model?"

In high tech this can get really messy, these are frequently inherently more fragile companies. My favorite example is from Robert X. Cringley in this great book: http://www.amazon.com/Accidental-Empires-Silicon-Millions-Co... ; from memory:

One day Intel's yields suddenly went to hell (that's the ratio of working die on a wafer to non-working, and is a key to profitability). And no matter how hard they tried, they could only narrow it down to the wafers being contaminated, but the wafer supplier swore up and down they were shipping good stuff, and they were. So eventually they tasked a guy to follow packages from the supplier all the way to the fab lines, and he found the problem in Intel's receiving department. Where a clerk was breaking open the sealed packages and counting out the wafers on his desk to make damned sure Intel was getting its money worth....

His point is that you can have a Fortune 500 company, normally thought to be stable companies that won't go "poof" without ample warning, in which there are many more people than in previous kinds of companies who can very quickly kill it dead.

[+] novax81|13 years ago|reply
The difference between working with bad bosses and good bosses really shows itself when there's a disaster going on.

A mere few months into my current job, I ran an SQL query that updated some rows without properly checking one of the subqueries. Long story short - I fucked up an entire attribute table on a production machine (client insisted they didn't need a dev clone). I literally just broke down and started panicking, sure that I'd just lost my new job and wouldn't even have a reference to run on for a new one.

After a few minutes of me freaking out, my boss just says: "You fucked up. I've fucked up before, probably worse than this. Let's go fix it." And that was it. We fixed it (made it even better than it was, as we fixed the issue I was working on in the process), and instead of feeling like an outcast who didn't belong, I learned something and became a better dev for it.

[+] lsc|13 years ago|reply
>If you are a CEO you should be asking this question: "How many people in this company can unilaterally destroy our entire business model?"

This is a question that the person in charge of backups needs to think about, too. I mean, rephrase it as "Is there any one person who can write to both production and backup copies of critical data?" but it means the same thing as what you said.

(and if the CTO, or whoever is in charge of backups screws up this question? the 'perfect storm' means "all your data is gone" - dono about you, but my plan for that involves bankruptcy court, and a whole lot of personal shame. Someone coming in and stealing all the hardware? not nearly as big of a deal, as long as I've still got the data. My own 'backup' house is not in order, well, for lots of reasons, mostly having to do with performance, so I live with this low-level fear every day.)

Seriously, think, for a moment. There's at least one kid with root on production /and/ access to the backups, right? At most small companies, that is all your 'root-level' sysadmins.

That's bad. What if his (or her) account credentials get compromised? (or what if they go rogue? it happens. Not often, and usually when it does it's "But this is really best for the company" It's pretty rare that a SysAdmin actively and directly attempts to destroy a company.)

(SysAdmins going fully rogue is pretty rare, but I think it's still a good thought experiment. If there is no way for the user to destroy something when they are actively hostile, you /know/ they can't destroy it by accident. It's the only way to be sure.)

The point of backups, primarily, is to cover your ass when someone screws up, primarily. (RAID, on the other hand, is primarily to cover your ass when hardware fails) - RAID is not Backup and Backup is not RAID. You need to keep this in mind when designing your backup, and when designing your RAID.

(Yes, backup is also nice when the hardware failure gets so bad that RAID can't save you; but you know what? that's pretty goddamn rare, compared to 'someone fucked up.')

I mean, the worst case backup system would be a system that remotely writes all local data off site, without keeping snapshots or some way of reverting. That's not a backup at all; that's a RAID.

The best case backup is some sort of remote backup where you physically can't overwrite the goddamn thing for X days. Traditionally, this is done with off-site tape. I (or rather, your junior sysadmin monkey) writes the backup to tape, then tests the tape, then gives the tape to the iron mountain truck to stick in a safe. (if your company has money; if not, the safe is under the owner's bed.)

I think that with modern snapshots, it would be interesting to create a 'cloud backup' service where you have a 'do not allow overwrite before date X' parameter, and it wouldn't be that hard to implement, but I don't know of anyone that does it. The hard part about doing it in house is that the person who managed the backup server couldn't have root on production and vis-a-vis, or you defeat the point, so this is one case where outsourcing is very likely to be better than anything you could do yourself.

[+] ritchiea|13 years ago|reply
This is so true. I had to leave my first junior dev position for similar reasons as the OP, though nothing as monumental.

I was handed a legacy codebase with zero tests. I left a few small bugs in production, and got absolutely chewed out for it. It was never an issue with our processes, it was obviously an issue with the guy they hired who had 1 intro CS class and 1 rails hobby project on his resume. The lead dev never really cared that we didn't have tests, or a careful deploy process. He just got angry when things went wrong. And even gave one more dev with even less experience than I had access to the code base.

It was a mess and the only thing I was "learning" was "don't touch any code you don't have to because who knows what you might break" which is obviously a terrible way to learn as a junior dev (forget "move fast and break things" we "move slowly and work in fear!"). So I quit and moved on, it was one of the better decisions I've ever made.

[+] xentronium|13 years ago|reply
This is certainly a monumental fuckup, but these things inevitably happen even with better development practices, this is why you need backups, preferably daily, and as much separation of concerns and responsibilities as humanly possible.

Anecdote:

I am working for a company that does some data analysis for marketers aggregated from a vast number of sources. There was a giant legacy MyISAM (this becomes important later) table with lots of imported data. One day, I made some trivial looking migration (added a flag column to that table). I tested it locally, rolled it out to staging server. Everything seemed A-OK until we started migration on the production server. Suddenly, everything broke. By everything, I mean EVERYTHING, our web application showed massive 500-s, total DEFCON1 across the whole company. It turned out we ran out of disk space, since apparently myisam tables are altered the following way: first the table is created with updated schema, then it is populated with data from the old table. MyISAM ran out of disk space and somehow corrupted the existing tables, mysql server would start with blank tables, with all data lost.

I can confirm this very feeling: "The implications of what I'd just done didn't immediately hit me. I first had a truly out-of-body experience, seeming to hover above the darkened room of hackers, each hunched over glowing terminals." Also, I distinctly remember how I shivered and my hands shook. It felt like my body temperature fell by several degrees.

Fortunately for me, there was a daily backup routine in place. Still, several hour long outage and lots of apologies to angry clients.

"There are two types of people in this world, those who have lost data, and those who are going to lose data"

[+] grey-area|13 years ago|reply
Tens of thousands of paying customers and no backups?

No staging environment (from which ad-hoc backups could have been restored)!?!?

No regular testing of backups to ensure they work?

No local backups on dev machines?!?

Using a GUI tool for db management on the live db?!?!?

No migrations!?!?!

Junior devs (or any devs) testing changes on the live db and wiping tables?!?!?!

What an astonishing failure of process. The higher ups are definitely far more responsible than some junior developer for this, he shouldn't have been allowed near the live database in the first place until he was ready to take changes live, and then only on to a staging environment using migrations of some kind which could then be replayed on live.

They need one of these to start with, then some process:

http://www.bnj.com/cowboy-coding-pink-sombrero/

[+] cmos|13 years ago|reply
When I was 18 I took out half my towns power for 30 minutes with a bad scada command. It was my summer job before college and I went from cleaning the warehouse to programming the main SCADA control system in a couple weeks.

Alarms went off, people came running in freaking out, trucks started rolling out to survey the damage, hospitals started calling about people on life support and how the backup generators were not operational, old people started calling about how they require AC to stay alive and should they take their loved ones on machines to the hospital soon.

My boss was pretty chill about it. "Now you know not to do that" were his words of wisdom, and I continued programming the system for the next 4 summers with no real mistakes.

[+] cedsav|13 years ago|reply
Whoever was your boss should have taken responsibility. Someone gave you access to the production database instead of setting up a proper development and testing environment. For a company doing "millions" in revenues, it's odd that they wouldn't think of getting someone with a tiny bit of experience to manage the development team.
[+] mootothemax|13 years ago|reply
The CEO leaned across the table, got in my face, and said, "this, is a monumental fuck up. You're gonna cost us millions in revenue".

No, the CEO was at fault, as was whoever let you develop against the production database.

If the CEO had any sense, he should have put you in charge of fixing the issue and then making sure it could never happen again. Taking things further, they could have asked you to find other worrying areas, and come up with fixes for those before something else bad happens.

I have no doubt that you would have taken the task extremely seriously, and the company would have ended up in a better place.

Instead, they're down an employee, and the remaining employees know that if they make a mistake, they'll be out of the door.

And they still have an empty users table.

[+] hackoder|13 years ago|reply
I was in a situation very similar to yours. Also a game dev company, also lots of user data etc etc. We did have test/backup databases for testing, but some data was just on live and there was no way for me to build those reports other than to query the live database when the load was lower.

In any case, I did a few things to make sure I never ended up destroying any data. Creating temporary tables and then manipulating those.. reading over my scripts for hours.. dumping table backups before executing any scripts.. not executing scripts in the middle/end of the day, only mornings when I was fresh etc etc.

I didn't mess up, but I remember how incredibly nerve wracking that was, and I can relate to the massive amount of responsibility it places on a "junior" programmer. It just should never be done. Like others have said, you should never have been in that position. Yes, it was your fault, but this kind of responsibility should never have been placed on you (or anyone, really). Backing up all critical data (what kind of company doesn't backup its users table?! What if there had been hard disk corruption?), and being able to restore in minimum time should have been dealt with by someone above your pay grade.

[+] arethuza|13 years ago|reply
Out of interest, why not create a database user account that is read only and use that?
[+] Yare|13 years ago|reply
If it helps explain things, the only experience the CEO had before this social game shop was running a literal one-man yogurt shop.

This happened a week before I started as a Senior Software Engineer. I remember getting pulled into a meeting where several managers who knew nothing about technology were desperately trying to place blame, figure out how to avoid this in the future, and so on.

"There should have been automated backups. That's really the only thing inexcusable here.", I said.

The "producer" (no experience, is now a director of operations, I think?) running the meeting said that was all well and good, but what else could we do to ensure that nobody makes this mistake again? "People are going to make mistakes", I said, "what you need to focus on is how to prevent it from sinking the company. All you need for that is backups. It's not the engineer's fault.". I was largely ignored (which eventually proved to be a pattern) and so went on about my business.

And business was dumb. I had to fix an awful lot of technical things in my time there.

When I started, only half of the client code was in version control. And it wasn't even the most recent shipped version. Where was the most recent version? On a Mac Mini that floated around the office somewhere. People did their AS3 programming in notepad or directly on the timeline. There were no automated builds, and builds were pushed from peoples' local machines -often contaminated by other stuff they were working on. Art content live on our CDN may have had source (PSD/FLA) distributed among a dozen artist machines, or else the source for it was completely lost.

That was just the technical side. The business/management side was and is actually more hilarious. I have enough stories from that place to fill a hundred posts, but you can probably get a pretty good idea by imagining a yogurt-salesman-cum-CEO, his disbarred ebay art fraudster partner, and other friends directing the efforts of senior software engineers, artists, and other game developers. It was a god damn sitcom every day. Not to mention all of the labor law violations. Post-acquisition is a whole 'nother anthology of tales of hilarious incompetence. I should write a book.

I recall having lunch with the author when he asked me "What should I do?". I told him that he should leave. In hindsight, it might have been the best advice I ever gave.

[+] Morendil|13 years ago|reply
So the person who made a split-second mistake while doing his all for the business was pressured into resigning - basically, got fired.

What I want to know is what happened to whoever decided that backups were a dispensable luxury? In 2010?

There's a rule that appears in Jerry Weinberg's writings - the person responsible for a X million dollar mistake (and who should be fired over such a mistake) is whoever has controlling authority over X million dollars' worth of the company's activities.

A company-killing mistake should result in the firing of the CEO, not in that of the low-level employee who committed the mistake. That's what C-level responsibility means.

(I had the same thing happen to me in the late 1990's, got fired over it. Sued my employer, who opted to settle out of court for a good sum of money to me. They knew full well they had no leg to stand on.)

[+] lkrubner|13 years ago|reply
Klicknation is hiring. Of themselves, they say:

"We make astonishingly fun, ferociously addictive games that run on social networks. ...KlickNation boasts a team of extremely smart, interesting people who have, between them, built several startups (successful and otherwise); written a novel; directed music videos; run game fan sites; illustrated for Marvel Comics and Dynamite Entertainment with franchises like Xmen, Punisher, and Red Sonja; worked on hit games like Tony Hawk and X-Men games; performed in rock bands; worked for independent and major record lables; attended universities like Harvard, Stanford, Dartmouth, UC Berkeley; received a PhD and other fancy degrees; and built a fully-functional MAME arcade machine."

And this is hilarious: their "careers" page gives me a 404:

http://www.klicknation.com/careers/

That link to "careers" is from this page:

http://www.klicknation.com/contact/

I am tempted to apply simply to be able to ask them about this. It would be interesting to hear if they have a different version of this story, if it is all true.

[+] caseysoftware|13 years ago|reply
One of the things I like asking candidates is "Tell me about a time you screwed up so royally that you were sure you were getting fired."

Let's be honest, we all have one or two.. and if you don't, then your one or two are coming. It's what you learned to do differently that I care about.

And if you don't have one, you're either a) incredibly lucky, b) too new to the industry, or c) lying.

[+] kibwen|13 years ago|reply
"I found myself on the phone to Rackspace, leaning on a desk for support, listening to their engineer patiently explain that backups for this MySQL instance had been cancelled over 2 months ago."

Here's something I don't get: didn't Rackspace have their own daily backups of the production server, e.g. in case their primary facility was annihilated by a meteor (or some more mundane reason, like hard drive corruption)?

Regardless, here's a thought experiment: suppose that Rackspace did keep daily backups of every MySQL instance in their care, even if you're not paying for the backup service. Now suppose they get a frantic call from a client who's not paying for backups, asking if they have any. How much of a ridiculous markup would Rackspace need to charge to give the client access this unpaid-for backup, in order to make the back-up-every-database policy profitable? I'm guessing this depends on 1) the frequency of frantic phone calls, 2) the average size of a database that they aren't being paid to back up, and 3) the importance and irreplacebility of the data that they're handling (and 4) the irresponsibility of their major clients).

[+] laumars|13 years ago|reply
I really feel sorry for this guy. Accidents happen, which is why development happens in a sandboxed copy of the live system and why back ups are essential. It simple shouldn't be possible (or at least, that easy) for human error to put an entire company in jeopardy.

Take my own company, I've accidentally deleted /dev on development servers (not that major of an issue thanks to udev, but the timing of the mistake was lousy), a co-worker recently dropped a critical table on dev database and we've had other engineers break Solaris by carelessly punching in chmod -R / as root (we've since revised engineers permissions so this is no longer possible). As much as those errors are stupid and as much as engineers of our calibre should know better, it can only takes a minor lack of concentration at the wrong moment to make a major fsck up. Which is doubly scary when you consider how many interruptions the average engineer gets a day.

So I think the real guilt belongs to the entire technical staff as this is a cascade of minor fcsk ups that lead to something catastrophic.

[+] cantlin|13 years ago|reply
Last year I worked at a start-up that had manually created accounts for a few celebrities when they launched, in a gutsy and legally grey bid to improve their proposition†. While refactoring the code that handled email opt-out lists I missed a && at the end of a long conditional and failed to notice a second, otherwise unused opt-out system that dealt specifically with these users. It was there to ensure they really, really never got emailed. The result?

http://krugman.blogs.nytimes.com/2011/08/11/academia-nuts/

What a screw up!

These mistakes are almost without fail a healthy mix of individual incompetence and organisational failure. Many things - mostly my paying better attention to functionality I rewrite, but also the company not having multiple undocumented systems for one task, or code review, or automated testing - might have saved the day.

[†] They've long been removed.

[+] bambax|13 years ago|reply
Once, a long time ago, I spent the best part of a night writing a report for college, on an Amstrad PPC640 (http://en.wikipedia.org/wiki/PPC_512).

Once I was finished, I saved the document -- "Save" took around two minutes (which is why I rarely saved).

I had an external monitor that was sitting next to the PC; while the saving operation was under way, I decided I should move the monitor.

The power switch was on top of the machine (unusual design). While moving the monitor I inadvertently touched this switch and turned the PC of... while it was writing the file.

The file was gone, there was no backup, no previous version, nothing.

I had moved the monitor in order to go to bed, but I didn't go to bed that night. I moved the monitor back to where it was, and spent the rest of the night recreating the report, doing frequent backups on floppy disks, with incremental version names.

This was in 1989. I've never lost a file since.

[+] JohnBooty|13 years ago|reply
This happened to me once on a much smaller scale. Forgot the "where" clause on a DELETE statement. My screwup, obviously.

We actually had a continuous internal backup plan, but when I requested a restore, the IT guy told me they were backing up everything but the databases, since "they were always in use."

(Let that sink in for a second. The IT team actually thought that was an acceptable state of affairs: "Uh, yeah! We're backing up! Yeah! Well, some things. Most things. The files that don't get like, used and stuff.")

That day was one of the lowest feelings I ever had, and that screwup "only" cost us a few thousand dollars as opposed to the millions of dollars the blog post author's mistake cost the company. I literally can't imagine how he felt.

[+] islon|13 years ago|reply
I know how you felt. Many years ago when I was a junior working in a casual game company, I were to add a bunch of credits to a poker player (fake money). I forget the where in the SQL clause and added credits to every player in our database. Lucky me it was an add and not a set and I could revert it. Another time I was going to shutdown my pc (a debian box) using "shutdown -h now" and totally forgot that I was in a ssh session to our main game server. I had to call the tech support overseas and tell him to physically turn on the server...
[+] newishuser|13 years ago|reply
You did them more good than harm.

1) Not having backups is an excuse-less monumental fuckup.

2) Giving anyone delete access to your production db, especially a junior dev through a GUI tool, is an excuse-less monumental fuckup.

Hopefully they rectified these two problems and are now a stronger company for it.

[+] mootothemax|13 years ago|reply
If you ever notice that your employer or client isn't backing up important data, take a tip from me: do a backup, today, in your free time, and if possible, and again in your free time, create the most basic regular backup system you can.

When the time comes, and someone screws up, you will seem like a god when you deliver your backup, whether it's a 3-month-old one-off, or from your crappy daily backup system.

[+] danielh|13 years ago|reply
That is good advice, just make sure that it doesn't look like you are stealing your customers/clients data.
[+] rocky1138|13 years ago|reply
This.

Even if it's old, when you're facing no data or old data, old data looks like heaven.