We deleted the production database by accident

[+] skrebbel|5 years ago|reply

I'm appalled at the way some people here receive an honest postmortem of a human fuck-up. The top 3 comments, as I write this, can be summarized as "no, it's your fault and you're stupid for making the fault".

This is not good! We don't want to scare people into writing less of these. We want to encourage people to write more of them. An MBA style "due to a human error, we lost a day of your data, we're tremendously sorry, we're doing everything in our power yadayada" isn't going to help anybody.

Yes, there's all kinds of things they could have done to prevent this from happening. Yes, some of the things they did (not) do were clearly mistakes that a seasoned DBA or sysadmin would not make. Possibly they aren't seasoned DBAs or sysadmins. Or they are but they still made a mistake.

This stuff happens. It sucks, but it still does. Get over yourselves and wish these people some luck.

[+] t0mas88|5 years ago|reply

The software sector needs a bit of aviation safety culture: 50 years ago the conclusion "pilot error" as the main cause was virtually banned from accident investigation. The new mindset is that any system or procedure where a single human error can cause an incident is a broken system. So the blame isn't on the human pressing the button, the problem is the button or procedure design being unsuitable. The result was a huge improvement in safety across the whole industry.

In software there is still a certain arrogance of quickly calling the user (or other software professional) stupid, thinking it can't happen to you. But in reality given enough time, everyone makes at least one stupid mistake, it's how humans work.

[+] ganafagol|5 years ago|reply

It's good to have a post mortem. But this was not actually a post mortem. They still don't know how it could happen. Essentially, how can they write "We’re too tired to figure it out right now." and right after attempt to answer "What have we learned? Why won’t this happen again?" Well obviously you have not learned the key lesson yet since you don't know what it is! And how can you even dream of claiming to guarantee that it won't happen again before you know the root cause?

Get some sleep, do a thorough investigation, and the results of that are the post mortem that we would like published and where you learn from.

Publishing some premature thoughts without actual insight is not helping anybody. It will just invite the hate that you are seeing in this thread.

[+] ordu|5 years ago|reply

> I'm appalled at the way some people here receive an honest postmortem of a human fuck-up. The top 3 comments, as I write this, can be summarized as "no, it's your fault and you're stupid for making the fault".

It seems that people annoyed mostly by "complexity gremlins". They are so annoyed, that they miss previous sentence "we’re too tired to figure it out right now." Guys fucked up their system, they restored it the best they could, they tried to figure out what happened, but failed. So they decided to do PR right now, to explain what they know, and to continue the investigation later.

But people see just "complexity gremlins". The lesson learned is do not try any humor in a postmortem. Be as serious, grave, and dull as you can.

[+] rawgabbit|5 years ago|reply

For me, this is an example of DevOps being carried too far.

What is to stop developers for checking into Github "drop database; drop table; alter index; create table; create database; alter permission;"? They are automating environment builds and so that is more efficient right? In my career, I have seen a Fortune 100 company's core system down and out for a week because of hubris like this. In large companies, data flows downstream from a core system. When you have to restore from backup, that cascades into restores in all the child systems.

Similarly, I once had to convince a Microsoft Evangelist who was hired into my company, not to redeploy our production database, every-time we had a production deployment. He was a pure developer and did not see any problems of dropping the database, recreating the database, and re-inserting all the data. I argued that a) this would take 10+ hours b) the production database has data going back many years and that the schema/keys/rules/triggers have evolved during that time-- meaning that many of the inserts would fail because they didn't meet the current schema. He was unconvinced but luckily my bosses overruled him.

My bosses were business types and understood accounting. In accounting, once you "post" a transaction to the ledger that becomes permanent. If you need to correct that transaction, then you create a new one that "credits" or corrects the entry. You don't take out the eraser.

[+] bromuro|5 years ago|reply

I think you should wait 10+ hours to read different kind of comments on HN.

For example, if i open the comments about a “14 hours ago” post, I usually see a top comment about other comments (like yours).

I then feel so out of the loop because i don’t see the “commenters” your are referring too - so the thread that follows seem off topic to me.

[+] caspii|5 years ago|reply

Thanks

[+] qz2|5 years ago|reply

I disagree.

Culturally speaking we like to pat people on their back when they do something stupid and comfort them. But most of the time this isn’t productive because it doesn’t instil the requisite fear required when working out what decision to make.

What happens is we have growing complacency and disassociation from consequences.

Do you press the button on something potentially destructive because your are confident it is ok through analysis, good design and testing or confidence it is ok through trite complacency?

The industry is mostly the latter and it has to stop. And the first thing is calling bad processes, bad software and stupidity out for what it is.

Honestly these guys did good but most will try and hide this sort of fuck up or explain it away with weasel words.

[+] michelpp|5 years ago|reply

> Computers are just too complex and there are days when the complexity gremlins win.

I'm sorry for your data loss, but this is a false and dangerous conclusion to make. You can avoid this problem. There are good suggestions in this thread, but I suggest you use Postgres's permission system to REVOKE DROP action on production except for a very special user that can only be logged in by a human, never a script.

And NEVER run your scripts or application servers as a superuser. This is a dangerous antipattern embraced by many and ORM and library. Grant CREATE and DROP to non-super users.

[+] sushshshsh|5 years ago|reply

As a mid level developer contributing to various large corporate stacks, I would say the systems are too complex and it's too easy to break things in non obvious ways.

Gone are the days of me just being able to run a simple script that accesses data read only an exports the result elsewhere as an output.

[+] auroranil|5 years ago|reply

Tom Scott made a mistake with a similar outcome as this article, but with an SQL query that is much more subtle than DROP.

https://www.youtube.com/watch?v=X6NJkWbM1xk

By all means, find ways to fool-proof the architecture. But be prepared for scenarios where some destructive action happens to a production database.

[+] thih9|5 years ago|reply

> You can avoid this problem.

The article isn’t claiming that the problem is impossible to solve.

On the contrary: “However, we will figure out what went wrong and ensure that this particular error doesn’t happen again.”.

[+] DelightOne|5 years ago|reply

If you use terraform to deploy the managed production database, do you use the postgresql terraform provider to create roles or are you creating them manually?

[+] bsder|5 years ago|reply

> You can avoid this problem.

No, you can't. No matter how good you are, you can always "rm -rf" your world.

Yes, we can make it harder, but, at the end of the day, some human, somewhere, has to pull the switch on the stuff that pushes to prod.

You can clobber prod manually, or you accidentally write an erroneous script that clobbers prod. Either way--prod is toast.

The word of the day is "backups".

[+] centimeter|5 years ago|reply

> but this is a false and dangerous conclusion to make

Until we get our shit together and start formally verifying the semantics of everything, their conclusion is 100% correct, both literally and practically.

[+] oppositelock|5 years ago|reply

You have to put a lot of thought into protecting and backing up production databases, and backups are not good enough without regular testing of recovery.

I have been running Postgres in production supporting $millions in business for years. Here's how it's set up. These days I use RDS in AWS, but the same is doable anywhere.

First, the primary server is configured to send write ahead logs (WAL) to a secondary server. What this means is that before a transaction completes on the master, she slave has written it too. This is a hot spare in case something happens to the master.

Secondly, WAL logs will happily contain a DROP DATABASE in them, they're just the transaction log, and don't prevent bad mistakes, so I also send the WAL logs to backup storage via WAL-E. In the tale of horror in the linked article, I'd be able to recover the DB by restoring from the last backup, and applying the WAL delta. If the WAL contains a "drop database", then some manual intervention is required to only play them back up to the statement before that drop.

Third is a question of access control for developers. Absolutely nobody should have write credentials for a prod DB except for the prod services. If a developer needs to work with data to develop something, I have all these wonderful DB backups lying around, so I bring up a new DB from the backups, giving the developer a sandbox to play in, and also testing my recovery procedure, double-win. Now, there are emergencies where this rule is broken, but it's an anomalous situation handled on a case by case basis, and I only let people who know what they're doing touch that live prod DB.

[+] azeirah|5 years ago|reply

Quick tip for anyone learning from this thread.

If you're using MySQL, it's called a binary log and not a Write Ahead Log, it was very difficult to find meaningful Google results for "MySQL WAL"

[+] x87678r|5 years ago|reply

Interesting, I immediately thought they would have a transaction log, I didn't think it would have the delete as well.

Its a real problem that we used to have trained DBAs to own the data where now devs and automatic tools are relied upon, there isn't a culture or toolset built up yet to handle it.

[+] mr_toad|5 years ago|reply

> I have all these wonderful DB backups lying around, so I bring up a new DB from the backups

It’s nice to have that capability, but some databases are just too big to have multiple copies lying around, or to able to create a sandbox for everyone.

[+] danellis|5 years ago|reply

> after a couple of glasses of red wine, we deleted the production database by accident

> It’s tempting to blame the disaster on the couple of glasses of red wine. However, the function that wiped the database was written whilst sober.

It was _written_ then, but you're still admitting to the world that your employees do work on production systems after they've been drinking. Since they were working so late, one might think this was emergency work, but it says "doing some late evening coding". I think this really highlights the need to separate work time from leisure time.

[+] aszen|5 years ago|reply

I had a narrow escape once doing something fancy with migrations.

We had several MySQL string columns as long text type in our database but they should have been varchar(255) or so. So I was assigned to convert these columns to their appropriate size.

Being the good developer I was, I decided to download a snapshot of the prod database locally and checked the maximum string length we had for each column via a script. Using this script it made a migration query that would alter column types to match their maximum used length keeping the minimum length as varchar (255).

I tested that migration and everything looked good, it passed code review and was run on prod. Soon after we start getting complaints from users that their old email texts have been truncated. I then realize the stupidity of the whole thing, the local dump of production database always wiped out many columns clean for privacy like the email body column. So the script thought it had max length of 0 and decided to convert the column to varchar(255).

I realize the whole thing may look incredibly stupid, that's only because the naming for db columns was in a foreign european language so I didn't know even know the semantics of each column.

Thankfully my seniors managed to restore that column and took the responsibility themselves since they had passed the review.

We still did fix those unusually large columns but this time by simple duplicate alter queries for each of those columns instead of using fancy scripts.

I think a valuable lesson was learned that day to not rely on hacky scripts just to reduce some duplicate code.

I now prefer clarity and explicitness when writing such scripts instead of trying to be too clever and automating everything.

[+] heavenlyblue|5 years ago|reply

And you didn’t even bother to do a query of the actual maximum length value of the columns you were mutating? Or at least query and see the text in there?

Basically you just blindly ran the migration on the data and checked if it didn’t fail?

The lesson here is not about cleverness unfortunately.

[+] john_moscow|5 years ago|reply

Just my 2 cents. I run a small software business that involves a few moderately-sized databases. The day I moved from a fully managed hosting to a Linux VPS, I have crontabbed a script like this to run several times a day:

    for db in `mysql [...] | grep [...]`
    do
        mysqldump [...] > $db.sql
    done
    
    git commit -a -m "Automatic backup"
    git push [backup server #1]
    git push [backup server #2]
    git push [backup server #3]
    git gc

The remote git repos are configured with denyNonFastForwards and denyDeletes, so regardless of what happens to the server, I have a full history of what happened to the databases, and can reliably go back in time.

I also have a single-entry-point script that turns a blank Linux VM into a production/staging server. If your business is more than a hobby project and you're not doing something similar, you are sitting on a ticking time bomb.

[+] candiddevmike|5 years ago|reply

Anyone reading the above: please don't do this. Git is not made for database backups, use a real backup solution like WAL archiving or dump it into restic/borg. Your git repo will balloon at an astronomical rate, and I can't imagine why anyone would diff database backups like this.

[+] wolfgang000|5 years ago|reply

I don't believe having a massive repo with backups would be the ideal solution. Couldn't you just upload the backup to an s3 bucket instead?

[+] Ayesh|5 years ago|reply

This is what I do too.

The mysqldump command is tweaked to use individual INSERT clauses as opposed to one bulk one, so the diff hunks are smaller.

You can also sed and remove the mysqldump timestamp, so there will be no commits if there are no database changes, saving the git repo space.

[+] mgkimsal|5 years ago|reply

Any issues with the privacy aspect of that data that's stored in multiple git repos? PII and such?

[+] bufferoverflow|5 years ago|reply

You should really compress them instead of dumping them raw into Git. LZ4 or ZStandard are good.

[+] amingilani|5 years ago|reply

Happens to all of us. Once I required logs from the server. The log file was a few gigs and still in use. so I carefully duplicated it, grepped just the lines I needed into another file and downloaded the smaller file.

During this operation, the server ran out of memory—presumably because of all the files I'd created—and before I know it I'd managed to crash 3 services and corrupted the database—which was also on this host—on my first day. All while everyone else in the company was asleep :)

Over the next few hours, I brought the site back online by piecing commands together from the `.bash_history` file.

[+] xtracto|5 years ago|reply

This happened to me (someone in my team) a while ago but with mongo. The production database was ssh-tunneled to the default port of the guys computer and he ran tests that cleaned the database first.

Now... our scenario was such that we could NOT lose those 7 hours because each customer record lost meant $5000 usd penalty.

What saved us is that I knew about the oplog (binlog in mysql) so after restoring the backup i isolated the last N hours lost from the log and replayed it on the database.

Lesson learned and a lucky save.

[+] muststopmyths|5 years ago|reply

>Note that host is hardcoded to localhost. This means it should never connect to any machine other than the developer machine. We’re too tired to figure it out right now. The gremlins won this time.

Obviously, somehow the script ran on the database host.

some practices I've followed in the past to keep this kind of thing from happening:

* A script that deletes all the data can never be deployed to production.

* scripts that alter the DB rename tables/columns rather than dropping them (you write a matching rollback script ), for at least one schema upgrade cycle. you can always restore from backups, but this can make rollbacks quick when you spot a problem at deployment time.

* the number of people with access to the database in prod is severely restricted. I suppose this is obvious, so I'm curious how the particular chain of events in TFA happened.

[+] amluto|5 years ago|reply

I have a little metadata table in production that has a field that says “this is a production database”. The delete-everything script reads that flag via a SQL query that will error out of it’s set in the same transaction as the deletion. To prevent the flag from getting cleared in production, the production software stack will refuse to run if the “production” flag is not set.

[+] mcpherrinm|5 years ago|reply

The blog mentions it's a managed DigitalOcean database, so the script likely wasn't run on the host itself.

More likely, I'd suspect, is something like an SSH tunnel with port forwarding was running, perhaps as part of another script.

[+] StavrosK|5 years ago|reply

Someone SSHed to production and forwarded the database port to the local machine to run a report, then forgot about the connection and ran the deletion script locally.

[+] PeterisP|5 years ago|reply

One aspect that can help with this is separate roles/accounts for dangerous privileges.

I.e. if Alice is your senior DBA who would have full access to everything including deleting the main production database, then it does not mean that the user 'alice' should have the permission to execute 'drop database production' - if that needs to be done, she can temporarily escalate the permissions to do that (e.g. a separate account, or separate role added to the account and removed afterwards, etc).

Arguably, if your DB structure changes generally are deployed with some automated tools, then the everyday permissions of senior DBA/developer accounts in the production environment(s) should be read-only for diagnostics. If you need a structural change, make a migration and deploy it properly; if you need an urgent ad-hoc fix to data for some reason (which you hopefully shouldn't need to do very often), then do that temporary privilege elevation thing; perhaps it's just "symbolic" but it can't be done accidentally.

[+] jlgaddis|5 years ago|reply

> the number of people with access to the database in prod is severely restricted

And of those people, there should be an even fewer number with the "drop database" privilege on prod.

Also, from a first glance, it looks like using different database names and (especially!) credentials between the dev and prod environments would be a good idea too.

[+] unnouinceput|5 years ago|reply

Quote: "Note that host is hardcoded to localhost. This means it should never connect to any machine other than the developer machine. Also: of course we use different passwords and users for development and production. We’re too tired to figure it out right now.

The gremlins won this time."

No they didn't. Instead one of your gremlins ran this function directly on the production machine. This isn't rocket science, just the common sense conclusion. Now it would be a good time to check those auditing logs / access logs you're suppose to have them enabled on said production machine.

[+] colechristensen|5 years ago|reply

This is bad operations.

That it happened meant that there were many things wrong with the architecture, and summing up the problem to “these things happen” is irresponsible, most importantly your response to a critical failure needs to be in the mindset of figuring out how you would have prevented the error without knowing it was going to happen and doing so in several redundant ways.

Fixing the specific bug does almost nothing for your future reliability.

[+] cblconfederate|5 years ago|reply

> Computers are just too complex and there are days when the complexity gremlins win.

Wow. But then again it's not like programmers handle dangerous infrastructure like trucks, military rockets or nuclear power plants. Those are reserved for adults

[+] ricksharp|5 years ago|reply

Are you sure it was the production database that was affected?

If you are not sure how a hard coded script that was targeting localhost affected a production database, how do you know you were even viewing the production database as the one dropped?

Maybe you were simply connected to the wrong database server?

I’ve done that many times - where I had an initial “oh no“ moment and then realized I was just looking at the wrong thing, and everything was ok.

I’ve also accidentally deployed a client website with the wrong connection string and it was quite confusing.

In an even more extreme case: I had been deploying a serverless stack to the entirely wrong aws account - I thought I was using an aws named profile and I was actually using the default (which changed when I got a new desktop system). I.e. aws cli uses —profile flag, but serverless cli uses —aws-profile flag. (Thankfully this all happened during development.)

I now have deleted default profiles from my aws config.

[+] jrochkind1|5 years ago|reply

The lack of the seriousness/professionalism of the postmortem seemed odd to me too. So, okay, what is this site?

> KeepTheScore is an online software for scorekeeping. Create your own scoreboard for up to 150 players and start tracking points. It's mostly free and requires no user account.

And also:

> Sat Sep 5, 2020, Running Keepthescore.co costs around 171 USD each month, whilst the revenue is close to zero (we do make a little money by building custom scoreboards now and then). This is an unsustainable situation which needs to be fixed – we hope this is understandable! To put it another way: Keepthescore.co needs to start making money to continue to exist.

https://keepthescore.co/blog/posts/monetizing-keepthescore/

So okay, it's basically a hobby site, for a service that most users probably won't really mind losing 7 hours of data, and that has few if any paying customers.

That context makes it make a little bit more sense.

[+] emodendroket|5 years ago|reply

This post is embarrassing. "yeah we were drinking and accidentally nuked the prod DB. Not sure why. Shit happens!" Who would read this and think they should trust this company? Any number of protections could have been taken to prevent this and production access in any state other than fully alert and attentive shouldn't happen unless it is absolutely necessary for emergency reasons

[+] bstar77|5 years ago|reply

I think it's kind of funny they chose to post this story rather than do a typical post mortem.

[+] corobo|5 years ago|reply

This reply is embarrassing. It's a person working on their side project. Have a glass of wine mate.

[+] tcbasche|5 years ago|reply

Yeah why should I treat anything this company does with any level of seriousness? Why should anyone?

It's lucky it's just some online scoreboard because I'm sure as shit this stuff has happened before with more critical systems and it scares the hell out of me that engineers are fine blaming "gremlins" instead of taking responsibility for their own incompetence.

[+] mbroshi|5 years ago|reply

I love this post. This sort of thing happens to everyone, most people just are not willing to be so open about it.

I was once sshed to the production server, and was cleaning up some old files that got created by an errant script, one which file was '~'. So, to clean it up, I type `rm -rf ~`.

[+] heelix|5 years ago|reply

Ah man, these things happen. One of our developers - very new to elastic - was asked to modify some indexes. Folks were a bit too busy to help or heading out on holiday. One stack overflow answer later... delete and recreate it... and she was off to the races. When the test was tried, it looked like things still worked. A quick script did the same to stage and prod, in both data centers. Turns out that is not a great way to go about it. It deleted the documents. We got lucky, as we still had not killed off the system we were migrating off of and it only took three days of turn and burn to get the data back on the system.

So many lessons learned that day. I trust her with the master keys at this point, as nobody is more careful with production than her now. :)

[+] fideloper|5 years ago|reply

RDS is so very worth paying for this type of issue (in many cases, obviously $60 to multiple thousands a month isn’t great for everything).

Otherwise having a binlog based backup (or WAL, I guess, but i don’t know PG that well) is critical.

The key point there is they provide point in time recovery possibilities (and even the ability to rewrite history).

[+] lysp|5 years ago|reply

I had a client who had prod database access due to it being hosted internally. They called up saying "their system is no longer working".

After about an hour of investigation, I find one of the primary database tables is empty - completely blank.

I then spend the next hour looking through code to see if there's any chance of a bug that would wipe their data and couldn't find anything that would do that.

I then had to make "the phone call" to the client saying that their primary data table had been wiped and I didn't know what we did wrong.

Their response: "Oh I wrote a query and accidentally did that, but thought I stopped it".

[+] dvdbloc|5 years ago|reply

At my job, the company computers are configured to send “localhost” to the local company DNS servers, of which they happily reply with the IP address of the last machine that’s gotten a DHCP lease with the hostname “localhost”. Which happens often. Needless to say, our IT dept isn’t the best.

426 comments