It's not us you need to tell, Joel; it's your business partner. And if this is your way of telling him, isn't it a little passive-aggressive to do it in a blog post?
I think a lot of people think they have backups but have never restored them, so I thought it would be good practice if everyone started thinking in terms of "have we restored?" rather than "are we backing up?" Of course, there's no question that the thought process started when Jeff Atwood's personal blog was lost, but don't think for a minute that the only way I communicate with him is by blogging... we talk all the time, over Skype, over email, over FogBugz, and sometimes, when there's something other people can learn, in public on the internet.
Unless I've missed further developments in the story, his business partner trusted the "backups" of the hosting company, which did not restore properly.
So this is aimed one step farther up the chain.
Or, rather, it's aimed at all of us. It is a lesson that a lot of people need to learn.
What I find somewhat charming/quaint about Joel's posts on "Operations" over the last six years that I've been reading him is that he is, slowly but surely, discovering the "Art of Operations" - albeit at a glacial pace.
Most people who work in production operations environments of any scale discover in the first two to three years of their career what has taken Joel the better part of a decade.
I almost feel like the airplane passenger sitting beside Fred Brooks Jr. - Brooks saw him reading his book, The Mythical Man-Month, and asked the guy (who had no idea who he was sitting beside) what he thought of it. The gentleman responded that it was basically a summary of things he already knew. Joel is a giant in the industry, but he does have a tendency to discover/restate the obvious.
"It's not backups, but the restores that matter" - is kind of the mantra of every single person who has ever been responsible for backups.
Then you go to _any_ class on running a production environment, and you discover things like RPO, RTO, dress rehearsals, etc., and the whole "it's restores that matter" begins to look quaint.
What I find amusing about the whole thing is that it's like a microcosm of how developers always think operations is trivial and unimportant... until they have to do it themselves! :-)
Totally agree on these points. The Joel-Atwood experience is not that of two programmers starting a company. It's a story of two programmers learning about System Administration.
This is good food for thought. Let's also add a concept from a different realm: everybody has at least two DNS servers listed in their /etc/resolv.conf, right? The reason is that if one of them goes down, there is the other one.
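For instance, a typical /etc/resolv.conf with two resolvers might look like this (the addresses are illustrative placeholders, not anyone's real servers):

```
# /etc/resolv.conf - two resolvers for redundancy
nameserver 192.0.2.53     # primary (example address)
nameserver 198.51.100.53  # secondary, tried if the first times out
```

The resolver only fails over to the second entry after the first one times out, so this buys availability, not speed.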
So this seems like a good lesson to take about backups. Maybe three or four copies? One by your hosting provider, one at Tarsnap, one on a separate DAT tape, one on a USB stick?
A good point, though, is that even something as big as a DAT tape looks pretty small by the standards of what we need to back up today.
It's helpful to boost signal for this message. It's an old message, but planning and practicing system restores can be as expensive in terms of equipment and manpower as actually making the backups. This leads to a lot of neglect.
... shouldn't it be common practice to test your backup system to make sure that the restore procedure meets the requirements of the client (company, etc?)
The IT company that I work for creates a backup system based on the requirements of our clients and then demonstrates the whole backup and restore procedure to make sure that it falls in line with what our client actually wants. It's really not difficult to do. Sure, some of the restore procedures may be slower (depending on other requirements, such as cost), but the client knows that will be the case and signs off on it.
Common sense and obvious. In the 1980s I worked on a large DARPA project where a huge hit was taken because our admins never tried to restore from backups. It is the kind of lesson that is (hopefully) learned with just one bad experience.
This is another reason why I like EC2 deployments: it is fairly easy to take your backups (automated deployment scripts, application, data) and spin up another copy of your whole system (except for flipping the DNS). Make sure those EBS-backed EC2 AMIs are really bootable and functioning :-)
I tend to be a little paranoid about backups, so I have a few different disks backing up my main, desktop, machine. But I also use one of my backups to sync data to my laptop, not quite a full "restore" due to a big size difference in the respective drives. But, generally speaking, the two machines are in sync and I can be sure that at least one of my backups works reasonably well.
This whole ordeal is getting me motivated to actually buy a cloud backup service (personal use, not business use). I was thinking of Carbonite or Backblaze. Anyone have any experience with those?
I used Carbonite in early 2006 and the software was horrible. Eventually I tried to uninstall it, and the process failed. I was left with a half-installation that wouldn't work and couldn't be removed. I didn't try too hard after that, because I was planning to format and start over.
Full disk image backups are a good solution for this problem. No worries about partial backups or a complex restoration process. It's totally inefficient but storage is cheaper than man hours.
I'm assuming you're talking about some kind of atomic file-or-block-level backup such as LVM snapshots? Large files such as databases can change while reading them over a long period of time, so a standard disk image or file copy wouldn't be reliable for a live system.
What he's saying is, "We failed miserably at having good process and procedures. Because of this we are going to lecture others, and point the blame at everybody but ourselves, in hopes that they'll stop pointing out how much credibility we lost over this."
If you think this post is about linguistics, you're missing the point
Since the restore is the important thing, that's the one you have to test. And if you haven't tested restoring, your backup is (quite possibly) worthless.
It's a well-duh, but still most people fail to test restores. This may be the greatest thing about distributed version control: every clone is a restore. A limited restore, but still a restore.
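To make "test the restore" concrete, here is a minimal sketch of an automated restore drill in Python. The tar-based backup and file names are illustrative assumptions, not anyone's actual setup: back up a directory, restore it somewhere else, and verify the checksums match.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path

def checksums(root: Path) -> dict:
    """Map each file's relative path to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

def backup(src: Path, archive: Path) -> None:
    """Make the backup: a gzipped tar of the source directory."""
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src, arcname=".")

def restore_and_verify(archive: Path, src: Path, dest: Path) -> bool:
    """The drill: restore the archive and compare it to the original."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
    return checksums(src) == checksums(dest)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        src = tmp / "data"
        src.mkdir()
        (src / "app.conf").write_text("port = 8080\n")
        (src / "users.db").write_bytes(b"\x00" * 1024)

        archive = tmp / "backup.tar.gz"
        backup(src, archive)

        restored = tmp / "restored"
        restored.mkdir()
        print("restore OK:", restore_and_verify(archive, src, restored))
```

The point is that the verification step runs unattended, so it can be scheduled; a backup job that has never passed this check is just a hope.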
backups are for suckers. keep the data on a few different spinning disks. if you can solve data sync between two sites, just keep your data synced.
it's much better to ask yourself how long it takes to replicate your existing system than how to back up. PXE boot to a kernel you can install over the network, use bcfg2 to get the thing up to spec, start copying data.
a lot of machines can be back up and configured in 5 minutes.
that said, i'm not you. i don't have terabytes of data to run statistics on. maybe there are other horrible details i'm forgetting. fast rebuilding is a pretty awesome strategy for a lot of cases.
maybe there are other horrible details i'm forgetting
Yes.
Why should I bother to write this? I'll outsource the task to the authors of High Performance MySQL, Second Edition, page 475:
Backup Myth #1: "I Use Replication As a Backup"
This is a mistake we see quite often. A replication slave is not a backup. Neither is a RAID array. To see why, consider this: will they help you get back all your data if you accidentally execute DROP DATABASE on your production database? RAID and replication don't pass even this simple test. Not only are they not backups, they're not a substitute for backups. Nothing but backups fill the need for backups.
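A toy illustration of the book's point, using SQLite as a stand-in (the table and names are made up for the sketch): a perfectly synchronized replica dutifully replays the destructive statement, while a point-in-time backup copy still has the data.

```python
import sqlite3

# "Production" database and a naive replica that replays every statement.
prod = sqlite3.connect(":memory:")
replica = sqlite3.connect(":memory:")

def run_everywhere(sql: str) -> None:
    """Replication: the replica executes whatever production executes."""
    prod.execute(sql)
    replica.execute(sql)

run_everywhere("CREATE TABLE users (name TEXT)")
run_everywhere("INSERT INTO users VALUES ('alice'), ('bob')")

# A backup is a point-in-time copy, decoupled from the live stream.
snapshot = sqlite3.connect(":memory:")
prod.backup(snapshot)

# The fat-fingered command propagates to the replica...
run_everywhere("DROP TABLE users")

# ...so the replica can't save you, but the backup can.
def has_users(db: sqlite3.Connection) -> bool:
    row = db.execute(
        "SELECT count(*) FROM sqlite_master WHERE name = 'users'"
    ).fetchone()
    return row[0] == 1

print("replica still has data:", has_users(replica))   # → False
print("backup still has data:", has_users(snapshot))   # → True
```

The same logic applies to RAID: the mirror faithfully mirrors the deletion.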
A real-time mirror won't help if your data gets corrupted (via a fat-fingered shell command, an errant script, or a compromised system). When it's appropriate, it's awesome, but I don't think it's a replacement for a static "offline" backup.
Sorry, this is not only WRONG but actually HARMFUL advice.
Redundancy is not a backup. If someone with full admin control to your system can destroy all of your data then you do not have backups. A proper backup is physically separate from your primary data and, preferably, can't be destroyed with mere admin access to the system. The number of sites that have had catastrophic data loss due to relying on mirroring instead of true backups is quite significant.
gaius | 16 years ago
spolsky | 16 years ago
mechanical_fish | 16 years ago
jseifer | 16 years ago
unknown | 16 years ago
[deleted]
DenisM | 16 years ago
Remember: great people discuss ideas, normal people discuss events, shallow people discuss other people.
ghshephard | 16 years ago
theBobMcCormick | 16 years ago
josephkern | 16 years ago
lifeisstillgood | 16 years ago
wglb | 16 years ago
Sukotto | 16 years ago
Don't blindly rely on your partner to do it... Trust, but verify.
Goladus | 16 years ago
loupgarou21 | 16 years ago
mark_l_watson | 16 years ago
DenisM | 16 years ago
hexis | 16 years ago
DannoHung | 16 years ago
slig | 16 years ago
jsz0 | 16 years ago
peterwwillis | 16 years ago
orblivion | 16 years ago
lazyant | 16 years ago
dnsworks | 16 years ago
aw3c2 | 16 years ago
itgoon | 16 years ago
Most places have very reliable backup procedures. Most of those have very poor restore procedures - I'd say about half fail when put to the test.
michael_dorfman | 16 years ago
flogic | 16 years ago
jfoutz | 16 years ago
mechanical_fish | 16 years ago
idlewords | 16 years ago
It's also not useful if your main files get corrupted and you diligently propagate the corruption. See: ma.gnolia
spolsky | 16 years ago
I don't mean to be rude, I don't know anything about you, but if you were a system administrator working for me, today would be your last day.
gaius | 16 years ago
LOL! And if you get a corrupt block on your primary site, what're you going to do? All your standbys are instantly tainted!
Better leave this one to the grownups.
jws | 16 years ago
Spinning disks are good though. A lovely spot for backups. Just put them in a different building.
Edit: In the time it took me to write this 5 other people also lambasted this poor fellow. Ouch.
apowell | 16 years ago
InclinedPlane | 16 years ago