It's not us you need to tell, Joel; it's your business partner. And if this is your way of telling him, isn't it a little passive-aggressive to do it in a blog post?
I think a lot of people think they have backups but have never restored them, so I thought it would be good practice if everyone started thinking in terms of "have we restored?" rather than "are we backing up?" Of course, there's no question that the thought process started when Jeff Atwood's personal blog was lost, but don't think for a minute that the only way I communicate with him is by blogging... we talk all the time, over Skype, over email, over FogBugz, and sometimes, when there's something other people can learn, in public on the internet.
Unless I've missed further developments in the story, his business partner trusted the "backups" of the hosting company, which did not restore properly.
So this is aimed one step farther up the chain.
Or, rather, it's aimed at all of us. It is a lesson that a lot of people need to learn.
What I find somewhat charming/quaint about Joel's posts on "Operations" over the last six years that I've been reading him is that he is, slowly but surely, discovering the "Art of Operations" - albeit at a glacial pace.
Most people who work in production operations environments of any scale discover in the first two to three years of their career what has taken Joel the better part of a decade.
I almost feel like the airplane passenger sitting beside Fred Brooks Jr. - Brooks saw him reading his book, The Mythical Man-Month, and asked the guy (who had no idea who he was sitting beside) what he thought of it. The gentleman responded that it was basically a summary of things he already knew. Joel is a giant in the industry, but he does have a tendency to discover/restate the obvious.
"It's not backups, but the restores that matter" - is kind of the mantra of every single person who has ever been responsible for backups.
Then you go to _any_ class on running a production environment, and you discover things like RPO, RTO, dress rehearsals, etc., and the whole "it's restores that matter" begins to look quaint.
What I find amusing about the whole thing is that it's like a microcosm of how developers always think operations is trivial and unimportant... until they have to do it themselves! :-)
Totally agree on these points. The Joel-Atwood experience is not that of two programmers starting a company. It's a story of two programmers learning about System Administration.
This is good food for thought. Let's also add a concept from a different realm: everybody has at least two DNS servers listed in their /etc/resolv.conf, right? The reason is that if one of them goes down, there is the other one.
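For instance, a typical /etc/resolv.conf with two resolvers might look like this (the addresses are illustrative placeholders, not anyone's real servers):

```
# /etc/resolv.conf - two resolvers for redundancy
nameserver 192.0.2.53     # primary (example address)
nameserver 198.51.100.53  # secondary, tried if the first times out
```

The resolver only fails over to the second entry after the first one times out, so this buys availability, not speed.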
So this seems like a good lesson to take about backups. Maybe three or four copies? One by your hosting provider, one at Tarsnap, one on a separate DAT tape, one on a USB stick?
A good point, though, is that even something as big as a DAT tape looks pretty small by the standards of what we need to back up today.
It's helpful to boost signal for this message. It's an old message, but planning and practicing system restores can be as expensive in terms of equipment and manpower as actually making the backups. This leads to a lot of neglect.
... shouldn't it be common practice to test your backup system to make sure that the restore procedure meets the requirements of the client (company, etc?)
The IT company that I work for creates a backup system based on the requirements of our clients and then demonstrates the whole backup and restore procedure to make sure that it falls in line with what our client actually wants. It's really not difficult to do. Sure, some of the restore procedures may be slower (depending on other requirements, such as cost), but the client knows that will be the case and signs off on it.
Common sense and obvious. In the 1980s I worked on a large DARPA project where a huge hit was taken because our admins never tried to restore from backups. It is the kind of lesson that is (hopefully) learned with just one bad experience.
This is another reason why I like EC2 deployments: it is fairly easy to take your backups (automated deployment scripts, application, data) and spin up another copy of your whole system (except for flipping the DNS). Make sure those EBS-backed EC2 AMIs are really bootable and functioning :-)
I tend to be a little paranoid about backups, so I have a few different disks backing up my main, desktop, machine. But I also use one of my backups to sync data to my laptop, not quite a full "restore" due to a big size difference in the respective drives. But, generally speaking, the two machines are in sync and I can be sure that at least one of my backups works reasonably well.
This whole ordeal is getting me motivated to actually buy a cloud backup service (personal use, not business use). I was thinking of Carbonite or Backblaze. Anyone have any experience with those?
I used Carbonite in early 2006 and the software was horrible. Eventually I tried to uninstall it, and the process failed. I was left with a half-installation that wouldn't work and couldn't be removed. I didn't try too hard after that, because I was planning to format and start over.
Full disk image backups are a good solution for this problem. No worries about partial backups or a complex restoration process. It's totally inefficient but storage is cheaper than man hours.
I'm assuming you're talking about some kind of atomic file-or-block-level backup such as LVM snapshots? Large files such as databases can change while reading them over a long period of time, so a standard disk image or file copy wouldn't be reliable for a live system.
What he's saying is, "We failed miserably at having good process and procedures. Because of this we are going to lecture others, and point the blame at everybody but ourselves, in hopes that they'll stop pointing out how much credibility we lost over this."
If you think this post is about linguistics, you're missing the point
Since the restore is the important thing, that's the one you have to test. And if you haven't tested restoring, your backup is (quite possibly) worthless.
It's a well-duh, but still most people fail to test restores. This may be the greatest thing about distributed version control: every clone is a restore. A limited restore, but still a restore.
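To make "test the restore" concrete, here is a minimal sketch of an automated restore drill in Python. The tar-based backup and file names are illustrative assumptions, not anyone's actual setup: back up a directory, restore it somewhere else, and verify the checksums match.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path

def checksums(root: Path) -> dict:
    """Map each file's relative path to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

def backup(src: Path, archive: Path) -> None:
    """Make the backup: a gzipped tar of the source directory."""
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src, arcname=".")

def restore_and_verify(archive: Path, src: Path, dest: Path) -> bool:
    """The drill: restore the archive and compare it to the original."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
    return checksums(src) == checksums(dest)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        src = tmp / "data"
        src.mkdir()
        (src / "app.conf").write_text("port = 8080\n")
        (src / "users.db").write_bytes(b"\x00" * 1024)

        archive = tmp / "backup.tar.gz"
        backup(src, archive)

        restored = tmp / "restored"
        restored.mkdir()
        print("restore OK:", restore_and_verify(archive, src, restored))
```

The point is that the verification step runs unattended, so it can be scheduled; a backup job that has never passed this check is just a hope.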
backups are for suckers. keep the data on a few different spinning disks. if you can solve data sync between two sites, just keep your data synced.
it's much better to ask yourself how long it takes to replicate your existing system than how to back up. PXE boot to a kernel you can install over the network, use bcfg2 to get the thing up to spec, start copying data.
a lot of machines can be back up and configured in 5 minutes.
that said, i'm not you. i don't have terabytes of data to run statistics on. maybe there are other horrible details i'm forgetting. fast rebuilding is a pretty awesome strategy for a lot of cases.
maybe there are other horrible details i'm forgetting
Yes.
Why should I bother to write this? I'll outsource the task to the authors of High Performance MySQL, Second Edition, page 475:
Backup Myth #1: "I Use Replication As a Backup"
This is a mistake we see quite often. A replication slave is not a backup. Neither is a RAID array. To see why, consider this: will they help you get back all your data if you accidentally execute DROP DATABASE on your production database? RAID and replication don't pass even this simple test. Not only are they not backups, they're not a substitute for backups. Nothing but backups fill the need for backups.
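A toy illustration of the book's point, using SQLite as a stand-in (the table and names are made up for the sketch): a perfectly synchronized replica dutifully replays the destructive statement, while a point-in-time backup copy still has the data.

```python
import sqlite3

# "Production" database and a naive replica that replays every statement.
prod = sqlite3.connect(":memory:")
replica = sqlite3.connect(":memory:")

def run_everywhere(sql: str) -> None:
    """Replication: the replica executes whatever production executes."""
    prod.execute(sql)
    replica.execute(sql)

run_everywhere("CREATE TABLE users (name TEXT)")
run_everywhere("INSERT INTO users VALUES ('alice'), ('bob')")

# A backup is a point-in-time copy, decoupled from the live stream.
snapshot = sqlite3.connect(":memory:")
prod.backup(snapshot)

# The fat-fingered command propagates to the replica...
run_everywhere("DROP TABLE users")

# ...so the replica can't save you, but the backup can.
def has_users(db: sqlite3.Connection) -> bool:
    row = db.execute(
        "SELECT count(*) FROM sqlite_master WHERE name = 'users'"
    ).fetchone()
    return row[0] == 1

print("replica still has data:", has_users(replica))   # → False
print("backup still has data:", has_users(snapshot))   # → True
```

The same logic applies to RAID: the mirror faithfully mirrors the deletion.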
A real-time mirror won't help if your data gets corrupted (via a fat-fingered shell command, an errant script, or a compromised system). When it's appropriate, it's awesome, but I don't think it's a replacement for a static "offline" backup.
Sorry, this is not only WRONG but actually HARMFUL advice.
Redundancy is not a backup. If someone with full admin control to your system can destroy all of your data then you do not have backups. A proper backup is physically separate from your primary data and, preferably, can't be destroyed with mere admin access to the system. The number of sites that have had catastrophic data loss due to relying on mirroring instead of true backups is quite significant.
gaius | 16 years ago
spolsky | 16 years ago
mechanical_fish | 16 years ago
jseifer | 16 years ago
unknown | 16 years ago
[deleted]
DenisM | 16 years ago
Remember: great people discuss ideas, normal people discuss events, shallow people discuss other people.
ghshephard | 16 years ago
theBobMcCormick | 16 years ago
josephkern | 16 years ago
lifeisstillgood | 16 years ago
wglb | 16 years ago
Sukotto | 16 years ago
Don't blindly rely on your partner to do it... Trust, but verify.
Goladus | 16 years ago
loupgarou21 | 16 years ago
mark_l_watson | 16 years ago
DenisM | 16 years ago
hexis | 16 years ago
DannoHung | 16 years ago
slig | 16 years ago
jsz0 | 16 years ago
peterwwillis | 16 years ago
orblivion | 16 years ago
lazyant | 16 years ago
dnsworks | 16 years ago
aw3c2 | 16 years ago
itgoon | 16 years ago
Most places have very reliable backup procedures. Most of those have very poor restore procedures - I'd say about half fail when put to the test.
michael_dorfman | 16 years ago
flogic | 16 years ago
jfoutz | 16 years ago
mechanical_fish | 16 years ago
idlewords | 16 years ago
It's also not useful if your main files get corrupted and you diligently propagate the corruption. See: ma.gnolia
spolsky | 16 years ago
I don't mean to be rude, I don't know anything about you, but if you were a system administrator working for me, today would be your last day.
gaius | 16 years ago
LOL! And if you get a corrupt block on your primary site, what're you going to do? All your standbys are instantly tainted!
Better leave this one to the grownups.
jws | 16 years ago
Spinning disks are good though. A lovely spot for backups. Just put them in a different building.
Edit: In the time it took me to write this 5 other people also lambasted this poor fellow. Ouch.
apowell | 16 years ago
InclinedPlane | 16 years ago