item 13694925

zytek | 9 years ago

To each of you guys having those extensive backup solutions (like NAS + cloud sync, second nas, etc)...

.. do you actually TEST those backups?

This question comes from my experience as a systems engineer who found a critical bug in our MySQL backup solution that prevented the backups from being restored (inconsistent filesystem). Also, a friend of mine learned the hard way that his Backblaze backup was unrestorable.

discuss


majewsky|9 years ago

Very true. I overheard a similar conversation last week at work: "We have set up the backup procedure for our new production databases." - "Have you tested restore?" - "Well, uhm..." - sound of JIRA ticket being opened

By the way, I misread your username and, for a second, thought you were sytse.

vassilevsky|9 years ago

I'd love to hear that sound :)

Juliate|9 years ago

Excellent point, and it raises another question: how do you actually test your backups? Of course, each case is specific, but is there a "best practice" checklist, or some general points to check for basic restoration?

trcollinson|9 years ago

Excellent question. I test my backups and restores on a regular basis. Each environment within my infrastructure takes a slightly different approach.

Application

This is by far the easiest for me to test. We have a CI/CD job which literally creates a new environment, from scratch, and deploys our application to it in a production configuration. It runs a test suite which exercises functionality across the application. Finally, it destroys the environment. It reports on each portion of the process. In this way we know exactly how long it would take to redeploy the entire application from scratch on new infrastructure and get it up and running. This morning it took about 6 minutes total before tests ran.
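The shape of that job can be sketched as a tiny driver. Everything below is a hypothetical stand-in (`provision`, `deploy`, `run_tests`, and `destroy` are placeholders, not real tooling); it only illustrates the provision → deploy → test → destroy → report cycle the commenter describes:

```python
import time

def provision():
    # Stand-in for "create a new environment from scratch".
    return {"env": "ephemeral-test"}

def deploy(env):
    # Stand-in for deploying the app in a production configuration.
    env["deployed"] = True

def run_tests(env):
    # Stand-in for the functional test suite.
    return env.get("deployed", False)

def destroy(env):
    # Stand-in for tearing the environment down again.
    env.clear()

def rebuild_from_scratch():
    """Time a full provision -> deploy -> test -> destroy cycle."""
    start = time.monotonic()
    env = provision()
    deploy(env)
    passed = run_tests(env)
    destroy(env)
    return passed, time.monotonic() - start

passed, elapsed = rebuild_from_scratch()
print(f"tests passed: {passed}, full rebuild took {elapsed:.2f}s")
```

The useful part is the report at the end: the job doubles as a continuously updated measurement of your time-to-redeploy.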

Database

We are running an RDBMS. We use a combination of daily full backups, incremental transaction-log backups, and point-in-time backups. Again, in our CI/CD, when a full backup is taken it is pulled, loaded, and a test routine is run against it to check integrity. At that point, the previous day's recovery is destroyed. When a transaction-log backup is made, CI/CD picks up the change, applies it to the restored full backup, and runs a set of integrity tests. This leaves us with a warm standby ready to be switched over to if the main database server goes down. We have never had to use the warm standby in an emergency, but we have a test to make sure we can cut over to it as well.

Point-in-time backup testing ties back to our application test above. The application test spins up with a point-in-time recovery of the database backup. It tests the integrity of that recovery and then tests the application against it. Finally, it swaps from the point-in-time recovered database to the warm standby and runs the test suite against that for integrity as well.
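The restore-and-verify loop above can be shown in miniature. The commenter's RDBMS is unnamed, so this assumed sketch uses SQLite purely to illustrate the shape: take a dump, load it into a brand-new database, and run integrity checks against the restore rather than the original:

```python
import sqlite3

# Build a stand-in "production" database.
prod = sqlite3.connect(":memory:")
prod.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
prod.executemany("INSERT INTO orders (total) VALUES (?)", [(9.99,), (25.0,)])
prod.commit()

# "Full backup": a SQL dump, analogous to mysqldump/pg_dump output.
dump = "\n".join(prod.iterdump())

# "Restore": load the dump into a completely fresh database...
restored = sqlite3.connect(":memory:")
restored.executescript(dump)

# ...and assert against the restore, not the original.
assert restored.execute("PRAGMA integrity_check").fetchone()[0] == "ok"
assert restored.execute("SELECT COUNT(*) FROM orders").fetchone()[0] == 2
print("restore verified")
```

The point is that the checks run against the restored copy; a backup that dumps cleanly but won't load (as in the MySQL story at the top of the thread) fails this test immediately.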

File Store

People often forget this, but those buckets that hold all of your file storage in the cloud can be destroyed so easily (sad, sad experience taught me this). We test those as well. I am sure you can guess at this point how we do that: CI/CD. It's a rather simple process with a ton of gain.
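The comment doesn't say how the file-store check works, but one minimal sketch (assuming a local mirror of the bucket and a stored checksum manifest, both hypothetical here) is to re-hash every file and compare against the manifest:

```python
import hashlib
import pathlib
import tempfile

def sha256(path):
    return hashlib.sha256(path.read_bytes()).hexdigest()

def verify(store, manifest):
    """Re-hash every file in the store and compare against the manifest."""
    problems = []
    for name, digest in manifest.items():
        f = store / name
        if not f.exists():
            problems.append(f"missing: {name}")
        elif sha256(f) != digest:
            problems.append(f"corrupt: {name}")
    return problems

with tempfile.TemporaryDirectory() as d:
    store = pathlib.Path(d)
    (store / "a.txt").write_bytes(b"hello")
    manifest = {"a.txt": sha256(store / "a.txt")}
    assert verify(store, manifest) == []                 # intact
    (store / "a.txt").write_bytes(b"tampered")
    assert verify(store, manifest) == ["corrupt: a.txt"]
    print("verification works")
```

Run from CI on a schedule, a check like this catches both silently corrupted objects and objects that have vanished from the bucket.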

A few notes

People always ask me this, so I will answer it first. Yes, this costs money. It's not as bad as running a second production environment, but it will cost you a bit. My follow-up question is: how much does downtime cost you?

My CI/CD is always GitLab CI at this point. I've used Jenkins. I've used Travis. I like GitLab CI. You can do all of this with any of them.

We script literally everything. Computers are so good at repetitive tasks. Why would you EVER do anything manually? Really. If it has to do with your infrastructure, script it.

If anyone has any questions about these ideas, feel free to reach out.

RossM|9 years ago

I've been thinking about this as well - it seems like it would fit in nicely with other CI jobs. With database backups, for example, you should be able to script the restore procedure and apply some assertions to check it worked. The bonus is that you then have a ready-made script for when you actually need to restore.

BatFastard|9 years ago

I had a company that I was doing some work for come to me to ask for a copy of the database. Their backups were corrupt, and they didn't find out until they tried to restore one. They had 5 years of bad backups.

marcosdumay|9 years ago

I'm a big fan of setting up testing and dev environments from the production backups.

For personal backup of files, I just verify the results are in place. I've checked them once or twice, but honestly, I'm more concerned about my scripts stopping running than about them running and not being correct.

Daviey|9 years ago

That is fine, providing you don't operate in a confidential or regulated environment. :/

chadcmulligan|9 years ago

it's not a backup till you test it - just complicated wishful thinking.

chopin|9 years ago

As a colleague of mine says: You don't want a backup, you want a restore.

zapu|9 years ago

I'm using CrashPlan and I have recovered multiple files over the past couple of years that I either mistakenly deleted or overwrote. I haven't tried any full-scale restore yet, though.

ValentineC|9 years ago

CrashPlan lost some data of mine in 2013 after querying a corrupted Volume Shadow Copy Service database on Windows. (At least, that was their explanation. I'm surprised that their client did not independently verify the data after it was uploaded.)

I moved off CrashPlan in 2016 because their upload speed continues to be embarrassingly slow outside the US even with deduplication and compression turned off (they have a datacentre where I'm at, but it's for Enterprise customers only).

They also highly recommend having 1GB of RAM for every 1TB backed up, which sounded a bit unreasonable to me.

mironathetin|9 years ago

".. do you actually TEST those backups?"

yes (of course), see my post above.

I have used Time Machine repeatedly to restore lost or damaged files. I have also replaced hard drives several times and restored from my Carbon Copy Cloner clone. It boots, and I have never missed a file in years.

notheguyouthink|9 years ago

This is one thing I like about doing content-addressed storage. I've been toying with my own implementation, quite similar to Camlistore.

The net result is it's super simple to verify an entire datastore as being valid or not.
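A minimal toy illustrates why verification falls out for free with content addressing (this is an assumed sketch, not Camlistore's actual design): the key of each blob is the hash of its content, so validating the whole datastore is just re-hashing everything:

```python
import hashlib

class CAStore:
    """Toy content-addressed store: key == SHA-256 of the content."""

    def __init__(self):
        self.blobs = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.blobs[key] = data
        return key

    def verify(self):
        """Return the keys of any blobs whose content no longer matches."""
        return [k for k, v in self.blobs.items()
                if hashlib.sha256(v).hexdigest() != k]

store = CAStore()
key = store.put(b"some file contents")
assert store.verify() == []        # every blob matches its key
store.blobs[key] = b"bit rot"      # simulate silent corruption
assert store.verify() == [key]     # flagged immediately
print("datastore verified")
```

No separate manifest is needed; the addressing scheme itself is the integrity check.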

atmosx|9 years ago

I have restored files from Backblaze multiple times. It is a great solution for non-technical people or offices that have at least 24 Mbps connections up/down.

RubyPinch|9 years ago

what happened to the backblaze backup?

CodeWriter23|9 years ago

Interesting question. I wonder if it was prior to when BB moved to the "direct wire" architecture in Storage Pod 4.0.