mililani|11 years ago

This may be a little off topic, but I used to think RAID 5 and RAID 6 were the best RAID configs to use. They seemed to offer the best bang for the buck. However, after seeing how long it took to rebuild an array after a drive failed (over 3 days), I'm much more hesitant to use those RAID levels. I much prefer RAID 1+0, even though the overall cost is nearly double that of RAID 5. It's much faster, and there is no parity rebuild process if the RAID controller is smart enough: you just swap the failed drive, and the controller keeps serving from the surviving mirror and then copies it onto the new drive. Much faster, and much less prone to multiple drive failures killing the entire RAID.

halfcat|11 years ago

This cannot be stressed strongly enough. There is never a case where RAID5 is the best choice, ever [1]. There are cases where RAID0 is mathematically more reliable than RAID5 [2]. RAID5 should never be used for anything where you value keeping your data. I am not exaggerating when I say that, very often, your data is safer on a single hard drive than it is on a RAID5 array. Please let that sink in.

The problem is that once a drive fails, if any of the surviving drives hits an unrecoverable read error (URE) during the rebuild, the entire array fails. On consumer-grade SATA drives rated at 1 URE per 10^14 bits read, that means that if the data on the surviving drives totals 12TB, the probability of the array failing the rebuild is close to 100%. Enterprise SAS drives are typically rated at 1 URE per 10^15 bits, so you improve your chances ten-fold. Still an avoidable risk.

RAID6 suffers from the same fundamental flaw as RAID5, but the probability of complete array failure is pushed back one level, making RAID6 with enterprise SAS drives possibly acceptable in some cases, for now (until hard drive capacities get larger).

I no longer use parity RAID. Always RAID10 [3]. If a customer insists on RAID5, I tell them they can hire someone else, and I am prepared to walk away.

I haven't even touched on the ridiculous cases where it takes RAID5 arrays weeks or months to rebuild, while an entire company limps inefficiently along. When productivity suffers company-wide, the decision makers wish they had paid the tiny price for a few extra disks to do RAID10.

In the article, he has 12x 4TB drives. With two drives failed, and assuming he is using enterprise drives (Dell calls them "near-line SAS", essentially enterprise SATA), there is roughly a 33% chance the entire array fails if he tries to rebuild. If the drives are plain SATA, there is almost no chance the rebuild completes.
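
The back-of-envelope model behind these figures looks roughly like this (a sketch only: it assumes the published URE rate applies uniformly and independently to every bit read, and ignores correlated failures and sector granularity; the quoted percentages correspond to the simple linear estimate):

    import math

    def rebuild_failure_odds(data_read_tb, ure_per_bits):
        """Return (expected UREs, P(at least one URE)) for reading data_read_tb terabytes,
        assuming one URE per ure_per_bits bits read, independently per bit."""
        bits_read = data_read_tb * 1e12 * 8
        expected = bits_read / ure_per_bits  # naive linear estimate
        p_any = -math.expm1(bits_read * math.log1p(-1.0 / ure_per_bits))
        return expected, p_any

    # RAID5 rebuild reading 12TB from consumer SATA survivors (1 URE per 10^14 bits)
    print(rebuild_failure_odds(12, 1e14))   # ~0.96 expected UREs -> "close to 100%" on the linear view

    # The article's case: 12x 4TB RAID6 with two drives down, so roughly 40TB read
    # from the 10 survivors, at near-line SAS (10^15) and plain SATA (10^14) rates
    print(rebuild_failure_odds(40, 1e15))   # ~0.32 expected UREs -> roughly the 33% figure
    print(rebuild_failure_odds(40, 1e14))   # ~3.2 expected UREs  -> a URE during rebuild is near certain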

[1] http://www.smbitjournal.com/2012/11/choosing-a-raid-level-by...

[2] http://www.smbitjournal.com/2012/05/when-no-redundancy-is-mo...

[3] http://www.smbitjournal.com/2012/11/one-big-raid-10-a-new-st...

Twirrim|11 years ago

Note that the 10^14 figure is only what the HDD manufacturers publish, and it has been the same for something like a decade. It's a nice, safe, conservative figure that seems impressively high to IT directors and accountants, yet it's low enough that the manufacturers can easily fall back on it as a reason why a drive failed to meet a customer's expectations.

In reality you'll find that drives perform significantly better than that, arguably orders of magnitude better.

That said, I'm still not a fan of RAID5. Both the rebuild speed and the probability of batch failures (if one drive fails, the probability of another failing is significantly higher when they came from the same batch) make it too risky a prospect for my comfort.

feld|11 years ago

The 10^14 bit error rate must be false, or else routine ZFS scrubs would be producing documented read errors. I'm inclined to believe the rest of the math here is wrong as well.

coned88|11 years ago

What I don't understand about the URE argument is from [2]. Say you have a 12TB RAID 5 array that needs a rebuild. The article says that at 1 URE per 10^14 bits you approach an unrecoverable error at around 12TB read. But what makes the array-wide 12TB the relevant figure? Each disk is individually rated per 10^14 bits it reads, and during the rebuild each of the 4x3TB disks only reads about 3TB, so each one should still have plenty of headroom.

Forlien|11 years ago

I think your calculation of the odds of failing an array rebuild is wrong. Can you show how you got those numbers?

t0mas88|11 years ago

There is even a community for it: BAARF - http://www.miracleas.com/BAARF/

The main reason is not only the rebuild time (which is indeed horrible on a loaded system) but also the write performance, which can be terrible if you do a lot of writing.

We, for example, more than doubled the throughput of Cassandra by dropping RAID5. So the saving of 2 disks per server would, in the end, have required buying almost twice as many servers.
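
The usual back-of-envelope for why parity RAID hurts writes is the per-write I/O penalty. A sketch (illustrative numbers only; it ignores controller caches, full-stripe writes, and the largely sequential nature of Cassandra's own writes):

    def random_write_iops(n_drives, iops_per_drive, level):
        """Rough random-write IOPS ceiling for an array, using the classic write penalty."""
        penalty = {
            "raid10": 2,  # every write lands on both sides of a mirror
            "raid5": 4,   # read old data + read old parity + write data + write parity
            "raid6": 6,   # as RAID5, but with two parity blocks to read and rewrite
        }[level]
        return n_drives * iops_per_drive / penalty

    # e.g. 8 spindles at a rough ~150 random IOPS each
    print(random_write_iops(8, 150, "raid5"))   # ~300
    print(random_write_iops(8, 150, "raid10"))  # ~600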

ntsb|11 years ago

Wouldn't rsync of been a better and more reliable choice for this?

golan|11 years ago

The problem with rsync is that it would have to traverse the whole FS tree and check every single file on both sides, comparing timestamps and sizes. On a relatively small FS tree that's fine, but when you start having GBs and GBs and tens of thousands of files, it becomes somewhat impractical.

Then, you need to come up with other creative solutions, like deep syncing inside the FS tree, etc. Fun times.

Add checksumming, and you'd probably want to take a holiday whilst it copies the data :-)
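
A toy illustration of the difference (not what rsync actually does internally; the paths are hypothetical): comparing size and mtime is one stat() per file, whereas checksumming has to read every byte on both sides, which is what makes it so painful at this scale.

    import hashlib
    import os

    def same_by_metadata(a, b):
        """Cheap: one stat() per file, no file data read."""
        sa, sb = os.stat(a), os.stat(b)
        return sa.st_size == sb.st_size and int(sa.st_mtime) == int(sb.st_mtime)

    def same_by_checksum(a, b, chunk=1 << 20):
        """Expensive: reads every byte of both files."""
        def digest(path):
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for block in iter(lambda: f.read(chunk), b""):
                    h.update(block)
            return h.hexdigest()
        return digest(a) == digest(b)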

grammar|11 years ago

s/of/have

greenknight|11 years ago

I had a disk fail in a RAID 6 a couple of weeks ago... 8 x 3TB drives, 18TB usable. It took 7 hours to rebuild.

The disks were connected via SATA 3 and I was using mdadm.

dspillett|11 years ago

Assuming it is reading and writing almost simultaneously, that is ~125Mbyte/sec on each physical drive, which isn't bad considering the latency of the read-from-the-survivors-then-checksum-then-write-to-the-replacement process, especially if the array was in use at the time.

There are a few ways to improve RAID rebuild time (see http://www.cyberciti.biz/tips/linux-raid-increase-resync-reb... for example), but be aware that they tend to negatively affect the performance of other use of the array while the rebuild is happening (that is, you are essentially telling the kernel to give the rebuild process more priority than it normally would get when other processes are accessing the data too).
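
The usual knobs here are the md resync speed limits and, for parity arrays, the stripe cache. A minimal sketch of setting them, assuming a Linux md array named md0 and root privileges; the values are illustrative only:

    from pathlib import Path

    # Raise the kernel's per-device resync speed floor and ceiling (KB/s;
    # the defaults are 1000 and 200000). A higher floor keeps the rebuild
    # moving even when the array is busy, at the cost of foreground I/O.
    Path("/proc/sys/dev/raid/speed_limit_min").write_text("50000\n")
    Path("/proc/sys/dev/raid/speed_limit_max").write_text("500000\n")

    # For RAID5/6 md arrays, a larger stripe cache can also speed the rebuild
    # (entries are ~4KB per member device, so this costs RAM).
    Path("/sys/block/md0/md/stripe_cache_size").write_text("8192\n")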

click170|11 years ago

In my experience with Debian, the rebuild is sensitive to disk usage: if you start using the disks, the rebuild intentionally slows down so as not to choke the drives. That could explain why it was fast for you but not for the parent.

Edit: To clarify, I was using mdadm, not hardware raid.

LeoPanthera|11 years ago

How does the rebuild time of ZFS's raidz compare to RAID5/6?

craigyk|11 years ago

My largest ZFS pool is currently ~64TB (3 x 10x3TB raidz2 vdevs). The pool has ranged from 85%-95% full (it's mostly at 85% now and used mostly for reads).

Resilvering one drive usually takes < 48 hours. The last one took ~32 hours.

Something else cool: when I was planning this out, I wrote a little script to calculate the chance of failure. With a drive AFR of 10% (which is pretty high) and a rebuild speed of 50MB/sec (very low compared to what I typically get), I have a 1% chance of losing the entire pool over a 46.6-year period. If I add 4 more 10x3TB raidz2 vdevs, that would drop to 3.75 years.
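
That kind of script is usually an MTTDL-style back-of-envelope. A rough reconstruction (not the original script; it only counts whole-drive failures, so the numbers won't line up exactly):

    import math

    HOURS_PER_YEAR = 8766

    def p_pool_loss(n_per_vdev, n_vdevs, disk_tb, afr, rebuild_mb_s, years):
        """Probability of losing a pool of raidz2 vdevs within `years`, using the
        classic MTTDL model: loss needs three overlapping drive failures in one vdev."""
        mtbf = HOURS_PER_YEAR / afr                          # hours, from the annual failure rate
        mttr = disk_tb * 1e12 / (rebuild_mb_s * 1e6) / 3600  # resilver time in hours
        n = n_per_vdev
        mttdl_vdev = mtbf ** 3 / (n * (n - 1) * (n - 2) * mttr ** 2)
        # The pool is lost if any vdev is lost; treat vdevs as independent.
        return -math.expm1(-n_vdevs * years * HOURS_PER_YEAR / mttdl_vdev)

    # 3 vdevs of 10 x 3TB raidz2, 10% AFR, 50MB/s resilver, over 46.6 years
    print(p_pool_loss(10, 3, 3, 0.10, 50, 46.6))

This bare version ignores UREs hit during the resilver, which is typically what pushes the real-world numbers much higher.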

XorNot|11 years ago

I have a 24TB RAIDZ3 array. Losing a disk out of that takes about 12-18 hours to rebuild.

diplomatpuppy|11 years ago

It really depends on the load and on whether your raidz has 1, 2, or 3 parity drives. From what I've experienced, resilvering the new drive in a raidz seems to happen at about 30% of the maximum ZFS throughput under medium load. And it's even better than many RAID5/6 implementations in that it only resilvers the data actually stored, not the entire drive.

fiatmoney|11 years ago

It is generally superior because it can skip blank blocks; conventional RAID rebuilds the whole device byte by byte.

kyrra|11 years ago

If you want to burn even more disks, 6+0 or 5+0 can be a nice way to cut down on the worst-case scenario of losing data.

beachstartup|11 years ago

In my opinion, RAID 10 is by far the best if you have to support the operations of a production site in a real-world environment. Just keep spare controllers and drives handy; then it's just a matter of swapping out failures.

three cost levers: capacity, speed, and downtime.

Basically, at some point you realize that the cheapest cost is the capacity itself; everything else is an order of magnitude harder to justify. Downtime and slowness are simply unacceptable, whereas buying more stuff is a perfectly acceptable and quite feasible solution in an enterprise environment.

hobs|11 years ago

Exactly.

I work with a number of medium-sized businesses, and you can clearly tell which of them have had major downtime due to these kinds of hardware failures and which haven't.

Many people who have never experienced issues have no interest in even forming a basic DR plan, but the ones who have just start throwing money at you the moment you mention it.

Buying more stuff is WAY cheaper than slowness and downtime, and penny-pinching on this count is going to cause a lot of additional heartache for something that will cost more in the end anyway.