top | item 25358268

ZFS: Use mirror vdevs, not RAIDZ

115 points | segfaultbuserr | 5 years ago | jrs-s.net | reply

127 comments

[+] tjoff|5 years ago|reply
I don't agree.

> And if one of your disks failed, and age was a factor… you’re going to be sweating bullets wondering if another will fail before your resilver completes.

So every single time you lose any drive in a mirror setup, you risk all the data on all drives in the entire pool. I sure do hope you aren't on vacation and/or don't have to order a drive online.

It all depends on your use case. For me, raidz3 wins easily. Performance during resilvering is not something most users would suffer that much from anyway. And that 8-drive recommendation comes from many factors and doesn't apply to most home users anyway. Not that you should go overboard with it, but it will easily pay for fast cache or whatever else you might want.

If performance were the goal you would not be using spinning rust anyway, and if you are still limited by a single gigabit link then don't even think about it (but don't over-utilize your pool).

> But there are still lots of potential ways for your data to die, and you still need to back up your pool. Period. PERIOD!

Of course! Yet for home users there is not a single reasonable way to do it if you have a decently sized pool. There are bound to be sacrifices in what you choose to back up.

Except for maybe another pool. Which is going to hurt, since ZFS with buying everything up front is VASTLY more expensive than RAID where you can grow the array as needed - under the assumption that storage needs grow slowly, which they typically do for home users.

As a ZFS user and fanboy, the idea of a dual raid6 setup is tempting. But I can't compromise on the filesystem, so in the end I've compromised on backups instead. Likely not the smartest move considering how rare bitrot is, I am very well aware.

For many drives I'd go for raidz3. And for a 4-drive NAS I'd go with raidz2 rather than a mirror setup, for peace of mind.

For fast SSD pools I'd go with mirrors. Much easier to back up the entire pool as well.

[+] necheffa|5 years ago|reply
Repeat after me: RAID is not a backup solution, it's an uptime solution.
[+] sigstoat|5 years ago|reply
> Except for maybe another pool. Which is going to hurt since ZFS with buying everything up front is VASTLY more expensive than RAID where you can grow the array as needed - under the assumption that the storage needs grow slowly, which they typically do for home users.

your backup pool doesn't have to match your working pool. it certainly doesn't have to be the same size. it just has to be larger than the data you've got.

so you can build a big 64TB pool, and back it up to a set of external USB drives which expand as your actual data expands. similarly you can turn on a bit more compression for the backup pool than you might want on the main one.

[+] jiveturkey|5 years ago|reply
> Which is going to hurt since ZFS with buying everything up front is VASTLY more expensive than RAID where you can grow the array as needed - under the assumption that the storage needs grow slowly, which they typically do for home users.

Not with mirrored vdevs, which you can expand pair by pair.

I'm not arguing for the article, just pointing out your error. The article is very flawed and gives bad advice - not because mirrors are bad per se, but because the reasoning fails at basic mathematical analysis. There certainly are valid reasons to use mirrored vdevs. One might be so you can more easily expand your pool incrementally.

Anyway, from personal experience, my thought is that home users should give up on the idea of "incremental expansion" anyway. Double up each time. I'm surprised that you say home users aren't going to use 8-drive setups while at the same time decrying mirrors in favor of parity. If you have a low drive count, the storage efficiency of parity puts you close to that of mirrors anyway. raidz2 with 4 drives, as you suggest, makes almost no sense vs mirrors.
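The low-drive-count efficiency point is easy to check with a few lines of arithmetic (my own sketch, not from the thread):

```python
def usable_fraction_raidz(n_disks, parity):
    """Usable fraction of raw capacity for a single raidz vdev
    (parity 1 = raidz1, 2 = raidz2, 3 = raidz3)."""
    return (n_disks - parity) / n_disks

def usable_fraction_mirrors(width=2):
    """Usable fraction for a pool of mirror vdevs of the given width."""
    return 1 / width

# At 4 drives, raidz2 and 2-way mirrors are identical in efficiency:
print(usable_fraction_raidz(4, 2))   # 0.5
print(usable_fraction_mirrors())     # 0.5

# The space advantage of parity only opens up with wider vdevs:
print(usable_fraction_raidz(8, 2))   # 0.75
```

So at 4 drives, raidz2's only edge over mirrors is the any-two-drives failure guarantee, not capacity.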

[+] 67868018|5 years ago|reply
Your mirrors can be more than 2 wide, you can have automatic hot standbys, you can make each mirror its own zpool so loss of some drives doesn’t lose the entire pool.

Actually, losing a mirror vdev hasn't lost the whole pool for several years now. I've recovered data off a zpool where I lost one of three mirrored vdevs. It's not pretty, but your data is still there. Any files on the missing drives are just 0 bytes.

[+] aidenn0|5 years ago|reply
> Of course! Yet for home users there is not a single reasonable way to do it if you have decently sized pool. There is bound to be sacrifices on what you choose to backup.

It's $130 for an 8TB USB drive at Best Buy. If you don't mirror your backup, that's about $16/TB, which is quite reasonable.

[+] z3t4|5 years ago|reply
You can add mirrors to increase the pool size. More RAM will speed up the pool, but cache drives also work really well. And you can have a hot spare.
[+] jjav|5 years ago|reply
> So every single time you lose any drive in a mirror setup you risk all the data on all drives in the entire pool.

Not true. If you lose all the drives in a mirror vdev, you lose those files, not the entire pool.

More importantly, mirrors are not limited to two drives. You can have any number.

I use mirror vdevs with 4 drives.

[+] White_Wolf|5 years ago|reply
tbh for my 10-drive home setup I use a Z3 setup. I've had dead drives but never more than 1 at a time. I do check them weekly, though, and if they show signs of issues I replace them on the spot. I do rsync to a second unit in a different location, but (so far) I've never had to do a restore.
[+] mulmen|5 years ago|reply
What’s wrong with rsync.net or tarsnap for backups?
[+] rsync|5 years ago|reply
We (rsync.net) have several PB of raidz3 deployed all over the world.

We use conservatively sized (12-15 drive) vdevs and typically join 3 or 4 of those together to make a pool.

I can see getting nervous about raidz2 (sort of analogous to "raid6") after a drive failure ... but losing 4 drives out of 12 in a single raidz3 failure cascade is extremely improbable.

We all sleep quite well with this arrangement and have since we first migrated from UFS2 to ZFS in 2012.

[+] tutfbhuf|5 years ago|reply
May I ask what your worst incident with ZFS was ever since 2012?
[+] aidenn0|5 years ago|reply
I was running 3x 12 drive vdevs in raidz2 and write performance was terrible. We moved some data to a different machine and rebuilt as 18 mirrors. This was a long time ago (as in running on Solaris long time ago), so maybe things are better now.
[+] aduitsis|5 years ago|reply
UFS2? So you are using FreeBSD?! Yet another reason to support you folks, seriously.
[+] aDfbrtVt|5 years ago|reply
I love the backup solution you guys provide with Borg, the pricing is amazing and the product has been rock solid. Any chance of getting similar "expert level" pricing for accounts using ZFS send | receive?
[+] silenteh|5 years ago|reply
May I ask if you run a distributed filesystem on top of ZFS and if so which one ?
[+] xoa|5 years ago|reply
I looked for, and didn't see, "SSD" in this article (let alone "NVMe"). Maybe because it's from 2015? But at any rate, I'm not sure the logic applies there. High performance SSDs remain much more expensive, so losing major capacity is a much costlier issue, and simultaneously they rebuild vastly faster. I thought about this when making a pool out of U.2 NVMe drives, and with rebuild times measured in minutes and given the cost/GB I think RAIDZ2 (or even Z1) vdevs are plenty sufficient for most use cases.

By the same token, what does the backup system and unique pool data lifetime look like? If someone is using a very fast/smaller/expensive pool as a local working space, but it's constantly being replicated to a much more heavily redundant pool of spinning rust in turn backing up sufficiently fast to remote, it may be perfectly acceptable to have minimal redundancy (I still like being able to heal from corruption) in the working pool. If the whole thing going kaput only means losing a few minutes of data it's totally reasonable to consider how much money that's actually worth.

I guess a lot of the blanket advice for ZFS rubs me the wrong way. It offers a very powerful toolbox full of options that are genuinely great in different circumstances, and there aren't many footguns (dedup being the biggest one that immediately comes to mind) that are hard to reason about. It's a shame if users aren't considering their own budgets, needs, hardware, and so on and taking advantage of it to get the most out of them.

[+] minimaul|5 years ago|reply
As always with RAID-style setups, there’s an inevitable trade off of cost vs capacity vs performance.

There’s still a place for RAIDZ/RAIDZ2, and in my opinion that place is storing bulk data that isn’t too heavily accessed or that needs to be stored with an eye towards keeping £/GB down.

Yes, mirrors are faster. Yes, mirrors are easier to expand. But across 12 4TB disks that is 24TB instead of 40TB with RAIDZ2 - and that’s a lot of capacity to lose if you’re on a budget.
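The capacity gap is simple to verify (a quick sketch, assuming one 12-wide RAIDZ2 vdev vs six 2-way mirrors):

```python
# Twelve 4 TB disks, as in the comparison above
n_disks, size_tb = 12, 4

mirrors_usable = (n_disks // 2) * size_tb   # six 2-way mirror vdevs
raidz2_usable = (n_disks - 2) * size_tb     # one 12-wide raidz2 vdev

print(mirrors_usable, raidz2_usable)   # 24 40
```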

The rebuild times in this post seem high to me, though. I replaced 7x 2TB disks (nearly full) in a raidz1 in a backup pool with larger drives in about 30 hours.

[+] magicalhippo|5 years ago|reply
> The rebuild times in this post seem high to me, though.

Two factors. Disks are getting large, and the rebuild time for RAID-Z[1] is dependent on the fragmentation. In combination it means it can take ages.

I just had to replace a failing WD Red 3TB[2] in an old 4xRAID-Z1 pool, and it took 9 hours. That was a single 3TB disk. The disks in my new pool are 14TB and 16TB.

[1]: https://youtu.be/Efl0Kv_hXwY

[2]: power-on hours in SMART showed over 7 years

[+] minimaul|5 years ago|reply
The one thing in the article I agree with without reservation though is that you should always have a backup of your pool!

ZFS makes doing good backups easy, with zfs send | zfs receive.

[+] zepearl|5 years ago|reply
Btw, I understand that with OpenZFS 2 and its new "sequential resilvering" feature ( https://arstechnica.com/gadgets/2020/12/openzfs-2-0-release-... ) rebuilding should be faster, or should at least put less pressure on the drives (less random I/O).

Edit: sorry, just noticed that this was already mentioned in another subthread.

[+] jryb|5 years ago|reply
My gripe with this is that the author assumes everyone has the same workload and prefers the same set of tradeoffs. I use raidz1 on my home desktop precisely because I would absolutely prefer to have to wait for a resilver than lose data.

“So back up your data!” - of course, but that’s just an implicitly larger pool.

[+] aidenn0|5 years ago|reply
TFA suggests using mirrors instead of RAIDZ. There is no case in which RAIDZ1 is more durable against data loss than a mirror is.
[+] caillou|5 years ago|reply
I use 10x8TB in RAIDZ2 in my home server. TimeMachine backup for 6 people, docker volumes, and an excessively huge media collection.

The TimeMachine datasets are backed up offsite.

Losing this pool would be a PITA, but not critical.

My primary goal with ZFS is some data redundancy. At a good cost. And quick remote backup for a fraction of the pool. Not performance.

At one point, 2 disks died within 2 days. While there was some panic involved, the data on the server could be reproduced with some time.

There isn’t a best solution that fits all needs. If there were, ZFS wouldn’t offer all the options it does.

[+] js2|5 years ago|reply
This article is pretty hand-wavy. It doesn't give any empirical numbers at all.

I've had a small FreeNAS server using 4 x 3TB SATA drives in a mirrored config for years now and it's out of space. I'm about to build a new server using ten used 3TB SAS drives and intend to put all of them into a RAIDZ2 vdev. I care more about space than performance or rebuild times. Before I load it with data, I'll do some testing of read/write performance and rebuild times. If they're unacceptable, I'll try two smaller RAIDZ vdevs, and if that still doesn't work, I'll go back to mirrors.

[+] _jal|5 years ago|reply
The article is another entry in a long series of bad ZFS articles.

For some reason a lot of people get to a point where they're comfortable with it and suddenly their use case is everyone's, they've become an expert, and you should Just Do What They Say.

I highly recommend people ignore articles like this. ZFS is very flexible, and can serve a variety of workloads. It also assumes you know what you're doing, and the tradeoffs are not always apparent up-front.

If you want to become comfortable enough with ZFS to make your own choices, I recommend standing up your ZFS box well before you need it and playing with it. Set up configs you'd never use in production, just to see what happens. Yank a disk, figure out how to recover. If you have time, fill it up with garbage and see for yourself how fragmentation affects resilver times. If you're a serious user, join the mailing lists - they're high-signal.

And value random articles on the interwebs telling you the Real Way at what they cost you.

I'm convinced articles like this are a big part of what gives ZFS a bad name. People follow authoritative-sounding bad advice and blame their results on the file system.

[+] louwrentius|5 years ago|reply
I strongly disagree with this old blogpost.

I feel that this advice is a somewhat dishonest attempt to plaster over the fact that you can't expand a VDEV.

https://louwrentius.com/the-hidden-cost-of-using-zfs-for-you...

So you try to bury that fact by promoting mirrors. But mirrors aren't as safe as RAIDZ2, and they aren't as space-efficient.

It all depends on circumstances, but if you want to store a ton of data, RAIDZ(2|3) seems the right way to go.

Use RAIDZ(2|3) vdevs, not mirrors.

[+] somehnguy|5 years ago|reply
No thanks. I use raidz2.

With mirror vdevs if you lose the wrong 2 drives you lose everything. I can lose any 2 drives and be totally fine.

The probability of losing the wrong 2 drives at once is small, sure. But I would rather just not care about that probability. And I don't lose half my capacity, which for a home user (I don't have an unlimited budget!) matters a whole lot more than having the absolute best iops.

[+] segfaultbuserr|5 years ago|reply
The original article already included a reply to this question.

> But wait, why would I want to trade guaranteed two disk failure in RAIDZ2 with only 85.7% survival of two disk failure in a pool of mirrors? Because of the drastically shorter time to resilver, and drastically lower load placed on the pool while doing so. The only disk more heavily loaded than usual during a mirror vdev resilvering is the other disk in the vdev – which might sound bad, but remember that it’s no more heavily loaded than it would’ve been as a RAIDZ member. Each block resilvered on a RAIDZ vdev requires a block to be read from each surviving RAIDZ member; each block written to a resilvering mirror only requires one block to be read from a surviving vdev member. For a six-disk RAIDZ1 vs a six disk pool of mirrors, that’s five times the extra I/O demands required of the surviving disks.

I think it's perfectly okay to disagree. But any comment must contain a counterargument to be useful. For example, one could have argued that load is not an issue in a small array, and so on, and I may happily accept the other side of the argument. However, I respectfully point out that your comment doesn't include any counterargument, so it's not useful; please read the article more carefully next time.
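For what it's worth, the 85.7% figure can be reproduced with a little combinatorics (my own sketch, assuming 2-wide mirrors and exactly two simultaneous random failures; it matches an eight-disk pool, while a six-disk pool works out to 80%):

```python
from math import comb

def two_failure_survival(n_disks):
    """Probability that a pool of 2-way mirrors survives two
    simultaneous random disk failures among n_disks drives."""
    pairs = n_disks // 2            # number of mirror vdevs
    fatal = pairs                   # both failures hit the same vdev
    return 1 - fatal / comb(n_disks, 2)

print(round(two_failure_survival(8) * 100, 1))   # 85.7
print(round(two_failure_survival(6) * 100, 1))   # 80.0
```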

[+] 67868018|5 years ago|reply
You don’t lose everything, only the files that were on that vdev. I’ve been through this. It’s not fun, though, and your pool is irreversibly damaged, but your data is not all lost.
[+] naniwaduni|5 years ago|reply
"Use RAID10" is basically the storage equivalent of "use paper ballots". Sure, it feels suboptimal, but critically your intuition on how it can fail (mostly) works. That's a really nice property to have for storage.
[+] Aaargh20318|5 years ago|reply
One problem with mirror vdevs vs. RAIDZ2 : In the RAIDZ2 case you can lose any two drives and still have your data, in the mirror vdev case, 2 drives failing in the same vdev means all your data is gone. You could potentially be okay with losing half your drives, but only if all the failed drives happen to be part of different vdevs.
[+] sgarland|5 years ago|reply
As long as you aren't relying on it as backup, it doesn't matter for most use cases.

I'm about to build a zpool consisting of nothing but 3-wide raidz1 vdevs. I can tolerate one drive dying. In the ~8 years or so I've been running a NAS, I've had precisely one drive failure. I am fully aware that survivorship bias is a thing, and anecdotes aren't data, but it's good enough for me.

Anything important is backed up locally and to the cloud. Everything else is merely annoying to have to download again.

[+] myrond|5 years ago|reply
I disagree.

With co-located boxes and drop-shipped drive replacements, the time between a FAULT and the resilver event can be multiple days. Even though a resilver onto the one remaining disk of a mirror vdev will go faster than a raidz2 (or higher) rebuild, mirrors will increase the risk of data loss irrespective of resilver times because of that drop-ship replacement window.

3TB resilver on my last mechanical drive failure took 6 hours 30 minutes. Plus an additional 3 days for the drive to arrive.

With mirror vdev setups you also lose significantly more space. If you argue the speed is worth it, then I would instead invest the money you saved by going with raidz2 in an NVMe cache and SLOG.

Users won't notice the resilver event at all with a significant amount of memory and an NVMe cache + NVMe SLOG, tuned with a high /sys/module/zfs/parameters/zfs_dirty_data_max and a larger-than-default /sys/module/zfs/parameters/zfs_txg_timeout.

[+] joekrill|5 years ago|reply
I've been using ZFS for a few years now and it's just been amazing. Storage is so cheap these days that it's just much simpler and more straightforward to use mirrored vdevs. If a single drive fails, the pool is still completely usable, and all I have to do is swap out the bad drive (when I get around to it, no hurry usually) and resilver. Resilvering can take a while, but everything is still completely usable while it's happening, so it just runs in the background and I don't even notice.

I just upgraded a pool of 4 drives from 4TB each to 10TB - I'd never done anything like that before and was rather nervous, but it was just so simple with the mirrored setup - just swap out each drive and resilver, one-by-one.

[+] angry_octet|5 years ago|reply
I would like to see an update of this for SSDs. The devices are black boxes of mystery, quite possibly prone to failing nearly simultaneously (at least in terms of being unable to write). Early on we left 20% of capacity unformatted to allow more wear leveling, but running them in RAID1 seemed both necessary and likely to cause synchronous errors. RAIDZ1 over three disks might be okay. Obviously huge R/W bandwidth, so resilvering doesn't take long.

Also, it isn't mentioned, but having the ZIL on battery backed flash can give huge improvements in IOPS to anything, and is far more valuable than extra TB or spindles.

[+] npteljes|5 years ago|reply
I had my pitchfork out because I really like my raidz1 setup, but the arguments for mirrors are very good. When I set my pool up, I chose raidz1 because, compared to mirroring, I still get 37.5% more space, which is huge if the budget is constrained. But if you feel paranoid and raidz3 begins to sound good, that 12.5% extra space doesn't seem worth the extra complexity over a much more straightforward mirror.
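The 37.5% and 12.5% figures work out if we assume an 8-drive pool and count percentage points of raw capacity (my reading, sketched below):

```python
# Hypothetical 8-drive pool; fractions are usable/raw capacity
n = 8
mirror_frac = 0.5             # 2-way mirrors
raidz1_frac = (n - 1) / n     # 0.875
raidz3_frac = (n - 3) / n     # 0.625

# Extra usable space vs mirrors, in percentage points of raw capacity:
print((raidz1_frac - mirror_frac) * 100)   # 37.5
print((raidz3_frac - mirror_frac) * 100)   # 12.5
```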
[+] tw04|5 years ago|reply
There's no planet on which I would trust a mirror with 10TB+ SATA drives. There's a reason the major storage vendors already have, or are working on, 3-disk parity for large NL-SAS/SATA drives.

Give me RAID-Z2 or RAID-Z3 with dRAID all day long (although I wouldn't deploy dRAID quite yet on production workloads).

[+] gbrown_|5 years ago|reply
> don’t be greedy. 50% storage efficiency is plenty.

LOL.

[+] geerlingguy|5 years ago|reply
Every time I make this argument (whether regular RAID or ZFS), everyone pulls out their pitchforks and tells me about how they have run RAID 5 or 6 for years and never had a problem, plus "I can have two drives fail! You would lose your array if the wrong two drives fail!"

But drive failure is much more likely during an operation like a resilver/resync, and I wonder if the majority of people espousing riskier RAID setups (especially single parity) are those who just want hundreds of TB for a growing media collection.

I know for my critical data, I don't trust parity. Plus I can't afford a days- or weeks-long resilver operation; I need a working storage array that doesn't suffer drastically in performance if one drive goes bad.

[+] seized|5 years ago|reply
People act like resilvers on parity are vastly different than on mirrors. They are not; it's just reads and writes, granted in a different pattern. My RAIDZ2 resilvers at about what I expect the old drives in it to write at, ~100MB/sec. In a mirrored setup it all depends on the one drive doing the reading surviving the process, and that drive is most likely the same age as the one that just failed.

A resilver isn't more likely to kill an existing drive than a scrub is. And the general advice is to do those semi-regularly.

A blanket statement equating parity RAID with "more risky" is absurd and leads people to think they need a massive pool of mirrors just for their home files. ZFS has numerous options to prioritize different I/O higher or lower (priority to resilvers or to end-user data).

[+] nix23|5 years ago|reply
I then tell a real horror story from my past: not so long ago, an EVA (a SAN from HP) with hardware RAID just started to freak out out of nowhere (uptime about 200 days, no firmware update beforehand). The story goes on, but the lesson: no more HW RAID/proprietary stuff for me, EVER again.
[+] Cyph0n|5 years ago|reply
Critical data for the average user typically does not require much space, so the cost of storage overhead in a mirrored ZFS pool is not too bad.

RAIDZ is much more cost efficient when you need to reliably store large content that isn’t really critical. As you’ve noted, media is one good example of this.

In my case, I’m using a RAIDZ1 vdev of 3 disks, with the understanding that things could go south during a resilver.

[+] nix23|5 years ago|reply
For big installations I do striped (sometimes mirrored) vdevs (2 disks) and on top of them raidz2/3. Works fast as hell and is reliable.