
Battle testing data integrity verification with ZFS and Btrfs

181 points | iio7 | 6 years ago | unixsheikh.com

98 comments

[+] kissgyorgy|6 years ago|reply
Last year, when I wanted to build a 10-disk ZFS server in RAIDZ2 and researched the data-integrity and fault-tolerance aspects, I found this video of guys literally inducing hardware failure by running current into the motherboard attached to a RAIDZ2 array:

https://www.youtube.com/watch?v=vxFNBZIAClc

and they could not produce any errors. This was pretty brutal. When I saw this video, I decided I never want to use any other filesystem than ZFS ever.

[+] the8472|6 years ago|reply
I think irradiating the system would be a better test, since it would induce random bitflips anywhere in the system while it keeps running and reading/writing data, instead of inducing a massive fault that almost immediately stops all IO operations.

Overclocking some components might work too.

[+] mirceal|6 years ago|reply
this is cool, but I have to wonder how much of this is ZFS and how much is the hardware. don’t get me wrong: it’s impressive, but to make the claims about ZFS you need to have a scientific approach that would involve a control group, various types of motherboards, memories, power supplies, etc + thorough reproducibility of the conditions.
[+] tjoff|6 years ago|reply
Did they even perform any file (write) operations while testing? If not, that was a waste of time... FAT32 would have performed just as well.
[+] techslave|6 years ago|reply
that is just a sensational science video. there's no evidence given that normal raid5 wouldn't have survived exactly as well. he should have mentioned what happened on reboot after running a scrub.

well not exactly true. we know that the raid5 write hole won’t tolerate random power off / memory removal etc.

as someone else said, x ray test would have been better.

[+] bronco21016|6 years ago|reply
I’m really confused about what the author is trying to accomplish here. He sets up a scenario that Btrfs openly says is known to cause issues. He doesn’t necessarily come to the conclusion that one needs to trash Btrfs, but I’m not sure why you would go through this exercise prior to deployment if the exercise is pitting an undeployable configuration against something that has already been heavily battle tested. Until Btrfs development marks the RAID5/6 write hole issue fixed, this test is pointless.

I’m a little disheartened to read all of the negative comments about Btrfs in this thread as well. I’ve spent a ton of time researching Btrfs in RAID 10 for deployment on my home lab (99% Linux environment) for when it’s time to expand storage and from everything I read it seemed like it was going to be a good idea. Now I’m back to wondering if I should research ZFS again.

[+] viraptor|6 years ago|reply
I took it as: in an unsupported case this is what happens and these are the errors you get. It's useful information and makes the description easier to Google when you run across it.
[+] aeroevan|6 years ago|reply
I have been using btrfs in RAID 10 at home for several years with no issues. From what it sounds like, he didn't really have any issues with RAID 5, but btrfs scrub took longer than the ZFS equivalent.

I'm even using btrfs with zstd compression on my laptop since it only has a small ssd (64 GB) and it makes it a lot more usable.
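For anyone curious, enabling that setup is a one-line mount option; a minimal sketch (device, mount point, and UUID are placeholders, and the `run` helper only prints the commands instead of executing them):

```shell
# Dry-run helper: print each command instead of executing it.
run() { printf '+ %s\n' "$*"; }

# Mount an existing btrfs filesystem with zstd compression
# (level 3 is the kernel default; higher levels trade CPU for ratio).
run mount -o compress=zstd:3 /dev/sda2 /mnt

# Persist it via /etc/fstab (placeholder UUID):
echo 'UUID=xxxx-xxxx  /  btrfs  compress=zstd:3  0 0'

# Compress already-written files in place after enabling the option:
run btrfs filesystem defragment -r -czstd /mnt
```

Note that only newly written data gets compressed unless you defragment with `-czstd`.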

[+] cmurf|6 years ago|reply
I think you're best off using what you're familiar with, and keeping multiple backup copies of the important stuff. I've used Btrfs for 9-10 years and haven't ever had unplanned data loss. Planned, due to testing, including intentional sabotage, for bug reporting, yes. And in the vast majority of those cases, I could still get data off by mounting read-only. I use it for sysroot on all of my Linux computers, even the RPi, and for primary network storage and three backup copies. A fourth copy is ZFS.

If you've had a negative experience, it can leave a bad taste in the mouth. People do this with food too, "I once got violently sick off chicken soup, I'll never eat it again." I wouldn't be surprised if there's an xkcd to the effect of how filesystem data loss is like food poisoning.

There is a gotcha with Btrfs raid10: it does not really scale like a strict raid 1+0. In the traditional case, you specify drive pairs to be mirrors, sometimes with drives on different controllers, so if a whole controller dies you still have all the other mirrors on the other controller and the array lives on; you just can't lose two drives of any mirrored pair. Btrfs raid10 is not raid at the block device level, it's done at the block group level. The only guarantee you have with a Btrfs raid10 of any size is that it can survive the loss of one drive.
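The per-block-group allocation described above can be observed directly; a sketch with hypothetical devices (the `run` helper only prints the commands):

```shell
# Dry-run helper: print each command instead of executing it.
run() { printf '+ %s\n' "$*"; }

# Create a four-device btrfs raid10; striping and mirroring happen
# per block group, not across fixed drive pairs.
run mkfs.btrfs -d raid10 -m raid10 /dev/sdb /dev/sdc /dev/sdd /dev/sde
run mount /dev/sdb /mnt

# Show how data and metadata block groups are spread over the devices:
run btrfs filesystem usage /mnt
```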

[+] funkaster|6 years ago|reply
Really good article. I also agree with the author: ZFS is light years ahead of Btrfs. I currently use it in my home backup server (used to be a rpi with 2 hdd, moved to rockpro64 with 4 hdd and a sata controller) and it's just great: super easy to maintain and fix, even swapping disks is not a huge endeavor.
[+] h1d|6 years ago|reply
Is btrfs still actively developed? Red Hat ditched it, and I never heard that btrfs got widely popular, so I take it as abandonware after all these years of slow progress.
[+] FullyFunctional|6 years ago|reply
Reading through most comments and I still think a few points are worth mentioning:

* btrfs has a few features that are Really Nice and missing in ZFS: the ability to have a filesystem of mixed drives and to add and remove drives at will, with rebalancing. With ZFS, growing a filesystem is painful and shrinking impossible (? still?). There has been work on it recently though.

* ZFS has an IMO MUCH cleaner design and concepts (pool & filesystems), mirrored by a much cleaner and clearer set of commands. Working with btrfs still feels like an unfinished hack. As human error is still a major concern, this is not a trivial issue.

I _have_ lost data to MD, had scary issues with BTRFS, but never had issues with ZFS in 8+ years. (The fact that FreeNAS is FreeBSD based which I'm less inclined to mess with also means that I mostly leave my appliance alone.)

[+] rdc12|6 years ago|reply
Device removal was added in the recent ZoL 0.8 release, so with that you can remove a vdev that is either a sole drive or a mirror. Currently it can't be used to remove a RAIDZx vdev though. It does carry a memory overhead for a remap table, but this shrinks as old allocations are retired back to the pool.

And the first alpha for RAIDZ expansion became available a week or two ago. (For going from say a 6 disk RAIDZ2 vdev, to a 7+ disk RAIDZ2 vdev). Just in case anyone decides to play with this feature, the on disk format for this feature is not stable yet, only use it on test pools.
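For reference, top-level vdev removal is the `zpool remove` subcommand; a sketch with a hypothetical pool layout (the `run` helper only prints the commands, and the expansion syntax is from the then-alpha branch, so it may differ):

```shell
# Dry-run helper: print each command instead of executing it.
run() { printf '+ %s\n' "$*"; }

# Remove a top-level mirror vdev; ZFS evacuates its data onto the
# remaining vdevs and keeps an in-memory remap table afterwards.
run zpool remove tank mirror-1
run zpool status tank   # shows evacuation/removal progress

# RAIDZ expansion (alpha, unstable on-disk format): attach a new
# disk to an existing raidz vdev, e.g. 6-wide RAIDZ2 -> 7-wide.
run zpool attach tank raidz2-0 /dev/sdh
```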

[+] agapon|6 years ago|reply
> With ZFS growing a file system is painful and shrinking impossible

You probably mean something else rather than a file system. In ZFS you do not need to grow or shrink file systems at all.

[+] pnutjam|6 years ago|reply
I've been using btrfs for years with no problems. I currently use a btrfs volume for my backup drive. It mounts, accepts the backup, takes a snapshot and unmounts. Has anyone seen how btrfs handles send/receive? It's pretty awesome: like rsync, but it only sends changed blocks.
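That mount/backup/snapshot/unmount cycle is easy to script; a minimal sketch (label and paths are hypothetical, and the `run` helper only prints the commands):

```shell
# Dry-run helper: print each command instead of executing it.
run() { printf '+ %s\n' "$*"; }

BACKUP_DEV=/dev/disk/by-label/backup   # hypothetical label
MNT=/mnt/backup

run mount "$BACKUP_DEV" "$MNT"
run rsync -a --delete /home/ "$MNT/home/"
# Take a read-only snapshot named after the date:
run btrfs subvolume snapshot -r "$MNT/home" "$MNT/home@$(date +%F)"
run umount "$MNT"
```

With two read-only snapshots, `btrfs send -p old new` can then ship only the blocks changed between them to a second drive.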
[+] bakul|6 years ago|reply
I first started using zfs in 2005 when my hardware raid failed. Since then I’ve moved the disks to a new server in 2009 and replaced all the disks twice (one at a time, resilvering each). Finally I built a new server this year. This time I’m using zfs send/recv to copy data to the new disks. The old server is still working 10 years later & its latest disks have been in use 24x7 for over 5 years now. Zpool scrub on the old server takes days now (compared to one hour on the copied zpool on the new server).
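A send/recv migration like that looks roughly like this (pool and snapshot names are hypothetical; the `run` helper only prints the commands):

```shell
# Dry-run helper: print each command instead of executing it.
run() { printf '+ %s\n' "$*"; }

# Snapshot the whole pool recursively, then replicate it, preserving
# datasets and properties, onto the new pool:
run zfs snapshot -r oldtank@migrate
run 'zfs send -R oldtank@migrate | zfs recv -F newtank'

# Later, send only what changed since the first snapshot:
run zfs snapshot -r oldtank@migrate2
run 'zfs send -R -i @migrate oldtank@migrate2 | zfs recv newtank'
```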

Even back in 2009 I heard some Linux enthusiasts tell me how btrfs was going to be better than zfs!

[+] yjftsjthsd-h|6 years ago|reply
> Even back in 2009 I heard some Linux enthusiasts tell me how btrfs was going to be better than zfs!

What's sad is that it should have been; the CDDL situation is really unfortunate. Honestly, even if BTRFS performance were worse, it would be worth it in order to have a fully-supported mainlined FS... but instead its reputation is for data loss, so it's dead (yes, I know it works if you're careful, but that's a terrible quality in a filesystem).

[+] O_H_E|6 years ago|reply
Related: take a look at bcachefs

https://en.m.wikipedia.org/wiki/Bcachefs

[+] kzrdude|6 years ago|reply
Bcachefs is a good idea, will be interesting when it is stable.
[+] zaarn|6 years ago|reply
I run bcachefs on all my personal machines and it's been quite a joy. As fast as ext4 with most of the features of btrfs. I can't wait until it's mainlined.
[+] aidenn0|6 years ago|reply
ZFS is the only filesystem I've ever had completely crap out on me without any indication of a hardware issue[1]. I don't currently recall the error message I got, but asking around about it on the various ZFS IRC channels, the answer was invariably "I hope you have backups". This was probably a fluke, but it did sour me a bit.

1: Btrfs refused to mount at one point due to a bug; the helpful folks on #btrfs walked me through the process of downgrading my Linux kernel to get it into a working state again. At this point I switched away from btrfs.

[+] myrandomcomment|6 years ago|reply
I use a FreeNAS box at home, from iXsystems. Good stuff. I paid a bit more for ECC. Why? My family's history, all the pictures, is kept on the NAS (backed up to Backblaze) and a local drive. In the last 8 years, I have encountered maybe 11 issues in Photos where a picture was screwed up. Each time, from the NAS copy, snapshots, etc., I was able to recover the correct photo. The cost difference over time is a few cups of coffee. It is worth it. If you can afford the NAS, I do not understand how you cannot afford the ECC.
[+] tomxor|6 years ago|reply
> Myth: ZFS requires tons of memory [...] The only situation in which ZFS requires lots of memory is if you specifically use de-duplication

It's also totally worth tons of memory when you use that feature with intent. If you use dedup in combination with automated snapshots you get the most space efficient, fast and reliable incremental backup solution in existence - yes it will consume your whole server, that's the cost (works best separately as a backup server).
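A sketch of that dedup-plus-snapshots setup (pool and dataset names are hypothetical; the `run` helper only prints the commands):

```shell
# Dry-run helper: print each command instead of executing it.
run() { printf '+ %s\n' "$*"; }

# Enable dedup only on the backup dataset, never pool-wide; the
# dedup table (DDT) must fit in RAM/ARC or writes get very slow.
run zfs create tank/backups
run zfs set dedup=on tank/backups
run zfs set compression=lz4 tank/backups

# Automated snapshots then cost almost nothing on top:
run zfs snapshot "tank/backups@$(date +%F-%H%M)"
```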

[+] vasili111|6 years ago|reply
What is your personal experience with Btrfs?
[+] opless|6 years ago|reply
BTRFS catastrophically failed on me twice.

I have ZFS on another box that failed badly but lost no data... That was with bad RAM and a motherboard that was on the "do not use" list for building ZFS NAS boxes. It always recovered from the errors.

I'm currently using it in RAID mode to hold large data sets and CCTV footage for my homelab, on three drives that have SMART warnings for age, without any issues at all for the past two and a half years and two Ubuntu upgrades.

[+] JustFinishedBSG|6 years ago|reply
BTRFS is the only filesystem that has failed and been rendered unrecoverable in all my life (after a hard reset).

That's an anecdote sure but that's enough to never use it ever again for me.

In my opinion BTRFS is rotten at the core; I'm more interested in bcachefs's future.

[+] maxdamantus|6 years ago|reply
I've been using btrfs as my primary filesystem on all of my computers (other than my phone) since 2013.

Initially I had performance issues due to the default block size back then (4 KiB rather than 16 KiB; ended up just rebuilding the filesystem). There were some other issues back then regarding rebalancing and scrubbing stability, and sometimes I would have to run "btrfs-zero-log" before mounting, but I haven't had those sorts of issues for a while.

I've had multiple drive failures on my systems, and btrfs seemed to handle them as expected. I use "raid1" for metadata and "single" for data. As far as I can tell, all of the errors were due to bad blocks on physically failing drives, and btrfs was able to indicate which files were affected in all of those cases.

I've also used it to "fix" the Raspberry Pi SD card corruption issue by just running btrfs in "dup" mode—prior to that, the SD card would randomly end up with blocks being zeroed and the system would obviously start to crash, whereas btrfs just fixes the blocks up in place as it accesses the data the next time.
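The "dup" profile mentioned above is set at mkfs time or via a balance; a sketch with a hypothetical SD card partition (the `run` helper only prints the commands):

```shell
# Dry-run helper: print each command instead of executing it.
run() { printf '+ %s\n' "$*"; }

# Store two copies of both data and metadata on the single SD card,
# so a silently zeroed block can be rebuilt from its duplicate on
# the next read.
run mkfs.btrfs -m dup -d dup /dev/mmcblk0p2

# Or convert an existing single-copy filesystem in place:
run btrfs balance start -dconvert=dup -mconvert=dup /mnt
```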

[+] zlynx|6 years ago|reply
I've been using btrfs on my home NAS (a custom build) and on my Linux laptops with SSDs since 2012. I run Fedora on everything which I think helps because I've never been stuck on an old kernel with old, stupid bugs. Only the freshest bugs for me!

The NAS has had two WD Red drives fail over the years with bad blocks, not at the same time. They were detected and I replaced them with new, bigger drives. Since 2012 I have added more drives; it is now at 6 drives in RAID10: 4x6 TB and 2x4 TB.

I've had out of space errors on my laptops which I had to repair by adding more storage so I could successfully rebalance. Stealing the swap partition worked great for that.
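That add-then-rebalance rescue is a standard btrfs move; a sketch with hypothetical partitions (the `run` helper only prints the commands):

```shell
# Dry-run helper: print each command instead of executing it.
run() { printf '+ %s\n' "$*"; }

# Add a spare device (here, a freed swap partition) so the balance
# has room to rewrite block groups, then compact mostly-empty ones:
run btrfs device add /dev/sda3 /mnt
run btrfs balance start -dusage=50 /mnt

# Once space is reclaimed, the temporary device can be removed:
run btrfs device remove /dev/sda3 /mnt
```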

Never lost any data or had any corrupt files.

Also, I always build with quality Gold standard PSUs, ECC RAM, and run my systems always plugged into an APC UPS. So I've never tried to recover a system that crashed after a lightning storm, built with WD Green drives in external USB enclosures plugged into a $2 power-strip and an HP "desktop" built out of an old laptop motherboard and the cheapest PSU HP could dig out of the trash pile.

[+] Cogitri|6 years ago|reply
Catastrophic: my BTRFS died twice on me after a hard reset, and the recovery tools didn't help. Luckily I had backups around, so switching to ZFS 0.8 with encryption was a breeze.
[+] mikedilger|6 years ago|reply
I've had one catastrophic failure with btrfs long ago, and have rarely used it since.

I've had one failure with ZFS that required developer help (space map corruption) but I got all my data back.

[+] rhn_mk1|6 years ago|reply
NAS usage, over mdadm RAID1, over about 10 years (and 1 or 2 drive failures, 1 or 2 bad RAM dies) I didn't lose a file that I could definitely attribute to btrfs. I did lose a piece of a file I keep on it, but I cannot conclude it didn't arrive there damaged.

I used to hit out of space errors on rebalancing at high usage levels, but since around kernel 4.0 the only time I hit one was when I altered data duplication settings.

[+] josteink|6 years ago|reply
For some reason, my laptop with a single-volume btrfs root fs would regularly have btrfs degrade into read-only mode, rendering my laptop and the running system inoperable without a hard reset.

I’ve had no such issues with ZFS, not even on the exact same hardware and Ubuntu-release.

Needless to say, I’m not using btrfs anymore.

Edit: for perspective, I’ve never had this btrfs issue on desktops or servers. Laptops only. Might be suspend/resume related?

[+] pwg|6 years ago|reply
Catastrophic. BTRFS failed twice with a large raid array when one of the drives started developing bad sectors (two different drives at two different times, same result, entire raid array unrecoverable).

After catastrophic failure number two I decided to never run BTRFS again.

[+] Piezoid|6 years ago|reply
Honestly not bad. About 7 years of use on a personal machine, single volume. It's simpler than ZFS and allows cheap incremental backups with send/receive.

One year ago I built a RAID56 with 5 used 2TB drives from eBay. Risky move, but it has gone smoothly so far. It's only for home storage, and important stuff is backed up off site anyway. One Seagate drive died with lots of bad sectors. The replace command took really long even with the "-r" flag (don't read from the replaced drive, in theory), so I ended up unplugging the drive and rebalancing from there.
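The replacement flow described above, sketched with hypothetical device names (the `run` helper only prints the commands):

```shell
# Dry-run helper: print each command instead of executing it.
run() { printf '+ %s\n' "$*"; }

# Replace the failing drive; -r avoids reading from it unless no
# other mirror/parity copy of a block exists:
run btrfs replace start -r /dev/sdd /dev/sdf /mnt
run btrfs replace status /mnt

# Fallback when the source drive is too slow/dead: remove it and
# rebuild from the remaining redundancy instead.
run btrfs device remove /dev/sdd /mnt
```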

I have high hopes for bcachefs. We have a real need for a modern FS with tiered caches. I backed the project but I don't have the skills or time to help.

[+] lucb1e|6 years ago|reply
I thought I did things right and had set up regular scrubbing. The two external drives that were btrfs mirrors of each other got out of sync, I assume caused by a power outage (the host was a laptop and has a battery, a rare scenario I guess, so I can see how this happened). Due to not reading the man page correctly, I then messed things up further...

This was like a year ago and I still haven't cleaned it all up due to: (1) lack of a third drive to restore stuff to (I'd prefer to leave the damaged filesystems read-only), (2) a lack of time, (3) there now seems to be a bug in the kernel driver for my particular drives (perhaps it wasn't a power outage after all?), and (4) because I now live far away from the physical location of the drives.

Instead of using Btrfs to fix problems, I am now looking for the simplest possible solution. I was thinking of just using plain old ext4 and relying on one local and one off-site backup. It will be more work to manually check the state of my files when a failure happens, but with something like Restic I'm at least confident that any completed backups are sound (as well as secure: Restic is the first system I've found that is efficient while I also trust the crypto enough to back up to untrusted locations). The only open question is what to do about bit rot on the main system, since any bit rot on ext4 would just be backed up as if it were the original data. So then... maybe I'll go with Btrfs after all; using only the checksumming feature should not trigger any bugs, right? I haven't decided yet. First up is trying to fix this driver issue somewhere next week.
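A sketch of that restic-based plan (the repository path is hypothetical; the `run` helper only prints the commands):

```shell
# Dry-run helper: print each command instead of executing it.
run() { printf '+ %s\n' "$*"; }

REPO=sftp:user@offsite:/srv/restic   # hypothetical untrusted remote

# Encrypted, content-deduplicated backups:
run restic -r "$REPO" init
run restic -r "$REPO" backup /home

# Re-download and verify stored data against its checksums, which
# catches corruption on the backup side:
run restic -r "$REPO" check --read-data
```

This still doesn't detect bit rot on the live ext4 system, which is exactly the gap checksumming filesystems close.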

[+] rodgerd|6 years ago|reply
For the years I ran btrfs, it was a great way to test my backup and restore processes; it also gave me a nostalgia trip back to the mid-to-late-nineties Linux experience, where you had to carefully select kernels based on which combination of features and regressions you could live with.
[+] archi42|6 years ago|reply
My cold storage NAS with 12 old HDDs (400GB to 1TB) acts as my btrfs test env. It's all old hardware: the disks are salvaged from the company [after being slated for shredding], the system is an Athlon Dual Core 5000+ on some MoBo with 4 SATA ports; no ECC. I recently replaced the LSI 8-disk SAS controller with a 3ware 9560 12-disk variant (both as JBOD).

No problems in the past few years, though it only sees a few hours of uptime per week. It just acts as a backup for data stored elsewhere, so a crash would be annoying but not fatal - but I'm contemplating getting some new 4TB disks (now that I have a suitable controller) and putting Plex on the thing.

[+] pmlnr|6 years ago|reply
It has been running perfectly fine as a single disk/partition fs for me. I'm using it on our laptops' data partitions because of the compression support.

I wouldn't trust its RAID features though.

[+] Dylan16807|6 years ago|reply
I haven't lost any data on BTRFS, but a couple of drives set to do frequent snapshots ended up getting slower and slower, until they hung for up to many seconds at a time. And this was with only 30-40 snapshots existing at a time.

And then in another instance it got all confused and impossible to mount read-write. It kept trying to resume some kind of transaction that wouldn't complete even with dozens of hours of CPU time.

But on the plus side it deduplicates properly, without extreme overhead.

[+] riquito|6 years ago|reply
Home usage, with encryption, over 4-5 years; had some hard resets and never lost data. Nothing to complain about.
[+] unixhero|6 years ago|reply
I use BTRFS without any fear of FS crashes. BTRFS has a lightweight footprint and lets me create a raid0 over many old hard drives.

And ...zfs for secure workloads.

[+] cmurf|6 years ago|reply
This is a really good write up.

The performance differences found between ZFS and Btrfs are curious, as I've always had the reverse experience, with ZFS being slower by maybe 15%. Scrubbing on md raid does take a while: every block must be checked, since md has no idea which blocks are in use. A write-intent bitmap would at least avoid a complete resync after an unclean shutdown, though.
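The write-intent bitmap mentioned above can be added to an existing array; a sketch with a hypothetical array name (the `run` helper only prints the commands):

```shell
# Dry-run helper: print each command instead of executing it.
run() { printf '+ %s\n' "$*"; }

# Add an internal write-intent bitmap to an md array; after an
# unclean shutdown only regions marked dirty get resynced:
run mdadm --grow --bitmap=internal /dev/md0

# /proc/mdstat shows a "bitmap: ..." line once it is active:
cat /proc/mdstat 2>/dev/null || true
```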

[+] StavrosK|6 years ago|reply
I have a joke NAS and I use ZFS for the disks. Unfortunately, at some point I started having problems where deleting a file would take half a minute (each file is only around a gigabyte). I have no idea why performance is killed like this, and nobody I asked on IRC seems to know why it is happening.
[+] walrus01|6 years ago|reply
RAID5 has been excessively risky and obsolete for a long time: not enough parity, and too much risk of data loss from an unrecoverable read error during a massive multi-drive rebuild (like an eight-drive 8TB RAID5). A better test for production use would be RAIDZ2.
[+] bscphil|6 years ago|reply
I agree about RAID5, but personally I don't think RAIDZ2 really works as an alternative. Everyone should use RAID10.

The reason for this actually has nothing to do with data protection (although not having to read your entire disc set to rebuild is nice). The reason is that it's hard to figure out who RAID5/6 are actually for. The enterprise is all on RAID10 (hence no one fixing the BTRFS write hole). So you'd think it would be for enthusiasts who don't want to purchase as many discs, right? Well, in my case at least, parity disc modes are useless because they mean I would have to buy discs of exactly the same size into the indefinite future!

I started with 4TB discs, then I've added 8TB discs, and now I'm looking at newer Western Digital 10TB helium discs, which are actually much lower power than the 8TB ones. But none of those 4TB or 8TB discs have failed! So while in theory, if you stick to the drive size you start out with, you can save money with RAIDZ by using parity discs and adding a drive when you need more storage (at the cost of a decreased parity percentage), practically speaking a lot of that money is wasted, since you either have to forgo larger, more efficient drives or replace working older drives when you want to upgrade.