
Bug hunting in Btrfs

243 points | todsacerdoti | 2 years ago | tavianator.com

204 comments

[+] JonChesterfield|2 years ago|reply
The visualisation of the data race in this post is superb. Worth reading just for that.

Handrolled concurrency using atomic integers in C, without a proof system or state machine model to support it. It seems likely that this won't be their only race condition.

[+] re|2 years ago|reply
The animations also stood out to me. I took a look at the source to see if they used a library or were written completely by hand and was surprised to see that they were entirely CSS-based, no JavaScript required (although the "metadata" overwrite animation is glitched if you disable JS, for some reason not immediately apparent to me).
[+] russell_sears|2 years ago|reply
This looks like the sort of bug I'd write back when I used mutexes to write I/O routines. These days, I'd use a lock-free state machine to encode something like this:

   NOT_IN_CACHE -> READING -> IN_CACHE
(the real system would need states for cache eviction, and possibly page mutation).

Readers that encounter the READING state would insert a completion handler into a queue, and readers transitioning out of the READING state would wake up all the completion handlers in the queue.
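A minimal sketch of that state machine as a single atomic (Rust, matching the library linked below; the names and the omitted completion queue are illustrative, and real code would also need failure and eviction transitions):

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// Cache-slot states from the description above.
const NOT_IN_CACHE: u8 = 0;
const READING: u8 = 1;
const IN_CACHE: u8 = 2;

pub struct PageSlot {
    state: AtomicU8,
    // Real code: a queue of completion handlers lives here, and must be
    // manipulated atomically with respect to `state`.
}

impl PageSlot {
    pub fn new() -> Self {
        PageSlot { state: AtomicU8::new(NOT_IN_CACHE) }
    }

    /// Returns true if the caller won the race and must perform the I/O.
    /// Losers would enqueue a completion handler and wait.
    pub fn try_begin_read(&self) -> bool {
        self.state
            .compare_exchange(NOT_IN_CACHE, READING, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }

    /// Called by the I/O winner once the page is in cache; real code
    /// would drain the completion-handler queue here.
    pub fn finish_read(&self) {
        self.state.store(IN_CACHE, Ordering::Release);
    }

    pub fn is_cached(&self) -> bool {
        self.state.load(Ordering::Acquire) == IN_CACHE
    }
}
```

Because the state transition is a single compare-and-swap, exactly one reader can ever move the slot from NOT_IN_CACHE to READING, which is the interlock whose correctness the proof system checks.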

I've been working on an open source library and simple (manual) proof system that makes it easy to verify that the queue manipulation and the state machine manipulation are atomic with respect to each other:

https://docs.rs/atomic-try-update/0.0.2/atomic_try_update/

The higher level invariants are fairly obvious once you show that the interlocks are correct, and showing the interlocks are correct is just a matter of a quick skim of the function bodies that implement the interlocks for a given data type.

I've been looking for good write ups of these techniques, but haven't found any.

[+] zogomoox|2 years ago|reply
That not-invented-here locking mechanism was a big shock to me. I'd be very interested to know the rationale behind it: are standard locking primitives somehow not available in file system code?
[+] Sakos|2 years ago|reply
Anybody know how the visualisation was done?
[+] n8henrie|2 years ago|reply
I've been overall very happy with:

- Arch on BTRFS RAID1 root, across 3 dissimilar NVMe drives (about 7 years, 3 drive replacements for hardware failure; I don't think ZFS supports this configuration)

- numerous low-power systems (like Pi 3s) on BTRFS root (also going on 7 years for several of these; lighter on resources than ZFS)

- Asahi NixOS on BTRFS root (the kernel doesn't support ZFS)

My NAS and several larger datasets are on ZFS based on reputation alone, but honestly I've had more data loss scares with ZFS than BTRFS (drives that have disappeared for no reason then reappeared hours later, soft locked and unable to unmount indefinitely, several unfortunate user-error issues with automounting datasets overlaying necessary top-level directories and preventing successful boots), and I find the BTRFS tooling more intuitive.

For my hobbyist-level homelab needs, I'd say I'm overall pretty happy with BTRFS. The only issue I've never been able to resolve is lockups when using quotas -- another reason I stick to ZFS for my spinning-rust storage drives.

Oh, and ZFS's ability to expose a zvol that can be formatted with a foreign filesystem (is that the right word?) lets me `btrfs send` backups to ZFS, which is nice!
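A sketch of that setup, for anyone curious (pool, volume, and mount-point names here are hypothetical; the size is arbitrary):

```shell
# Create a 100G zvol and format it as btrfs
zfs create -V 100G tank/btrfs-backups
mkfs.btrfs /dev/zvol/tank/btrfs-backups
mount /dev/zvol/tank/btrfs-backups /backups

# Now btrfs snapshots can be received onto ZFS-backed storage
btrfs subvolume snapshot -r /home /home/.snap
btrfs send /home/.snap | btrfs receive /backups
```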

[+] frankjr|1 year ago|reply
> about 7 years, 3 drive replacements for hardware failure

You've had 3 SSD failures in just 7 years? Any more details about those disks?

[+] cies|2 years ago|reply
I had my /home on a subvolume (great as the sub and super share the same space).

When I wanted to reinstall I naively thought I could format the root super volume and keep /home subvolume -- but this was impossible: I had to format /home as well according to the OpenSUSE Tumbleweed installer.

Major problem for me. I now have separate root (btrfs) and home (ext4) partitions.

[+] vetinari|2 years ago|reply
That's how subvolumes work; not just on btrfs, but on zfs as well.

The non-footgun way is to have several subvolumes (@root, @boot, @home for btrfs; rpool/ROOT/system_instance and rpool/USERDATA for zfs), and nothing else in the top-level volume itself. Then, to reinstall, you either wipe the system subvolume and create a new one with the new system, or keep the old system subvolume and create a new one alongside it.
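For btrfs, the layout described above looks roughly like this (device and subvolume names are illustrative; @root/@home is just a common convention):

```shell
# Mount the top-level volume (subvolid=5) rather than any subvolume
mount -o subvolid=5 /dev/sdX1 /mnt

# Create the system and user-data subvolumes; the top level stays empty
btrfs subvolume create /mnt/@root
btrfs subvolume create /mnt/@home

# On reinstall: keep @home, replace only the system subvolume
mv /mnt/@root /mnt/@root.old          # or: btrfs subvolume delete /mnt/@root
btrfs subvolume create /mnt/@root

# fstab then mounts each subvolume explicitly, e.g.:
# /dev/sdX1  /      btrfs  subvol=@root  0 0
# /dev/sdX1  /home  btrfs  subvol=@home  0 0
```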

[+] stryan|2 years ago|reply
You can do it but it's not a very happy path. Easiest way is probably to map the old home subvolume into a different path and either re-label the subvolumes once you're installed or just copy everything over.

A separate BTRFS root with an ext4 home partition is either the default filesystem layout now if you're not doing FDE, or the second recommended one.

[+] londons_explore|2 years ago|reply
One btrfs bug which is 100% reproducible:

* Start with an ext3 filesystem 70% full.

* Convert to btrfs using btrfs-convert.

* Delete the ext3_saved snapshot of the original filesystem as recommended by the convert utility.

* Enable compression (-o compress) and defrag the filesystem as recommended by the man page for how to compress all existing data.

It fails with out of disk space, leaving a filesystem which isn't repairable - deleting files will not free any space.
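The steps above, as concrete commands (the device name and compression level are placeholders; the rollback snapshot name varies by btrfs-convert version):

```shell
# 1. Convert the unmounted ext3 filesystem in place
btrfs-convert /dev/sdX1

# 2. Mount it and delete the rollback snapshot, as the tool recommends
#    (named ext2_saved or ext3_saved depending on version)
mount /dev/sdX1 /mnt
btrfs subvolume delete /mnt/ext2_saved

# 3. Remount with compression and defragment to compress existing data,
#    per the man page -- reportedly this is where it hits ENOSPC
mount -o remount,compress=zstd /mnt
btrfs filesystem defragment -r -czstd /mnt
```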

The fact such a bug seems to have existed for years, with such a basic following of the man pages for a common use case (migration to btrfs to make use of its compression abilities to get more free space), tells me that it isn't yet ready for primetime.

[+] zaggynl|2 years ago|reply
For what it's worth, I'm happy with using Btrfs on OpenSUSE Tumbleweed and the provided tooling like snapper and restoring Btrfs snapshots from grub, saved me a few times.

SSD used: Samsung SSD 970 PRO 1T, same installation since 2020-05-02.

[+] applied_heat|2 years ago|reply
Btrfs seems to work perfectly well in Synology NAS devices. It must be some combination of options or features beyond what Synology exposes that garners the bad reputation.
[+] jcalvinowens|2 years ago|reply
This bug is very very rare in practice: all my dev and testing machines run btrfs, and I haven't hit it once in 100+ machine-hours of running on 6.8-rc.

The actual patch is buried at the end of the article: https://lore.kernel.org/linux-btrfs/1ca6e688950ee82b1526bb30...

[+] lxgr|2 years ago|reply
100 error-free machine hours isn’t exactly evidence of anything when it comes to FS bugs, though.
[+] tavianator|2 years ago|reply
Do they run btrfs on top of dm-crypt? I suspect it's impossible to reproduce on a regular block device.
[+] Daunk|2 years ago|reply
I recently tried (for the first time) Btrfs on my low-end laptop (no snapshots), and I was surprised to see that the laptop ran even worse than it usually does! Turns out there was something like a "btrfs-cleaner" (or similar) process running in the background, eating up almost all the CPU at all times. After about 2 days I jumped over to ext4 and everything ran just fine.
[+] mritzmann|2 years ago|reply
Had a similar problem but can't remember the Btrfs process. Anyway, after I switched off Btrfs quotas, everything was fine.
[+] nolist_policy|2 years ago|reply
What was your workload? Do you have quotas enabled? Compression? Are you running OpenSuse by any chance?
[+] mustache_kimono|2 years ago|reply
> I recently tried Btrfs on my low-end laptop (no snapshots)

Do snapshots degrade the performance of btrfs?

[+] eru|2 years ago|reply
Interesting that the 'cleaner' doesn't run as nice?
[+] e145bc455f1|2 years ago|reply
Just last week my btrfs filesystem got irrecoverably corrupted. This is about the fourth time it has happened to me in the last 10 years. Do not use it on consumer-grade hardware. Compared to this, ext4 is rock solid. It even survived me accidentally passing the currently running host's hard disk to a VM guest, which booted from it.
[+] londons_explore|2 years ago|reply
> It was even able to survive me accidentally passing the currently running host's hard disk to a VM guest, which booted from it.

I have also done this, and was also happy that the only corruption was to a handful of unimportant log files. Part of a robust filesystem is that when the user does something stupid, the blast radius is small.

Other less-smart filesystems could easily have said "root of btree version mismatch, deleting bad btree node, deleting a bunch of now unused btree nodes, your filesystem is now empty, have a nice day".

[+] gmokki|2 years ago|reply
I have had the same btrfs filesystem in use for 15+ years, with 6 disks of various sizes, and all hardware components have been changed at least once during the filesystem's lifetime.

Worst corruption was when one DIMM started corrupting data. As a result computer kept crashing and eventually refused to mount because of btrfs checksum mismatches.

Fix was to buy new hardware, then run btrfs filesystem repairs, which failed at some point but at least got the filesystem running as long as I did not touch the most corrupted locations. Luckily it was RAID1, so most checksums had a correct value on another disk. Unfortunately, the checksum tree had corruption on both copies in two locations. I had to open the raw disks with a hex editor and change the offending bytes to the correct values, after which the filesystem has been running smoothly again for 5 years.

And to find the location to modify on the disks I built a custom kernel that printed the expected value and absolute disk position when it detected the specific corruption. Plus had to ask a friend to double check my changes since I did not have any backups.

[+] londons_explore|2 years ago|reply
> last week my btrfs filesystem got irrecoverably corrupted.

This is really two bugs: (1) the filesystem got corrupted; (2) tooling didn't exist to automatically scan through the disk data structures and recover as much of your drive as possible from whatever fragments of metadata and data were left.

For 2, it should happen by default. Most users don't want a 'disk is corrupt, refusing to mount' error. Most users want any errors to auto-correct if possible and get on with their day. Keep a recovery logfile with all the info needed to reverse any repairs for that small percentage of users who want to use a hex editor to dive into data corruption by hand.

[+] nolist_policy|2 years ago|reply
Best send a bugreport to the btrfs mailing list at [email protected].

If possible include the last kernel log entries before it corrupted. Include kernel version, drive model and drive firmware version.

[+] thequux|2 years ago|reply
Huh. I've been running btrfs on a number of systems for probably 12 years at this point. One array in particular was 12TiB of raw storage used for storing VM images in heavy use. Each disk had ~9 years of spindle-on time before I happened to look closely at the SMART output and realized that they were all ST3000DM001's and promptly swapped them all out. The only issue I've ever run into is running out of metadata chunks and needing to rebalance, and that was just once.
[+] matheusmoreira|2 years ago|reply
> Compared to this, ext4 is rock solid.

Ext4 is the most reliable file system I have ever used. Just works and has never failed on me, not even once. No idea why btrfs can't match its quality despite over a decade of development.

[+] riku_iki|2 years ago|reply
how do you know it was issue with FS and not actual hardware/disk?..
[+] champtar|2 years ago|reply
Bad DIMMs are a thing, even more so on consumer hardware that lacks ECC. I recommend you run memtest.
[+] west0n|2 years ago|reply
The vast majority of databases currently recommend using XFS.
[+] londons_explore|2 years ago|reply
All modern databases do large block streaming appending writes, and small random reads, usually of just a handful of files.

It ought to be easy to design a filesystem which can have pretty much zero overhead for those two operations. I'm kinda disappointed that every filesystem doesn't perform identically for the database workload.

I totally understand that different file systems would do different tradeoffs affecting directory listing, tiny file creation/deletion, traversing deep directory trees, etc. But random reads of a huge file ought to perform near identically to the underlying storage medium.

[+] west0n|2 years ago|reply
I'm very curious if there are any databases running in production environments on Btrfs or ZFS.
[+] apitman|2 years ago|reply
I really wish you could use custom filesystems such as btrfs with WSL2. I don't think there's currently any way to do snapshotting, which means you can never be sure a backup taken within WSL isn't corrupt.
[+] aseipp|2 years ago|reply
Hell, even just being able to use XFS would be an improvement, because ext4 has painful degradation scenarios when you hit cases like exhausting the inode count.

(Somewhat related, but there has been a WIP 6.1 kernel for WSL2 "in preview" for a while now... I wonder why it hasn't become the default considering both it and 5.12 are LTS... For filesystems like btrfs I often want a newer kernel to pick up every bugfix.)

[+] xyzzy_plugh|2 years ago|reply
I'm so sad. I was in the btrfs corner for over a decade, and it saddens me to say that ZFS has won. But it has.

And ZFS is actually good. I'm happy with it. I don't think about it. I've moved on.

Sorry, btrfs, but I don't think it's ever going to work out between us. Maybe in a different life.

[+] streb-lo|2 years ago|reply
As someone who uses a rolling release, I use btrfs because I don't want to deal with keeping ZFS up to date.

It's been really good for me. And btrbk is the best backup solution I've had on Linux, btrfs send/receive is a lot faster than rsync even when sending non-incremental snapshots.
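For reference, an incremental send/receive cycle looks roughly like this (paths are hypothetical; btrbk automates essentially this loop, including snapshot rotation):

```shell
# Take a new read-only snapshot of the subvolume to back up
btrfs subvolume snapshot -r /home /snapshots/home.new

# Incremental send against the previous snapshot (omit -p on the first run),
# piped into receive on the backup filesystem
btrfs send -p /snapshots/home.prev /snapshots/home.new \
  | btrfs receive /mnt/backup/

# Rotate: the new snapshot becomes the parent for the next run
btrfs subvolume delete /snapshots/home.prev
mv /snapshots/home.new /snapshots/home.prev
```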

[+] kccqzy|2 years ago|reply
Same here: I use a rolling release and btrfs. Personally I really enjoy btrfs's snapshot feature. Most of the time when I need backups it's not because of a hardware failure but because of a fat finger mistake where I rm'ed a file I need. Periodic snapshots completely solved that problem for me.

(Of course, backing up to another disk is absolutely still needed, but you probably need it less than you think.)

[+] yjftsjthsd-h|2 years ago|reply
Depends on the rolling release; some distros specifically provide a kernel package that is still rolling, but is also always the latest version compatible with ZFS.
[+] matja|2 years ago|reply
> I don't want to deal with keeping ZFS up to date

That's what DKMS is for, which most distros use for ZFS. Install and forget.

[+] TZubiri|2 years ago|reply
reminder that btrfs stands for butter filesystem and not better filesystem