
The State of ZFS on Linux

201 points | ferrantim | 11 years ago | clusterhq.com

122 comments

[+] ownedthx|11 years ago|reply
At a previous job, we built a proof-of-concept Sinatra service (i.e., HTTP/RESTful service) that would, on a certain API call, clone from a specified snapshot, and also create an iscsi target to that new clone. This was on OpenIndiana initially, then some other variant of that OS as a second attempt.

The client making the HTTP request was iPXE; so, every time the machine booted, you'd get yourself a fresh clone + iSCSI target. We'd then mount that iSCSI target in iPXE, which would hand off the iSCSI target to the OS and away you'd go.

The fundamental problem we hit was a linear delay for every new clone; the delay seemed to be about 'number of clones * 0.05 seconds'. This was on extremely fast hardware. It was the ZFS clone command itself that was going too slowly.

Around 500 clones, we'd notice 10-20 second delays. The reason that hurt so badly is that, to our understanding, it wasn't safe to run ZFS or iSCSI commands in parallel; the Sinatra service was responsible for serializing all ZFS/iSCSI commands.
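
For context, the per-boot flow we serialized looked roughly like the following sketch (all pool, dataset, and client names are hypothetical, and the COMSTAR steps are from memory, so they may need adjustment):

```shell
# Hypothetical names throughout: "tank/images/golden" is the golden zvol,
# "host01" is the client requesting a boot disk.
set -e

# 1. Clone the golden image snapshot into a fresh zvol for this request.
zfs clone tank/images/golden@base tank/clones/host01

# 2. Register the clone as a COMSTAR logical unit (illumos/OpenIndiana stack).
sbdadm create-lu /dev/zvol/rdsk/tank/clones/host01

# 3. Make the LU visible to initiators (here: to all views, for simplicity).
stmfadm add-view "$(stmfadm list-lu | awk '/LU Name/ {print $3}')"
```

Every API call ran these steps back to back, which is why the per-clone delay multiplied out so painfully.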

So my question to the author:

1) Does this 'delay per clone' ring familiar to you? Does ZFS on Linux have the same issue? It was a killer for us, and I eventually found a thread that implied it would never be fixed in Solaris-land.

2) Can you execute concurrent ZFS CLI commands on the OS? Or is that dangerous like we found it to be on Solaris?

[+] ryao|11 years ago|reply
1. I am not aware of this specific issue. However, I am aware of an issue involving slow pool import with large numbers of zvols. Delphix has developed a fix for it that implements prefetching. It should be merged into the various Open ZFS platforms soon. It could resolve the problem that you describe.

2. Matthew Ahrens' synctask rewrite fixed this in Open ZFS. It took a while for the fix to propagate to tagged releases, but all Open ZFS platforms should now have it. ZoL gained it with the 0.6.3 release. Here is a link to a page with links to the commits that added this to each platform, as well as the months in which they were added:

http://open-zfs.org/wiki/Features#synctask_rewrite

[+] prakashsurya|11 years ago|reply
I can't recall the exact details from memory, but I believe #1 has to do with the fact that zfs creates/clones/snapshots/etc. are done in "syncing" context. Thus, each command has to wait for a full pool sync to complete, limiting the rate at which these can be done.

This is a known problem, and likely to be fixed in the not too distant future.
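
The growth is easy to observe with a timing loop; a sketch with hypothetical pool and dataset names:

```shell
# Because each clone completes in syncing context, per-command latency
# grows as the pool accumulates state. GNU time prints wall-clock seconds.
zfs snapshot tank/base@gold
for i in $(seq 1 500); do
    /usr/bin/time -f "clone $i: %es" zfs clone tank/base@gold "tank/clone$i"
done
```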

[+] ryao|11 years ago|reply
I am the author. Feel free to respond with questions. I will be watching for questions throughout the day.
[+] IgorPartola|11 years ago|reply
Is there a plan at some point to include a daemon or a cron job to run automatic zpool scrubbing? I believe this is a feature available in other OSes' packages, but not currently with ZoL. Currently, I include two cron jobs, like so:

    18 * * * * /sbin/zpool list | grep ztank | grep ONLINE > /dev/null || /sbin/zpool status
    35 1 * * 4 /sbin/zpool scrub ztank
This way cron simply emails me if there are errors. However, I'd like a lot more communication from my storage array: if errors are detected, I want to know right away.
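
For anyone wanting a noisier check, a crontab sketch (assuming the stock "all pools are healthy" message from `zpool status -x`; schedule and pool name are just examples):

```shell
# 'zpool status -x' only reports pools with problems, so filtering the
# healthy message means cron produces output (and therefore mail) only
# when something is actually wrong.
*/15 * * * * /sbin/zpool status -x | grep -v 'all pools are healthy'
35 1 * * 4   /sbin/zpool scrub ztank
```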

My other question is much more specific to my case. Stupidly, I bought the Western Digital Green drives for my server (running as a mirror of two drives). They are normally under no heavy load: just occasional file access to store/retrieve pictures, documents, or stream video. How likely am I to run into problems/should I replace these drives ASAP?

[+] JeremyNT|11 years ago|reply
I very much adore ZoL. Thank you for your efforts. Everything critical works and works very well.

While I get the sense that this is probably not the focus of your own work, do you have any thoughts on the maturity of the "share" facilities when using ZoL, and as the project matures will these become more of a priority? These are "nice to have" features that are obviously of relatively low importance.

You mentioned shareiscsi is unimplemented, but I also find that its friends sharenfs and especially sharesmb have some rough edges as well. When I moved a pool from an OI machine to a Linux machine, I had to massage all of my share attributes to make them function.

[+] Nursie|11 years ago|reply
I don't really have a question, but as a ZoL user I'd just like to say thanks for all the hard work.

It makes management of my disk arrays pretty painless and has some fantastic migration/recovery stuff going on. All of which I'm sure you know!

[+] Rapzid|11 years ago|reply
I followed your and Brian's contributions quite closely at my previous job. We had some pretty extreme backup targets for our VPSes, and the old tools were starting to become bothersome, particularly for keeping a couple hundred million files synced between data centers. I knew about ZFS's send/receive, but we were a Linux shop. About the time we (I) were going to make some major changes to the backup systems, I gave ZFS another Google, as I like to re-check my assumptions every now and then, and discovered ZoL had gone "stable" just that month! I immediately pushed to give it a spin and the rest is history. Learning ZFS was fantastic fun. It challenged everything I thought a filesystem was capable of. L2ARC, snapshots (cloning and shared data, wut?!), zvols, checksums, and on and on and on. Thanks for all your hard work and making this possible!
[+] foobarqux|11 years ago|reply
Can you talk about performance of virtual machine disks on ZFS? How is ZFS better/worse than BTRFS?
[+] Sanddancer|11 years ago|reply
What's performance like on ZoL compared to Solaris/OpenIndiana/FreeBSD?
[+] astral303|11 years ago|reply
Tried using ZFS in earnest and got spooked; felt it was not production-ready. Wanted to use ZFS for MongoDB on Amazon Linux (primarily for compression, but also for the snapshot functionality for backups). Tried 0.6.2.

Ended up running into a situation where a snapshot delete hung and none of my ZFS commands were returning. The snapshot delete was not killable with kill -9. https://github.com/zfsonlinux/zfs/issues/1283

Also, under load I encountered a kernel panic or a hang (I forget which); it turns out the Amazon Linux kernel comes compiled with no preemption. It seems that "voluntary preemption" is the only setting that's reliable. https://github.com/zfsonlinux/zfs/issues/1620

That left a bad taste in my mouth. Might be worth trying out 0.6.3 again.

I am still leafing through the issues closed in 0.6.3, but based on what I see, 0.6.2 did not seem production-ready-enough for me:

https://github.com/zfsonlinux/zfs/issues?page=2&q=is%3Aissue...

[+] ryao|11 years ago|reply
Your deadlock was likely caused by the sole regression to get by us in the 0.6.2 release:

https://github.com/zfsonlinux/zfs/commit/a117a6d66e5cf1e9d4f...

This occurred because it was rare enough that neither we nor the buildbots caught it back in February. George Wilson wrote a fix for it in Illumos rather promptly. However, the Illumos and ZoL projects had different formats for the commit titles of regression fixes. Specifically, the Illumos developers would reuse the exact same title, while the ZoL developers would generally expect a different title, so we missed it when merging work done in Illumos. I caught it in November when I was certain that George had made a mistake and noticed that our code and the Illumos code were different. It is fixed in 0.6.3. The fix was backported to a few distribution repositories, but not to all of them.

The 0.6.3 release was notable for having a very long development cycle. As I described in the blog post, the project will begin doing official bug fix releases when 1.0 is tagged. That should ensure that these fixes become available to all distributions much sooner. In the meantime, future releases are planned to have much shorter development cycles than 0.6.3 had, so fixes like this will become available more quickly.

That being said, I was at the MongoDB office in NYC earlier this year to troubleshoot poor performance on MongoDB. I will refrain from naming the MongoDB developer with whom I worked lest he become flooded with emails, but my general understanding is that 0.6.3 resolved the performance issues that MongoDB had observed. Future releases should further increase performance.

[+] agapon|11 years ago|reply
Great blog post! Something from personal experience. OpenZFS on FreeBSD feels mostly like a port of illumos ZFS where most of the non-FreeBSD-specific changes happen in illumos and then get ported downstream. On the other hand, OpenZFS on Linux feels like a fork. There is certainly a stream of changes from illumos, but there's a rather non-trivial amount of changes to the core code that happen in ZoL.
[+] ryao|11 years ago|reply
This is because Martin Matuška of FreeBSD has been focused on upstreaming changes made in FreeBSD's ZFS port into Illumos. At present, the ZFSOnLinux project has had no one dedicated to that task and code changes mostly flow from Illumos to Linux. This is starting to change. A small change went upstream to Illumos earlier this year and more should follow in the future.

That being said, there are commonalities between Illumos and FreeBSD that make it easier for the FreeBSD ZFS developers to collaborate with their Illumos counterparts:

1. FreeBSD and Illumos have large kernel stacks (4 pages and 6 pages respectively) while Linux's kernel stacks are limited to 2 pages.

2. In-kernel virtual memory is well supported in FreeBSD and Illumos while Linux's in-kernel virtual memory is crippled for philosophical reasons.

3. FreeBSD and Illumos have both the kernel and userland in the same tree. FreeBSD even maintained Illumos' directory structure in its import of the code while ZoL's project lead decided to refactor it to be more consistent with Linux.

Difficulties caused by these differences should go away as the changes made in ZoL to improve code portability are sent back to Illumos.

[+] bussiere|11 years ago|reply
I may have read the article too fast, but what about cryptography in ZoL? Is there a way to encrypt data on ZoL? Regards, and thanks for the article.
[+] ryao|11 years ago|reply
At present, you need to either encrypt the block devices beneath ZFS via LUKS or the filesystem on top of ZFS via ecryptfs. There are some guides on how to do this for each distribution.
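
A minimal sketch of the LUKS-underneath approach, assuming hypothetical device names:

```shell
# WARNING: luksFormat destroys whatever is on the named disks.
cryptsetup luksFormat /dev/sdb
cryptsetup luksFormat /dev/sdc
cryptsetup luksOpen /dev/sdb crypt0
cryptsetup luksOpen /dev/sdc crypt1

# Build the pool on the decrypted mappings rather than the raw disks,
# so everything ZFS writes lands on encrypted storage.
zpool create tank mirror /dev/mapper/crypt0 /dev/mapper/crypt1
```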

There is an open issue for integrating encryption into ZoL itself:

https://github.com/zfsonlinux/zfs/issues/494

This will likely be added to ZoL in the future, but no one is actively working on it at this time.

[+] feld|11 years ago|reply
Oracle extended ZFS to be able to encrypt specific filesystems, but this method has been heavily scrutinized for being susceptible to watermarking attacks.
[+] a2743906|11 years ago|reply
I'm using ZFS right now, because I need something that cares for data integrity, but the fact that it will never be included in Linux is a very big issue for me. Every time you upgrade your kernel, you have to upgrade the separate modules as well - this is the point where bad things can happen. I will definitely be looking into Btrfs once it is more reliable. For now I'm having a bit of a problem with SSD caching and performance, but don't care about it enough for it to be relevant, I just use the filesystem to store data safely and ZFS does an OK job.
[+] Andys|11 years ago|reply
I used ZFSonLinux on my laptop and workstation for a couple of years now, with Ubuntu, without any major problems. When I tried to use it in production, I didn't get data loss but I hit problems:

* Upgrading is a crapshoot: Twice, it failed to remount the pool after rebooting, and needed manual intervention.

* Complete pool lockup: in an earlier version, the pool hung and I had to reboot to get access to it again. If you look through the issues on github, you'll see weird lockups or kernel whoopsies are not uncommon.

* Performance problems with NFS: This is partially due to the Linux NFS server sucking, but ZFS made it worse. It used a lot of CPU compared to Solaris or FreeBSD, and was slow. It's even slow looping back to localhost.

* Slower on SSDs: ZFS does more work than other filesystems, so I found that it used more CPU time and had more latency on pure SSD-backed pools.

* There are alternatives to L2ARC/ZIL on Linux that are built in and work with any filesystem, such as "flashcache" on Ubuntu.

For these reasons, I think ZoL is good for "near-line" and backup storage, where you have a large RAID of HDDs and need stable, checksummed data storage, but not for mission-critical stuff like fileservers or DBs.

[+] ryao|11 years ago|reply
I mentioned most of these issues in the supplementary blog posts. Here is where each stands:

* There are issues when upgrading because the initramfs can store an old copy of the kernel module and the /dev/zfs interface is not stabilized. This will be addressed in the next 6 months by a combination of two things. The first is /dev/zfs stabilization. The second is bootloader support for dynamic generation of initramfs archives. syslinux does this, but it does not at this time support ZFS. I will be sending H. Peter Anvin patches to add ZFS support to syslinux later this year. Systems using the patched syslinux will be immune to this problem, while systems using GRUB2 will likely need to rely on the /dev/zfs stabilization.

* There are many people who do not have problems, but this is certainly possible. Much of the weirdness should be fixed in 0.6.4. In particular, I seem to have fixed a major cause of rare weirdness in the following pull requests, which had the side benefit of dramatically increasing performance in certain workloads:

https://github.com/zfsonlinux/spl/pull/369 https://github.com/zfsonlinux/zfs/pull/2411

* The above pull requests have a fairly dramatic impact on NFS performance. Benchmarks shown to me by SoftNAS indicate that all performance metrics have increased anywhere from 1.5 to 3 times. Those patches have not yet been merged as I need to address a few minor concerns from the project lead, but those will be rectified in time for 0.6.4. Additional benchmarks by SoftNAS have shown that the following patch that was recently merged increases performance another 5% to 10% and has a fairly dramatic effect on CPU utilization:

https://github.com/zfsonlinux/zfs/commit/cd3939c5f06945a3883...

* There is opportunity for improvement in this area, but it is hard for me to tell what you mean. In particular, I am not certain if you mean minimum latency, maximum latency, average latency, or the distribution of latency. In the last case, the following might be relevant:

https://twitter.com/lmarsden/status/383938538104184832/photo...

That said, I believe that the kmem patches that I linked above will also have a positive impact on SSDs. They reduce contention in critical code paths that affect low latency devices.

Additionally, there is at least one opportunity to improve our latencies. In particular, ZIL could be modified to use Force Unit Access instead of flushes. The problem with this is that not all devices honor Force Unit Access, so making this change could result in data loss. It might be possible to safely make it on SLOG devices as I am not aware of any flash devices that disobey Force Unit Access. However, data integrity takes priority. You can test whether a SLOG device would make a difference in latencies by setting sync=disabled temporarily for the duration of your test. All improvements in the area of SLOG devices will converge toward the performance of sync=disabled. If sync=disabled does not improve things, the bottleneck is somewhere else.
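
The sync=disabled test described above might look like this (the dataset name is hypothetical):

```shell
# Estimate the best case a SLOG could deliver by temporarily disabling
# synchronous semantics. Do NOT leave this set on data you care about.
zfs get sync tank/db            # note the current value first
zfs set sync=disabled tank/db
# ... run the latency-sensitive workload/benchmark here ...
zfs inherit sync tank/db        # restore the inherited default
```

If latency does not improve with sync=disabled, adding a SLOG device will not help, because the bottleneck is elsewhere.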

* These alternatives operate on the block device level and add opportunity for bugs to cause cache coherence problems that are damaging to a filesystem on top. They are also unaware of what is being stored, so they cannot attain the same level of performance as a solution that operates on internal objects.

[+] ryao|11 years ago|reply
I have been inundated with feedback from a wide number of channels. If I did not reply to a comment today, I will try to address it tomorrow.
[+] ashayh|11 years ago|reply
ZFS, and most other filesystems, are all about _one_ computer system.

While ZFS data integrity features may be useful, they don't prevent the wide variety of things that can go wrong on a _single_ computer. You still need site redundancy, multiple physical copies, recovery from user errors etc.

Large, modern enterprises are better off keeping data on application layer "filesystems" or databases, since they can more easily aggregate the storage of hundreds or thousands of physical nodes. ZFS doesn't help with anything special here.

For the average home user, ZoL modules are a hassle to maintain. You are better off setting up FreeNAS on a second computer if you really want to use ZFS. Otherwise there is nothing much over what XFS, ext4, or Btrfs can offer.

The 'ssm' set of tools for managing LVM and other built-in filesystems is easier for home users with regular needs.

GlusterFS and others are distributed filesystems, but they suffer from additional complexity at the OS and management layers.

[+] ryao|11 years ago|reply
The Lustre filesystem is able to use ZFSOnLinux for its OSDs. This gives it end to end checksum capabilities that I am told enabled the Lustre developers to catch buggy NIC drivers that were silently corrupting data.

Alternatively, there is a commercial Linux distribution called SoftNAS that implements a proprietary feature called snap replicate on top of ZFS send/recv. This allows it to maintain backups across availability zones, and is achieved by its custom management software running the zfs send/recv commands per user requests.

In the interest of full disclosure, my 2014 income tax filing will include income from consulting fees that SoftNAS paid me to prioritize fixes for bugs that affected them. I received no money for such services in prior tax years.

[+] mbreese|11 years ago|reply
I love ZFS, and I love working with Linux, but I can't help but worry about using ZFS on Linux. Without the needed support from the kernel side, I don't see how it can be useful for production. I can see using it on personal workstations, but for any situation where data loss is critical, you just won't see any uptake. Because of the licensing, ZFS can never be anything more than a second-class citizen on Linux.

That said, I run a FreeBSD ZFS file server just to host NFS that is exported over to a Linux cluster. At least on FreeBSD, there is first-class integration of ZFS into the OS. (I used to also maintain a Sun cluster that had a Solaris ZFS storage server that exported NFS over to Linux nodes, which is where I first got a taste for ZFS).

So, I guess my main question is: In what use cases is ZFS on Linux so useful when native FreeBSD/ZFS support exists?

I'm not saying it can't be done - I just don't understand why.

[+] michael_h|11 years ago|reply

  Without the needed support from the kernel side
Can you clarify what you mean by that?
[+] leonroy|11 years ago|reply
I've used ZFS (FreeNAS) for quite a few years and find it pretty flawless. Trust it's not too dumb a question but what advantage is there to running ZFS on Linux when you can run it on variants of Solaris or BSD just fine?
[+] DiabloD3|11 years ago|reply
I've used ZoL since it was created, and zfs-fuse before that. I ran it on my workstation for a few years (managing a 4x750GB RAID-Z (ZFS's RAID-5 equivalent), with ext3 on mdadm RAID 1 2x400GB for root), then swapped to 2x2TB Btrfs native RAID 1 (Btrfs being Oracle's ZFS competitor, which seems largely abandoned, although I see commits in the kernel changelog periodically). Now I'm back to ZFS on a dedicated file server using 2x128GB Crucial M550 SSDs + 2x2TB HDDs, set up as mdadm RAID 1 + XFS on the first 16GB of the SSDs for root[2], 256MB on each for ZIL[1], the rest as L2ARC[3], and the 2x2TB as a ZFS mirror. I honestly see no reason to use any other FS for a storage pool, and if I could reliably use ZFS as root on Debian, I wouldn't even need that XFS root in there.

All of this said, I get RAID 0'ed SSD-like performance with very high data reliability and without having to shell out the money for 2TB of SSD. And before someone says "what about bcache/flashcache/etc", ZFS had SSD caching before those existed, and ZFS imo does it better due to all the strict data reliability features.

[1]: ZFS treats multiple ZIL devs as round robin (RAID 0 speed without increased device failure taking down all your RAID 0'ed devices). You need to write multiple files concurrently to get the full RAID 0-like performance out of that because it blocks on writing consecutive inodes, allowing no more than one in flight per file at a time. ZIL is only used for O_SYNC writes, and it is concurrently writing to both ZIL and the storage pool, ie, ZIL is not a write-through cache but a true journal.

The failure of a ZIL device is only "fatal" if the machine also dies before ZFS can write to the storage pool, and the mode of the failure cannot leave the filesystem in an inconsistent state. ZFS does not currently support RAID for ZIL devices internally, nor is it recommended to hijack this and use mdadm to force it. It only exists to make O_SYNC work at SSD speeds.

[2]: /tank and /home are on ZFS, the rest of the OS takes up about 2GB of that 16GB. I oversized it a tad, I think. If I ever rebuild the system, I'm going for 4GB.

[3]: L2ARC is a second-level store for ZFS's in-memory cache, which is called ARC. ARC is a highly advanced caching system designed to increase performance by obsessively caching often-used data instead of being just a blind inode cache like the OS's usual cache, and it is independent of the OS's disk cache. L2ARC is sort of like a write-through cache, but more advanced: a persistent version of ARC that survives reboots and is much larger than system memory. L2ARC is implicitly round-robin (like how I described ZIL above), and it survives the loss of any L2ARC device with zero issues (it just disables the device; no unwritten data is stored there). L2ARC does not suffer from the non-concurrent-writing issue that ZIL "suffers" from (by design).
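
A rough sketch of building a layout like the one described above (device and partition names are hypothetical):

```shell
# Mirrored data vdev plus SSD partitions for ZIL (log) and L2ARC (cache).
zpool create tank mirror /dev/sdc /dev/sdd
zpool add tank log /dev/sda3 /dev/sdb3     # round-robin SLOG partitions
zpool add tank cache /dev/sda4 /dev/sdb4   # L2ARC partitions
zpool status tank                          # verify the vdev layout
```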

[+] rodgerd|11 years ago|reply
> (which was Oracle's ZFS competitor that seems to be largely abandoned although I see commits in the kernel changelog periodically)

Under heavy development, officially supported by most of the major commercial distros, and still designated by Linus as the ext* replacement as the standard Linux filesystem.

[+] foobarqux|11 years ago|reply
Can you speak more about why ZFS is better than BTRFS?
[+] ryao|11 years ago|reply
Debian is the last major Linux distribution where it is not easy to do / on ZFS. It might be possible to implement a module for Debian's initramfs generator in ZoL upstream. I suggest filing an issue to inquire about this possibility:

https://github.com/zfsonlinux/zfs/issues/new

[+] khc|11 years ago|reply
And then how do you use this file server to take advantage of the RAID 0'ed SSD-like performance? Do you export it as NFS? iSCSI?
[+] turrini|11 years ago|reply
I created the script below a while (about a year) ago. It debootstraps a working Debian Wheezy with ZFS on root (rpool) using only 3 partitions: /boot (128M), swap (calculated automatically), and rpool (mirrored or raidz'ed, according to the number of your disks).

All comments are in Brazilian Portuguese. I haven't had time to translate them to English. Someone could do it and file a pull request.

https://github.com/turrini/scripts/blob/master/debian-zol.sh

Hope you like it.

[+] ryao|11 years ago|reply
Thanks for sharing. I will let Debian users interested in / on ZFS know that this is available as they ask me about this sort of thing.
[+] andikleen|11 years ago|reply
When swap doesn't work, mmap is unlikely to work correctly either.

Figuring out why that is so is left as an exercise for the poster.

[+] nailer|11 years ago|reply
Putting production data on a driver maintained outside the mainline Linux kernel is a bad idea.

That isn't a licensing argument - I'm happy to use a proprietary nvidia.ko for gaming tasks, for example, because I won't be screwing up anyone's data if it breaks.

[+] 1amzave|11 years ago|reply
Maybe the odds of it doing so are smaller, but if you think a broken nvidia.ko can't screw up anyone's data I think you're simply mistaken.

Say Nvidia's driver has a use-after-free bug: it kmalloc()s a buffer, kfree()s it, then a filesystem kmalloc()s something and gets allocated the same buffer. If nvidia.ko then decides it still wants to use that buffer and writes something into it...kablooie.

Unless you start running some microkernel-ish thing with drivers each running in their own distinct address spaces, you're going to have a hard time avoiding this possibility.

[+] ryao|11 years ago|reply
You could be "screwing up" someone's data if an in-tree filesystem breaks. If you read the supplementary blog posts, you would have seen the following:

http://lwn.net/Articles/437284/

Nearly all in-tree filesystems can fail in the same way described there. ZFS cannot. That being said, no filesystem is a replacement for backups. This applies whether you use ZFS or not. If you care about your data, you should have backups.

[+] mrmondo|11 years ago|reply
While I like most parts of ZFS, these days BTRFS is both stable and performs well with a decent feature set. We moved from ZFS and EXT4 to BTRFS for a good portion of our production servers last year - and we haven't looked back.
[+] thijsb|11 years ago|reply
Do you run RAID5/6? I had that running for half a year, and it crashed often.

Now on ZFS (raidz) and it works flawlessly.

[+] seoguru|11 years ago|reply
I have a laptop running Ubuntu with a single SSD. Does it make sense to run it with ZFS to get compression and snapshots? If I add a hard drive, again, does it make sense (perhaps using the SSD as a cache (ARC?))?
[+] fsckin|11 years ago|reply
I've never heard of someone using ZFS with a single disk. You're probably better off with ext4.

The compression and deduplication features of ZFS are terrific on network filers. Compression could possibly improve performance slightly on a single-disk system.

With two disks, I'd say you'd probably be better off with running RAID0 (or no RAID at all) and having a great backup plan. Using another SSD to cache writes to another SSD doesn't make a whole lot of sense to me.

[+] curiousbiped|11 years ago|reply
Well, with ZFS you could squirt snapshots of /home to another box for backups. Since they're just the changed blocks, they'd be fairly small.
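
A sketch of what that looks like, with hypothetical dataset, snapshot, and host names:

```shell
# Incremental replication of home-directory snapshots to a backup host.
# Only blocks changed since @monday cross the wire.
zfs snapshot tank/home@tuesday
zfs send -i tank/home@monday tank/home@tuesday | \
    ssh backupbox zfs recv -F backup/home
```
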
[+] ryao|11 years ago|reply
Yes to using ZFS for your root drive. Probably no to using L2ARC in a laptop.
[+] awonga|11 years ago|reply
I've looked into ZFS before for distributions like FreeNAS. Is there any solution on the horizon for the massive memory requirements?

For example, needing 8-16 GB of RAM for something like an xTB home NAS is high.

[+] ryao|11 years ago|reply
The "massive memory requirements" only exist if you use data deduplication and care about write performance. Otherwise, ZFS does not require very much memory to run. It has a reputation to the contrary because ARC's memory usage is not shown as cache in the kernel's memory accounting, even though it is cache. This is an integration issue that would need to be addressed in Linus' tree.
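
For example, ARC's actual usage can be inspected directly from the kstat file ZoL exposes (standard location on ZoL systems):

```shell
# ARC statistics live in a kstat file rather than the kernel's normal
# page-cache accounting. This prints the current ARC size in bytes.
awk '$1 == "size" {print $3}' /proc/spl/kstat/zfs/arcstats
```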