
jmakov | 1 year ago

Would be interesting to know how files get stored. They don't mention any distributed FS solutions like SeaweedFS so once a drive is full, does the file get sent to another one via some service? Also ZFS seems an odd choice since deletions (esp of small files) at +80% full drive are crazy slow.

ryao|1 year ago

Unlike ext4, which locks the directory when unlinking, ZFS can scale parallel unlinking. Specifically, ZFS has range locks that permit directory entries to be removed in parallel from the extendible hash trees that store them. While this is relatively slow for sequential workloads, it is fast for parallel workloads. If you want to delete a large directory subtree quickly on ZFS, run the rm operations in parallel. For example, this will run faster on ZFS than a naive rm operation:

  # delete the files in parallel, then sweep up the empty directories
  find /path/to/subtree -type f | parallel -j250 rm --
  rm -r /path/to/subtree
A friend had this issue on spinning disks the other day. I suggested he do this, and the remaining files were gone in seconds; at the rate his naive rm was running, it should have taken minutes. It is a shame that rm does not implement a parallel unlink option internally (e.g. -j), which would be even faster, since it would eliminate the execve overhead and likely some directory lookup overhead too, versus using find and parallel to run many rm processes.

For something like Fastmail, which has many users, unlinking should already happen in parallel, so it will not be slow for them on ZFS.

By the way, that 80% figure has not been true for more than a decade. You are referring to the best-fit allocator being used to minimize external fragmentation under low-space conditions. The new figure is 96%. It is controlled by metaslab_df_free_pct in metaslab.c:

https://github.com/openzfs/zfs/blob/zfs-2.2.0/module/zfs/met...

Modification operations become slow when you are at/above 96% space filled, but that is to prevent even worse problems from happening. Note that my friend’s pool was below the 96% threshold when he was suffering from a slow rm -r. He just had a directory subtree with a large amount of directory entries he wanted to remove.
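On Linux, the tunable mentioned above is exposed as an OpenZFS module parameter, so you can inspect it without recompiling anything. A small sketch (the /sys/module path is the standard location for OpenZFS module parameters and only exists while the zfs module is loaded):

```shell
# Print the best-fit allocator threshold if OpenZFS is loaded.
show_bestfit_threshold() {
  p=/sys/module/zfs/parameters/metaslab_df_free_pct
  if [ -r "$p" ]; then
    # Default is 4: best-fit kicks in once a metaslab drops below
    # 4% free space, i.e. is more than 96% full.
    echo "metaslab_df_free_pct=$(cat "$p")"
  else
    echo "zfs module not loaded"
  fi
}
show_bestfit_threshold
```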

For what it is worth, I am the ryao listed here and I was around when the 80% to 96% change was made:

https://github.com/openzfs/zfs/graphs/contributors

switch007|1 year ago

I discovered this yesterday! Blew my mind. I had to check 3 times that the files were actually gone and that I specified the correct directory as I couldn't believe how quick it ran. Super cool

brongondwana|1 year ago

Unlinking gets done asynchronously on the weekends from Cyrus, using the `cyr_expire` tool. Right now it only runs one unlinking process at a time on the whole machine due to historical ext4 issues ... but maybe we should revisit that now we're on ZFS and NVMe. Thanks for the reminder.

jmakov|1 year ago

Thank you very much for sharing this, very insightful.

shrubble|1 year ago

The open-source Cyrus IMAP server, which they mention using, has replication built in. ZFS also has built-in replication available.

Deletion of files depends on how they have configured the message store; they may be storing a lot of data in a database, for example.
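ZFS's built-in replication works by snapshotting a dataset and piping the snapshot stream to a receiver. A sketch, with placeholder pool/dataset names ("tank/mail", "backup/mail") that are not from the article, and guarded so it only prints the plan when no zfs binary is available:

```shell
# Snapshot a dataset and replicate it with zfs send | zfs receive.
replicate_snapshot() {
  src=$1 snap=$2 dest=$3
  if command -v zfs >/dev/null 2>&1; then
    zfs snapshot "$src@$snap"
    zfs send "$src@$snap" | zfs receive "$dest"
  else
    echo "would run: zfs send $src@$snap | zfs receive $dest"
  fi
}
replicate_snapshot tank/mail nightly backup/mail
```

After an initial full send, subsequent runs can use an incremental stream (zfs send -i old-snap new-snap) so only the delta since the last snapshot crosses the wire.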

mastax|1 year ago

ZFS replication is quite unreliable when used with ZFS native encryption, in my experience. I didn't lose data, but I hit constant bugs.

ackshi|1 year ago

Keeping enough free space should be much less of a problem with SSDs. They can tune it so the array needs to be 95% full before the slower best-fit allocator kicks in. https://openzfs.readthedocs.io/en/latest/performance-tuning....

I think that 80% figure is from when drives were much smaller and finding free space over that threshold with the first-fit allocator was harder.

brongondwana|1 year ago

Emails are stored in cyrus-imapd.

For now, the "file storage" product is a node tree in MySQL, with content stored in a content-addressed blob store, which is some custom crap I wrote 15 years ago that is still going strong because it's so simple there's not much to go wrong.

We do plan to eventually move the blob storage into Cyrus as well though, because then we have a single replication and backup system rather than needing separate logic to maintain the blob store.
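The content-addressed store described above is custom and not public, but the general technique is simple enough to sketch: a blob's address is the hash of its content, so identical content deduplicates automatically and stored blobs are immutable. A minimal illustrative version, assuming SHA-256 addresses and a one-level fan-out directory layout:

```shell
# Minimal content-addressed blob store sketch (illustrative only).
STORE=${STORE:-/tmp/blobstore}

put_blob() {
  # Store a file under its content hash and print the address.
  hash=$(sha256sum "$1" | cut -d' ' -f1)
  dir="$STORE/$(printf %s "$hash" | cut -c1-2)"
  mkdir -p "$dir"
  cp "$1" "$dir/$hash"
  echo "$hash"
}

get_blob() {
  # Map an address back to the stored file's path.
  echo "$STORE/$(printf %s "$1" | cut -c1-2)/$1"
}
```

Because blobs never change in place, replication and backup reduce to copying new files, which is part of why such a store can stay simple for a long time.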