I've had a few cases of data loss related to ZFS encryption, causing the total loss of a dataset and all of its snapshots. The key used by the dataset is simply missing from the keystore, so it fails to mount with an I/O error. We have no idea why or how it could happen, but the pool also had a lot of these "innocuous" bugs, while ZFS never reported a single error from the backing disks. This happened across two full rebuilds of the same pool (from scratch, using zpool-create and manual recreation of all datasets with rsync), but on the same hardware and with the same workload. I am 99.999% sure that this is caused by the native encryption code, probably compounded by regularly sending snapshots (not raw sends, though).

Weirdly, this only happened on a few datasets that were not used much; the datasets with lots of I/O have only had the innocuous errors (the ones that refer to deleted files).
I did try debugging some of this with a ZFS developer. We were not able to recover the data, but we dug deep enough to see that something was very wrong with these datasets (it was not just a bit flip somewhere; rather, the dataset referenced a key from the keystore that was supposed to exist but didn't).