(no title)
klempner | 1 month ago
While the advice is sound, this number isn't the right number for this argument.
That 10^15 number is for UREs, which aren't going to cause silent data corruption -- simple naive RAID style mirroring/parity will easily recover from a known error of this sort without any filesystem layer checksumming. The rates for silent errors, where the disk returns the wrong data that benefit from checksumming, are a couple of orders of magnitude lower.
allanjude|1 month ago
Without a checksum, hardware RAID has no way to KNOW it needs to use the parity to correct the block.
klempner|1 month ago
iberator|1 month ago
thfuran|1 month ago
digiown|1 month ago
The consumer grade drives are often given an even lower spec of 1 in 1e14. For a 20TB drive, that's more than one error every scrub, which does not happen. I don't know about you, but I would not consider a drive to be functional at all if reading it out in full would produce more than one error on average. Pretty much nothing said on that datasheet reflects reality.
alexfoo|1 month ago
I would expect the ZFS code is written with the expected BER in mind. If it reads something, computes the checksum and goes "uh oh" then it will probably first re-read the block/sector, see that the result is different, possibly re-read it a third time and if all OK continue on without even bothering to log an obvious BER related error. I would expect it only bothers to log or warn about something when it repeatedly reads the same data that breaks the checksum.
Caveat Reddit but https://www.reddit.com/r/zfs/comments/3gpkm9/statistics_on_r... has some useful info in it. The OP starts off with a similar premise that a BER of 10^-14 is rubbish but then people in charge of very large pools of drives wade in with real world experience to give more context.