top | item 46707137

zigzag312 | 1 month ago

It depends on how you calculate statistics. If you are designing a file format that hundreds of millions of users will use over its lifetime (storing billions of files), what are the chances that a 32-bit checksum will fail to catch at least one corruption? Think of transfers over an unstable wireless internet connection, storage on a cheap flash drive, a poor HDD with a higher error rate, unstable RAM, etc. We want to avoid data corruption if we can, even in less than ideal conditions. The cost of going from 32-bit to 64-bit hashes is very small.

fc417fc802|1 month ago

No, it doesn't "depend on how you calculate statistics". Or rather, you are not asking the right question. We do not care if a different person suffers a false negative. The question is whether you, personally, are likely to suffer a false negative. In other words, will any given real-world deployment of the solution be expected to suffer from an unacceptably high rate of false negatives?

Answering that requires figuring out two things: the sort of real-world deployment you're designing for, and what the acceptable false negative rate is. For an extremely conservative lower bound, suppose 1 error per TiB per year and 1000 TiB of storage. Each error slips past a 32-bit checksum with a probability of roughly 2^-32, so that gives a 99.99998% success rate for any given year. That translates to expecting 1 false negative every 4 million years.
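
To make the arithmetic above easy to check, here is a rough sketch of it in Python. It assumes an idealized checksum, where a random corruption goes undetected with probability 2^-bits; the function and constant names are made up for illustration, and the inputs are just the pessimistic figures from the paragraph above.

```python
# Pessimistic inputs taken from the scenario above (assumptions, not measurements).
ERRORS_PER_TIB_PER_YEAR = 1
STORAGE_TIB = 1000
CHECKSUM_BITS = 32


def undetected_per_year(errors_per_tib_per_year, storage_tib, checksum_bits):
    """Expected number of corruptions per year that the checksum misses.

    Assumes an idealized checksum: each corruption passes verification
    with probability 2**-checksum_bits.
    """
    corruptions_per_year = errors_per_tib_per_year * storage_tib
    miss_probability = 2.0 ** -checksum_bits
    return corruptions_per_year * miss_probability


rate = undetected_per_year(ERRORS_PER_TIB_PER_YEAR, STORAGE_TIB, CHECKSUM_BITS)
success_rate = 1 - rate               # chance of a year with no false negative
years_per_false_negative = 1 / rate   # expected wait for one false negative

print(f"yearly success rate: {success_rate:.5%}")
print(f"years per false negative: {years_per_false_negative:,.0f}")
```

With these inputs the wait comes out a little over 4 million years, matching the figure quoted above.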

I don't know about you but I certainly don't have anywhere near a petabyte of data, I don't suffer corruption at anywhere near a rate of 1 event per TiB per year, and I'm not in the business of archiving digital data on a geological timeframe.

32 bits is more than fit for purpose.

zigzag312|1 month ago

I can't say I agree with your logic here. We are not talking about any specific backup or anything like that. We are talking about the design of a file format that is going to be used globally.

A business running a lottery has to calculate the odds of anyone winning, not just the odds of a single person winning. Likewise, the designer of a file format has to consider the chances across all users: what percentage of users will be affected by any given design decision.

For example, what if you offered a guarantee that a 32-bit hash will protect users from corruption, and generously compensated anyone who suffered this type of corruption? How would you calculate the probability then?
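
One way to sketch the "lottery operator" version of the calculation is below. It assumes an idealized checksum (each corruption escapes detection with probability 2^-bits) and independent corruption events. The event count is a purely hypothetical placeholder, not a figure from this thread; plug in your own estimate.

```python
import math


def p_any_undetected(corruption_events, checksum_bits):
    """Probability that at least one of the corruption events goes undetected.

    Computes 1 - (1 - 2**-bits)**events, using log1p/expm1 so the result
    stays accurate when the per-event miss probability is tiny.
    """
    miss = 2.0 ** -checksum_bits
    return -math.expm1(corruption_events * math.log1p(-miss))


# Hypothetical total number of corruption events across every user of the
# format over its lifetime. This value is an assumption for illustration only.
HYPOTHETICAL_EVENTS = 10**9

p32 = p_any_undetected(HYPOTHETICAL_EVENTS, 32)
p64 = p_any_undetected(HYPOTHETICAL_EVENTS, 64)

print(f"32-bit: P(at least one miss) = {p32:.3f}")
print(f"64-bit: P(at least one miss) = {p64:.3e}")
```

Under that assumed event count, the chance that somebody, somewhere, hits an undetected corruption is on the order of tens of percent with 32 bits and vanishingly small with 64 bits. The result is only as good as the event-count estimate.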