item 46705554

zigzag312 | 1 month ago

I've seen a few arguments that, with the amount of data we have today, the 1-in-2^32 collision can actually happen, but I can't vouch that their calculations were done correctly.

The README in the SMHasher test suite also seems to indicate that 32 bits may be too few for file checksums:

"Hash functions for symbol tables or hash tables typically use 32 bit hashes, for databases, file systems and file checksums typically 64 or 128bit, for crypto now starting with 256 bit."


fc417fc802 | 1 month ago

That's vaguely describing common practices, not what's actually necessary or why. It also doesn't address my note that the purpose of the hash is important. Are "file systems" and "file checksums" referring to globally unique handles, content addressed tables, detection of bitrot, or something else?

For detecting file corruption, the amount of data alone isn't the issue. Rather, what matters is the rate at which corruption events occur. If I have 20 TiB of data and experience corruption at a rate of only 1 event per TiB per year (for simplicity, assume each event occurs in a separate file), that's only 20 events per year. I don't know about you, but I'm not worried about the false negative rate on that at 32 bits. And from personal experience, that hypothetical is a gross overestimation of real-world corruption rates.
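A quick back-of-the-envelope check of that claim (my sketch, not from the comment — it assumes each corruption event has an independent 2^-32 chance of producing a matching 32-bit checksum):

```python
# Probability that at least one of n independent corruption events
# slips past a `bits`-wide checksum (i.e. a false negative).
def p_missed(n: int, bits: int) -> float:
    p_collision = 2.0 ** -bits  # chance one corrupted file still matches its checksum
    return 1.0 - (1.0 - p_collision) ** n

# 20 corruption events per year against a 32-bit checksum:
print(p_missed(20, 32))  # roughly 4.7e-9, i.e. about 1 in 200 million per year
```

At 20 events per year, you'd expect to wait on the order of a hundred million years before a 32-bit checksum misses a corruption, which is the point being made.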

zigzag312 | 1 month ago

It depends on how you calculate the statistics. If you are designing a file format that hundreds of millions of users will use over its lifetime (storing billions of files), what are the chances that a 32-bit checksum will miss at least one corruption? Transfers over unstable wireless connections, storage on cheap flash drives, HDDs with higher error rates, unstable RAM, etc. We want to avoid data corruption even in less than ideal conditions, and the cost of going from 32-bit to 64-bit hashes is very small.
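The same false-negative calculation looks very different at population scale. A sketch under an assumed (hypothetical) load of one billion corruption events across all users over the format's lifetime, using numerically stable log1p/expm1 so the 64-bit case doesn't underflow to zero:

```python
import math

# Probability that at least one of n independent corruption events
# evades a `bits`-wide checksum, computed stably for tiny probabilities.
def p_missed(n: float, bits: int) -> float:
    return -math.expm1(n * math.log1p(-(2.0 ** -bits)))

events = 1_000_000_000  # hypothetical: a billion corruption events format-wide

print(p_missed(events, 32))  # ~0.21 -- a missed corruption somewhere is quite likely
print(p_missed(events, 64))  # ~5.4e-11 -- effectively never
```

Under that assumption, a 32-bit checksum misses at least one corruption with about 21% probability, while 64 bits drives the same risk to negligible levels, which is the trade-off the comment is pointing at.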