top | item 38154598

redsaz | 2 years ago

This isn't the first time I've heard concerns about using hashes to check file equality. I've considered adding a "paranoid mode" that does direct byte-for-byte checks, for folks who don't want to accept even a so-remote-it's-virtually-impossible theoretical chance of a collision.

I'd go into the math about how remote a chance it would be (barring any discovered hash collisions), but others have explained it better than I could elsewhere.

bodyfour | 2 years ago

"The math" only matters for random collisions, which are effectively impossible (less likely than the CPU malfunctioning). However, that tells you nothing about maliciously constructed files. Even if a hash function has no known collisions today, that doesn't mean they won't be found someday.

But as I tried to describe (probably in way too much detail), the real problem with "hash everything, compare hashes afterwards" is that it implies reading all of every file's contents even when that isn't needed to prove uniqueness. For a lot of common use cases (big files, few dupes) this can mean doing 1000x more I/O than you need.
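The usual way to avoid that I/O is to bucket candidates by size first, since files of different sizes can't possibly be duplicates. A minimal sketch (the `group_by_size` helper is hypothetical, not from any particular dedup tool):

```python
import os
from collections import defaultdict

def group_by_size(paths):
    """Bucket files by size; only same-size files can be duplicates."""
    groups = defaultdict(list)
    for path in paths:
        groups[os.path.getsize(path)].append(path)
    # Files whose size is unique are provably not dupes,
    # so their contents never need to be read at all.
    return {size: files for size, files in groups.items() if len(files) > 1}
```

In the "big files, few dupes" case, most files fall out at this stage for the cost of a `stat` call each, which is where the 1000x I/O savings comes from.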

Once you design the solution around avoiding unneeded I/O, you find that hashing stops being useful.
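Within a same-size group, the I/O-minimizing approach is a direct chunked comparison that bails out at the first differing chunk, something a hash can never do because it must consume the whole file. A sketch under those assumptions:

```python
def files_equal(path_a, path_b, chunk_size=1 << 16):
    """Compare two files chunk by chunk, stopping at the first difference.

    Unlike hashing, this never reads past the first differing chunk,
    and it needs no hash state or collision assumptions at all.
    """
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                return False
            if not a:  # both files hit EOF with identical content
                return True
```

(Hashing still earns its keep when one file must be compared against many, since each file is then read only once; for pairwise checks the early exit wins.)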

redsaz | 2 years ago

> that tells you nothing about maliciously constructed files. Even if a hash function has no known collisions today, that doesn't mean they won't be found someday.

This is what I meant by "barring any discovered hash collisions," but in retrospect I didn't make that clear enough.

Though, if you're crafting your own malicious different-content-same-size files and storing them on your NAS just to trigger a hash collision that makes them appear identical, then I bet several governments would pay top dollar for your abilities :D

Or, in a different scenario, say you're hosting a Dropbox-like service and storing files for hundreds of thousands of users: you shouldn't be using a duplicate-file-finding util for that anyway; deduplication would be better implemented at a different layer.

In the scenario you describe (lots of big same-size files, few dupes), I agree that hashing the entire file would be wasteful. But from my experience on my file server, when two or more files had the same size and that size was larger than a few MB, they likely had the same content.

Put another way: if multiple files of the same sufficiently-large size are encountered, expect to read those files in their entirety anyway, whether hashing or checking byte-for-byte, because they are likely dupes. So there's still potential for perf gains by avoiding hashing, but I'm willing to bet it isn't as much as one would hope/expect.

(You do have me curious as to how much difference it could make, though)

Edit: I'm also willing to admit that I have so many dupes because my backup strategy is TRASH and I have dupes everywhere, so my scenario could be more unusual than other people's.