(no title)
redsaz | 2 years ago
I'd go into the math about how remote a chance it would be (barring any discovered hash collisions), but others have explained it better than I could elsewhere.
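For the curious, the usual back-of-envelope version of that math is the birthday bound: with k files and a d-bit hash that behaves like a uniform random function (which is the assumption, absent a discovered break), the chance of any accidental collision is at most k(k-1)/2^(d+1). A minimal sketch:

```python
def collision_probability(num_files: int, hash_bits: int) -> float:
    """Birthday-bound estimate of an accidental hash collision.

    Assumes hash outputs behave like uniform random hash_bits-bit
    values; this is an upper bound, P <= k*(k-1) / 2^(bits+1).
    """
    return num_files * (num_files - 1) / 2 ** (hash_bits + 1)

# Even a billion files hashed with SHA-256 gives a probability
# so small it's dwarfed by, say, cosmic-ray bit flips:
p = collision_probability(10**9, 256)
```

For a billion files the bound works out to roughly 10^-60, which is the sense in which "barring any discovered hash collisions" is doing all the work in that caveat.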
bodyfour | 2 years ago
But as I tried to describe (probably in way too much detail), the real problem with "hash everything, compare hashes afterwards" is that it implies reading every file's full contents even when that isn't needed to prove uniqueness. For a lot of common use cases (big files, few dupes) this can mean doing 1000x more I/O than you need.
Once you design the solution around avoiding unneeded I/O, you find that hashing also stops being useful.
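To make the I/O-avoidance point concrete, here's a minimal sketch of that design (my reading of the approach, not bodyfour's actual code): group files by size first, so files with a unique size cost zero bytes of content I/O, then do a block-by-block byte comparison within each same-size group that bails out at the first differing block. No hashing anywhere:

```python
import os
from collections import defaultdict
from itertools import combinations

BLOCK = 64 * 1024  # read granularity for comparisons


def files_equal(path_a: str, path_b: str) -> bool:
    """Byte-for-byte compare; stops reading at the first differing block."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(BLOCK)
            b = fb.read(BLOCK)
            if a != b:
                return False
            if not a:  # both files hit EOF together
                return True


def find_dupes(root: str) -> list:
    """Return pairs of duplicate files under root."""
    by_size = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    dupes = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # unique size: file contents never read at all
        for a, b in combinations(paths, 2):
            if files_equal(a, b):
                dupes.append((a, b))
    return dupes
```

The pairwise comparison is quadratic per size group, so a real tool would further partition each group by the first block before comparing fully; but the key property survives: files that differ early, or have no size twin, never get read end-to-end.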
redsaz | 2 years ago
This is what I meant by "barring any discovered hash collisions", but in retrospect I didn't make that clear enough.
Though, if you're crafting your own malicious different-content-same-size files and storing them on your NAS to force a hash collision so they appear identical, then I bet several governments would pay top dollar for your abilities :D
Or, different scenario: say you're hosting a Dropbox-like service and storing files for hundreds of thousands of users. Then you shouldn't be using a duplicate-file-finding util at all; deduplication would be better implemented at a different layer.
In the scenario you describe (lots of big files of the same size, few dupes), I agree hashing the entire file would be wasteful. From my experience on my file server, when two or more files had the same size and that size was larger than a few MB, they usually had the same content.
Put another way: if multiple files of the same sufficiently large size are encountered, expect to read those files in their entirety anyway, whether hashing or checking byte-for-byte, because they are likely dupes. So there's still potential for perf gains by avoiding hashing, but I'm willing to bet it isn't as much as one would hope/expect.
(You do have me curious as to how much difference it could make, though)
Edit: I'm also willing to admit that I have so many dupes because my backup strategy is TRASH and I have dupes everywhere, so my scenario could be more unusual than other people's.