As far as I know, the standard tool for this is rdfind. This new tool claims to be "blazingly fast", so it should provide something to back that up. Ideally a comparison with rdfind, but even a basic benchmark would make it less dubious. https://github.com/pauldreik/rdfind
But the main problem is not the suspicious performance, it's the lack of explanation. The tool is supposed to "find duplicate files (photos, videos, music, documents)". Does it mean it is restricted to some file types? Does it find identical photos with different metadata to be duplicates? Compare this with rdfind which clearly describes what it does, provides a summary of its algorithm, and even mentions alternatives.
Overall, it may be a fine toy/hobby project (only 3 commits, the last one 3 months ago). I didn't read the code (except to find the command-line options). I don't get why it got so much attention.
I think we need a lookup table of marketing speech to real-world performance metrics. Blazingly fast has been showing up a lot lately.
The cynical side of me wants to know what features and safety checks a "blazingly fast" tool has not implemented that the older "glacially slow" tool it is replacing ended up implementing after all the edge conditions were uncovered.
Do you know of any tool that does a good job of finding files that differ only in their metadata or even better can use a perceptual hash to find possible matches? Geeqie's find duplicates seems to do the latter, but afaict you can't run that function from the command line.
Over the years I've used many, many tools intended to solve this problem. In the end, after much frustration, I just use existing tools, glued together in a un*x manner.
find . -type f -exec md5sum '{}' ';' \
| tee /tmp/index_file.txt \
| gawk '{print $1}' \
| sort | uniq -c \
| gawk '$1 > 1 { print $2 }' \
> /tmp/duplicates.txt
for m in $( cat /tmp/duplicates.txt )
do
grep "$m" /tmp/index_file.txt
echo ========
done \
| less
Tweak as necessary. I do have a comparison executable that only compares sizes and sub-portions to save time, but I generally find it's not worth it.
It takes less time to type this than it does to remember what some random other tool is called, or how to use it. I also have saved a variant that identifies similar files, and another that identifies directory structures with lots of shared files, but those are (understandably) more complex (and fragile).
Most programs I tested have very simple basic usage: just the program name and a list of directories. I doubt typing the above would be faster, and figuring it out for the first time definitely wouldn't be. Also, executing that on a million files would take ages, even compared to the slowest proper duplicate finders.
Anyway, thanks for sharing - it is always very exciting to see how far you can go with a few unix utilities and a bit of scripting :)
A shameless plug, but it's a simple —and probably badly written— tool I made many years ago to scratch an itch, and I still use it.
It finds duplicate files and replaces them with hard links, saving you space. Just make sure you provide it with paths in the same filesystem.
I originally wrote it to save some space from personal files (videos, photos, etc), but it turned out very useful for tar files, docker images, websites, and more.
For example I maintain a tar file and a docker image with Kafka connectors which share many jar files. Using duphard I can save hundreds of megabytes, or even more than a gigabyte! For a documentation website with many copies of the same image (let's just say some static generators favor this practice for maintaining multiple versions), I can reduce the website size by 60%+, which then makes ssh copies, docker pulls, etc. much faster, speeding up deployment times.
https://github.com/andmarios/duphard
This program uses CRC32 to compute hashes. This is a terrible idea: a 32-bit hash is just too short and the probability of collisions is way too high. Around 77,000 files are enough to reach a 50% probability of at least one collision, and the odds are already non-negligible at a few thousand. Even though this is mitigated by additional matching on extensions and sizes, I wouldn't trust it to delete any files.
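The birthday bound makes these numbers concrete; a quick calculation, assuming an ideal uniformly distributed 32-bit hash:

```python
import math

def collision_probability(n_files: int, hash_bits: int) -> float:
    """Birthday approximation: probability of at least one collision
    among n_files values drawn uniformly from a 2**hash_bits space,
    i.e. 1 - exp(-n(n-1) / 2m)."""
    m = 2 ** hash_bits
    return 1.0 - math.exp(-n_files * (n_files - 1) / (2 * m))

# For a 32-bit hash, collisions become likely surprisingly quickly:
print(collision_probability(10_000, 32))   # roughly 1%
print(collision_probability(77_000, 32))   # roughly 50%
print(collision_probability(77_000, 128))  # effectively zero
```

The same calculation shows why the 128-bit hashes mentioned below are a different story entirely.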
Use fclones, fslint, jdupes, rdfind instead, which either use much stronger hashes (128-bit) or even verify files by direct byte-to-byte comparison.
AFAIK CRCs are not the fastest "hashes" you can get. Some non-cryptographic hashes outperform CRCs by a large factor while providing longer checksums and much better statistical properties.
I've been using 'czkawka' since its earliest inception. It seems to do a similar task using file hashes, but can also search for matching pictures and the like.
https://github.com/qarmin/czkawka
As the author of the tool, thanks a lot for the wonderful input! Many comments are actionable, and I'll incorporate them in the code soon.
Now to address a few concerns:
# The tool doesn't delete anything -- As the name suggests, it just finds duplicates. Check it out.
# File uniqueness is determined by file extension + file size + CRC32 of first 4KiB, middle 2KiB and last 2KiB
# The above may not seem like much. But on my portable hard drive with >172K files (a mix of video, audio, pics and source code), I got the same number of collisions as with "SHA-256 of the entire file" (by the way, I'm planning to add an option to the tool to do exactly that)
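For concreteness, the fingerprint described above could be sketched like this in Python (the function name and the exact chunk-seeking details are my own reading of the description, not taken from the tool's code):

```python
import os
import zlib

def fingerprint(path: str) -> tuple:
    """Fingerprint a file as (extension, size, CRC32 of sampled chunks):
    the first 4 KiB, the middle 2 KiB, and the last 2 KiB."""
    size = os.path.getsize(path)
    ext = os.path.splitext(path)[1].lower()
    with open(path, "rb") as f:
        crc = zlib.crc32(f.read(4096))        # first 4 KiB
        f.seek(max(0, size // 2 - 1024))
        crc = zlib.crc32(f.read(2048), crc)   # middle 2 KiB
        f.seek(max(0, size - 2048))
        crc = zlib.crc32(f.read(2048), crc)   # last 2 KiB
    return (ext, size, crc)
```

Two files are then duplicate candidates when their fingerprint tuples compare equal.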
This sounds interesting and should probably be run on a whole system. What if you run it on the files of the OS itself, e.g. the whole C drive, or wherever the Linux system files are? Will there be any collisions?
FWIW if people are interested, I wrote https://github.com/karteum/kindfs for the purpose of indexing the hard drive, with the following goals:
* being able to detect not only duplicate files but also duplicate dirs (without returning all their sub-contents as duplicates)
* being able to query multiple times without having to re-scan, and to do other types of queries (i.e. I compute a hash for all files, not only those with duplicate sizes; this makes scanning slower but enables other use-cases. N.B. I only hash fixed portions of files larger than 3 MB, which is enough for my use-case considering that I always triple-check the results, and is a reasonable performance tradeoff, but it might not be OK for everyone!)
* being able to tell whether all files in dir1/ are included in dir2/ (regardless of file/dir structure)
* being able to mount the sqlite index as a FUSE FS (which is convenient for e.g. diff -r or qdirstat...)
Still work-in-progress, yet it works for several of my use-cases
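The dir1/-included-in-dir2/ check from the list above reduces to set containment over content hashes. A naive sketch of the idea (hashing whole files, not kindfs's actual implementation, which uses a sqlite index and partial hashing):

```python
import hashlib
import os

def content_hashes(root: str) -> set:
    """Set of SHA-256 digests of all files under root, ignoring structure."""
    hashes = set()
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            with open(os.path.join(dirpath, name), "rb") as f:
                hashes.add(hashlib.sha256(f.read()).hexdigest())
    return hashes

def included_in(dir1: str, dir2: str) -> bool:
    """True if every file in dir1 also exists (by content) somewhere in
    dir2, regardless of file/dir names or nesting."""
    return content_hashes(dir1) <= content_hashes(dir2)
```

Ignoring names and nesting is what makes the check "regardless of file/dir structure".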
The first one I found, and still use since it became obvious that fslint is EOL, is czkawka [0] (Polish for "hiccup"). Its speed is an order of magnitude higher than fslint's, and its memory use is 20%-75% of it.
<;)> Satisfied customer, would buy it again. </;)>
[0] https://github.com/qarmin/czkawka
On Windows if you download the same file more than once you will have foo.doc, "foo (1).doc", "foo (2).doc" etc. A script that just looked for files with such names, compared them to foo.doc, and deleted them if they are the same would be useful.
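Such a script could be sketched like this (the filename regex and the comparison strategy are my assumptions; it only reports the redundant copies, leaving deletion to the reader):

```python
import filecmp
import os
import re

# Matches Windows-style download copies: "foo (1).doc", "foo (2).doc", ...
COPY_RE = re.compile(r"^(?P<stem>.+) \(\d+\)(?P<ext>\.[^.]+)?$")

def find_redundant_copies(directory: str) -> list:
    """Return paths like 'foo (1).doc' whose content is byte-identical
    to the corresponding 'foo.doc' in the same directory."""
    redundant = []
    for name in sorted(os.listdir(directory)):
        m = COPY_RE.match(name)
        if not m:
            continue
        original = m.group("stem") + (m.group("ext") or "")
        orig_path = os.path.join(directory, original)
        copy_path = os.path.join(directory, name)
        if os.path.isfile(orig_path) and filecmp.cmp(orig_path, copy_path, shallow=False):
            redundant.append(copy_path)  # content matches the original
    return redundant
```

`filecmp.cmp(..., shallow=False)` forces a full content comparison instead of trusting os.stat metadata.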
Does it only find duplicate files or will it also find duplicate directory hierarchies?
Example:
/some/location/one/January/Photos
/some/location/two/January/Photos
I need a tool that would return a match on January directory.
It would be great to be able to filter things. So for example, if I have backups of my dev folder, I want to filter out all the virtual envs (venv below):
/home/HumblyTossed/dev/venv/bin
/home/HumblyTossed/backups/dev/venv/bin
If you want something that scales horizontally (mostly), dcmp from https://github.com/hpc/mpifileutils is an option. It can chunk up files and do the comparison in parallel on multiple servers.
No, CRC32 doesn't care what you throw at it. Chunking is about the speed of building a "database", for lack of a better word, of the files: instead of CRC32-ing the entire file, you just hash chunks of it, which is faster. However, this approach is definitely flawed, as there are plenty of file types whose beginning and end are identical, so only the CRC32 of the middle section might actually be useful. And CRC32 has a smaller output space, hence collisions have a higher chance of happening.
The better approach might be, for same-size files, to just Seek(FileSize div 2) and read 32 bytes from there. If those are identical with another file's, start a full file comparison and stop as soon as one byte diverges. If multiple files share those same middle bytes, then maybe compute a full SHA-256 for each file and compare those.
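That approach could be sketched roughly like this (an illustration of the idea, not anyone's actual implementation):

```python
import os

def middle_probe(path: str, probe_len: int = 32) -> bytes:
    """Read probe_len bytes starting at the middle of the file."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        f.seek(size // 2)
        return f.read(probe_len)

def same_content(path_a: str, path_b: str) -> bool:
    """Full byte-by-byte comparison, stopping at the first difference."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a, b = fa.read(65536), fb.read(65536)
            if a != b:
                return False
            if not a:       # both streams exhausted with no difference
                return True

def probably_duplicates(path_a: str, path_b: str) -> bool:
    """Cheap checks first (size, then middle bytes), full compare last."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    if middle_probe(path_a) != middle_probe(path_b):
        return False
    return same_content(path_a, path_b)
```

The point of the ordering is that the expensive full comparison only runs on pairs that already agree on size and on the 32 middle bytes.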
Also, as other commenters pointed out, you might have the same content but different metadata (videos, pictures, etc.), so that needs to be handled as well.
It initially groups files that have the "same extension and same size", so you're out of luck if you have two copies named foo.jpg and foo.jpeg.
Then, it cheats by computing a crc32 (!) of the beginning, middle, and end bytes of the file and groups together files that have the same crc32.
So, it'll mostly work, but miss a lot of duplicates, and potentially flag different files as duplicates.
find ... \! -empty ...
Empty files all have the same hash, but they do not need to be treated as duplicates.
E.g. see this:
* https://www.strchr.com/hash_functions
* https://jpountz.github.io/lz4-java/1.2.0/xxhash-benchmark/
How does it handle small files?
I have had pretty good luck with that one. I used to use 'duplicate commander' but I am not sure that one is out there anymore.
https://man7.org/linux/man-pages/man1/find.1.html (-mindepth -maxdepth can also be added to make it stricter)
https://github.com/sharkdp/fd
The author says that they tested it on 172K+ files and it's safe, but I still wouldn't trust it enough to delete files from my filesystem.