
Go Find Duplicates: A fast and simple tool to find duplicate files

88 points | ingve | 4 years ago | github.com

53 comments

[+] idoubtit|4 years ago|reply
As far as I know, the standard tool for this is rdfind. This new tool claims to be "blazingly fast", so it should provide something to back that up: ideally a comparison with rdfind, but even a basic benchmark would make it less dubious. https://github.com/pauldreik/rdfind

But the main problem is not the suspicious performance, it's the lack of explanation. The tool is supposed to "find duplicate files (photos, videos, music, documents)". Does it mean it is restricted to some file types? Does it find identical photos with different metadata to be duplicates? Compare this with rdfind which clearly describes what it does, provides a summary of its algorithm, and even mentions alternatives.

Overall, it may be a fine toy/hobby project (only 3 commits, 3 months ago); I didn't read the code (except to find the command-line options). I don't get why it got so much attention.

[+] justinsaccount|4 years ago|reply
Yeah, this tool does not appear to be very good, especially compared to established alternatives.

It initially groups files that have the "same extension and same size", so you're out of luck if you have two copies named foo.jpg and foo.jpeg.

Then, it cheats by computing a crc32 (!) of the beginning, middle, and end bytes of the file and groups together files that have the same crc32.

So, it'll mostly work, but miss a lot of duplicates, and potentially flag different files as duplicates.

[+] diskzero|4 years ago|reply
I think we need a lookup table of marketing speech to real-world performance metrics. Blazingly fast has been showing up a lot lately.

The cynical side of me wants to know what features and safety checks a "blazingly fast" tool has not implemented that the older "glacially slow" tool it is replacing ended up implementing after all the edge conditions were uncovered.

[+] nieve|4 years ago|reply
Do you know of any tool that does a good job of finding files that differ only in their metadata or even better can use a perceptual hash to find possible matches? Geeqie's find duplicates seems to do the latter, but afaict you can't run that function from the command line.
[+] ColinWright|4 years ago|reply
Over the years I've used many, many tools intended to solve this problem. In the end, after much frustration, I just use existing tools, glued together in a un*x manner.

  # hash every file, keep a full index, and collect hashes that occur more than once
  find * -type f -exec md5sum '{}' ';' \
  | tee /tmp/index_file.txt            \
  | gawk '{print $1}'                  \
  | sort | uniq -c                     \
  | gawk '$1 > 1 { print $2 }'         \
  > /tmp/duplicates.txt

  # list the files behind each duplicated hash
  for m in $( cat /tmp/duplicates.txt )
  do
    grep "$m" /tmp/index_file.txt
    echo ========
  done \
  | less
Tweak as necessary. I do have a comparison executable that only compares sizes and sub-portions to save time, but I generally find it's not worth it.

It takes less time to type this than it does to remember what some random other tool is called, or how to use it. I have also saved a variant that identifies similar files, and another that identifies directory structures with lots of shared files, but those are (understandably) more complex (and fragile).

[+] pkolaczk|4 years ago|reply
Most programs I tested have very simple basic usage: just the program name and a list of directories. I doubt typing the above would be faster, and figuring it out for the first time definitely wouldn't be. Also, executing that on a million files would take ages, even compared to the slowest proper duplicate finders.

Anyway, thanks for sharing - it is always very exciting to see how far you can go with a few unix utilities and a bit of scripting :)

[+] mikst|4 years ago|reply
You need to check for empty files

find ... \! -empty ...

They all have the same hash, but they don't need to be treated as duplicates.

[+] pdimitar|4 years ago|reply
Can you share your other scripts? They sound exactly like what I need lately!
[+] andmarios|4 years ago|reply
A shameless plug, but it is a simple (and probably badly written) tool I made many years ago to scratch an itch, and I still use it.

It finds duplicate files and replaces them with hard links, saving you space. Just make sure you provide it with paths in the same filesystem.

I originally wrote it to save some space from personal files (videos, photos, etc), but it turned out very useful for tar files, docker images, websites, and more. For example, I maintain a tar file and a docker image with Kafka connectors which share many jar files. Using duphard I can save hundreds of megabytes, or even more than a gigabyte! For a documentation website with many copies of the same image (let's just say some static generators favor this practice for maintaining multiple versions), I can reduce the website size by 60%+, which then makes ssh copies, docker pulls, etc. way faster, speeding up deployment times.

https://github.com/andmarios/duphard
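
For anyone curious what the hard-link trick boils down to, here is a minimal Go sketch (not duphard's actual code; the paths in main are made up):

  package main

  import (
      "fmt"
      "os"
  )

  // replaceWithHardLink swaps dup out for a hard link to orig.
  // Both paths must be on the same filesystem, since hard links
  // cannot cross filesystem boundaries.
  func replaceWithHardLink(orig, dup string) error {
      if err := os.Remove(dup); err != nil {
          return err
      }
      if err := os.Link(orig, dup); err != nil {
          return fmt.Errorf("removed %s but could not link it to %s: %w", dup, orig, err)
      }
      return nil
  }

  func main() {
      // Hypothetical paths: two identical jar files in an unpacked image.
      if err := replaceWithHardLink("connectors/a/common.jar", "connectors/b/common.jar"); err != nil {
          fmt.Fprintln(os.Stderr, err)
          os.Exit(1)
      }
  }

A more careful tool would link to a temporary name first and rename it over the duplicate, so the duplicate never vanishes if the link step fails.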

[+] pkolaczk|4 years ago|reply
A shameless plug: fclones does that as well, including support for symlinks.
[+] coryrc|4 years ago|reply
Fdupes does that too
[+] pkolaczk|4 years ago|reply
This program uses CRC32 to compute hashes. This is a terrible idea: a 32-bit hash is just too short and the probability of collisions is way too high. A few tens of thousands of files are enough to reach a roughly 50% probability of a collision. Even though this is reduced by the additional matching on extensions and sizes, I wouldn't trust it to delete any files.

Use fclones, fslint, jdupes, rdfind instead, which either use much stronger hashes (128-bit) or even verify files by direct byte-to-byte comparison.
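
For a rough sense of the numbers, the standard birthday approximation can be checked in a few lines of Go (back-of-the-envelope math only, nothing from the tool itself):

  package main

  import (
      "fmt"
      "math"
  )

  func main() {
      // Birthday approximation: P(collision) = 1 - exp(-n(n-1) / (2 * 2^32))
      // for n files hashed into a 32-bit space.
      const space = float64(1 << 32)
      for _, n := range []float64{5_000, 50_000, 77_000, 172_000} {
          p := 1 - math.Exp(-n*(n-1)/(2*space))
          fmt.Printf("%7.0f files -> collision probability %.1f%%\n", n, p*100)
      }
  }

Roughly 77,000 random inputs already give even odds of a 32-bit collision, and at 172K inputs a collision is nearly certain; the pre-grouping by extension and size shrinks the effective group sizes, so the practical risk is lower than the raw numbers suggest.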

[+] matzf|4 years ago|reply
Ah yes, that's exactly what CRC32 is supposed to be used for. And it's even quicker if you don't compute it over the whole file, brilliant!
[+] mimentum|4 years ago|reply
I've been using 'czkawka' since its earliest inception. It seems to do a similar task using file hashes, but it can also search for matching pictures and the like. https://github.com/qarmin/czkawka
[+] m-manu|4 years ago|reply
As the author of the tool, thanks a lot for the wonderful input! Many comments are actionable; I'll incorporate them into the code soon.

Now to address a few concerns:

# The tool doesn't delete anything -- As the name suggests, it just finds duplicates. Check it out.

# File uniqueness is determined by file extension + file size + CRC32 of the first 4KiB, middle 2KiB and last 2KiB (a rough sketch of this is below)

# The above may not seem like much, but on my portable hard drive with >172K files (a mix of video, audio, pics and source code), I got the same number of collisions as "SHA-256 of the entire file" (by the way, I'm planning to add an option to the tool to do this)
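
For readers wondering what that fingerprint might look like, here is a rough Go sketch of the scheme as described above (not the tool's actual code; the 16 KiB small-file cutoff and the error handling are my own guesses):

  package main

  import (
      "fmt"
      "hash/crc32"
      "io"
      "os"
  )

  // fuzzyHash fingerprints a file with a CRC32 of its first 4 KiB, middle 2 KiB
  // and last 2 KiB, or of the whole file when it is small. The 16 KiB cutoff is
  // my own arbitrary choice, not the tool's thresholdFileSize.
  func fuzzyHash(path string) (uint32, error) {
      f, err := os.Open(path)
      if err != nil {
          return 0, err
      }
      defer f.Close()

      info, err := f.Stat()
      if err != nil {
          return 0, err
      }
      size := info.Size()

      h := crc32.NewIEEE()
      if size <= 16*1024 {
          _, err := io.Copy(h, f) // small file: hash everything
          return h.Sum32(), err
      }
      segments := []struct{ off, n int64 }{
          {0, 4 * 1024},             // first 4 KiB
          {size/2 - 1024, 2 * 1024}, // middle 2 KiB
          {size - 2*1024, 2 * 1024}, // last 2 KiB
      }
      for _, s := range segments {
          if _, err := io.Copy(h, io.NewSectionReader(f, s.off, s.n)); err != nil {
              return 0, err
          }
      }
      return h.Sum32(), nil
  }

  func main() {
      sum, err := fuzzyHash(os.Args[1])
      if err != nil {
          fmt.Fprintln(os.Stderr, err)
          os.Exit(1)
      }
      fmt.Printf("%08x\n", sum)
  }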

[+] smusamashah|4 years ago|reply
This sounds interesting and should probably be able to run on the whole system. What if you run it on the files of the OS itself, e.g. the whole C drive or wherever the Linux system files are? Will there be any collisions?

How does it handle small files?

[+] m-manu|4 years ago|reply
Changes have been incorporated. FYI.
[+] karteum|4 years ago|reply
FWIW, if people are interested, I wrote https://github.com/karteum/kindfs for the purpose of indexing the hard drive, with the following goals:

* being able to detect not only duplicate files but also duplicate dirs (without returning all their sub-contents as duplicates)

* being able to query multiple times without having to re-scan, and to do other types of queries (i.e. I am computing a hash on all files, not only on those with duplicate sizes. This makes scanning slower but enables other use-cases. N.B. I only hash fixed portions of files larger than 3 MB, which is enough for my use-case considering that I always triple-check the results, and is a reasonable tradeoff for performance, but it might not be OK for everyone!)

* being able to tell whether all files in dir1/ are included in dir2/ (regardless of file/dir structure)

* being able to mount the sqlite index as a FUSE FS (which is convenient for e.g. diff -r or qdirstat...)

Still work-in-progress, yet it works for several of my use-cases

[+] scns|4 years ago|reply
<irony> RESF checking in </irony>

The first one I found, and still use since it became obvious that fslint is EOL, is czkawka [0] (meaning hiccup in Polish). Its speed is an order of magnitude higher than fslint's, and its memory use is 20%-75% of fslint's.

<;)> Satisfied customer, would buy it again. </;)>

[0] https://github.com/qarmin/czkawka

[+] Bostonian|4 years ago|reply
On Windows, if you download the same file more than once, you will have foo.doc, "foo (1).doc", "foo (2).doc", etc. A script that just looked for files with such names, compared them to foo.doc, and deleted them if they are the same would be useful.
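
A rough Go sketch of that idea (my own interpretation; it only reports matches instead of deleting them, and it reads whole files into memory, which is fine for documents but not for huge downloads):

  package main

  import (
      "bytes"
      "fmt"
      "os"
      "path/filepath"
      "regexp"
  )

  // Matches names like "foo (1).doc", capturing "foo" and ".doc".
  var copyPattern = regexp.MustCompile(`^(.+) \(\d+\)(\.[^.]+)?$`)

  func main() {
      dir := os.Args[1] // e.g. the Downloads folder

      entries, err := os.ReadDir(dir)
      if err != nil {
          fmt.Fprintln(os.Stderr, err)
          os.Exit(1)
      }
      for _, e := range entries {
          if e.IsDir() {
              continue
          }
          m := copyPattern.FindStringSubmatch(e.Name())
          if m == nil {
              continue
          }
          orig := filepath.Join(dir, m[1]+m[2]) // "foo.doc"
          dup := filepath.Join(dir, e.Name())   // "foo (1).doc"

          a, errA := os.ReadFile(orig)
          b, errB := os.ReadFile(dup)
          if errA != nil || errB != nil {
              continue // original missing or unreadable; leave the copy alone
          }
          if bytes.Equal(a, b) {
              fmt.Printf("duplicate: %q is identical to %q\n", dup, orig)
          }
      }
  }
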
[+] HumblyTossed|4 years ago|reply
Does it only find duplicate files or will it also find duplicate directory hierarchies?

Example:

/some/location/one/January/Photos

/some/location/two/January/Photos

I need a tool that would return a match on January directory.

It would be great to be able to filter things. So for example, if I have backups of my dev folder, I want to filter out all the virtual envs (venv below):

/home/HumblyTossed/dev/venv/bin

/home/HumblyTossed/backups/dev/venv/bin

[+] chalcolithic|4 years ago|reply
If you downvote HumblyTossed's comment please explain why.
[+] yandrypozo|4 years ago|reply
I'm curious, why did the author choose to read only three sections of each file? Is it related to how CRC32 works?
[+] selcuka|4 years ago|reply
It is to save time by not reading the whole file for files larger than `thresholdFileSize`. The code calls it a fuzzy hash.

The author says that they tested it on 172K+ files and it's safe, but I still wouldn't trust it enough to delete files from my filesystem.

[+] unnouinceput|4 years ago|reply
No, CRC32 doesn't care what you throw at it. It is related to the speed of building a "database", for lack of a better word, of the file. Instead of running CRC32 over the entire file, you just take chunks of it, increasing the speed. However, this approach is definitely flawed, as there are plenty of file types whose beginning and end are identical, so only the reading/CRC32 of the middle section might actually be useful. And CRC32 has a smaller output space, so collisions are more likely to happen.

A better approach might be, for same-size files, to just Seek(FileSize div 2) and read 32 bytes from there. If those are identical to another file's, start a full file comparison until one byte diverges, then stop. If multiple files have the same middle bytes, then maybe do a full SHA256 for each file and compare those.
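
A Go sketch of that probe-then-compare idea (my own rough interpretation; it assumes the two files are already known to have the same size and leaves out the SHA256 step for larger groups):

  package main

  import (
      "bytes"
      "fmt"
      "io"
      "os"
  )

  // sameMiddle reads 32 bytes from the midpoint of two same-size files and
  // reports whether they match -- a cheap pre-filter before a full comparison.
  func sameMiddle(pathA, pathB string, size int64) (bool, error) {
      probe := func(path string) ([]byte, error) {
          f, err := os.Open(path)
          if err != nil {
              return nil, err
          }
          defer f.Close()
          buf := make([]byte, 32)
          if _, err := f.ReadAt(buf, size/2); err != nil && err != io.EOF {
              return nil, err
          }
          return buf, nil
      }
      a, err := probe(pathA)
      if err != nil {
          return false, err
      }
      b, err := probe(pathB)
      if err != nil {
          return false, err
      }
      return bytes.Equal(a, b), nil
  }

  // sameContent is the full byte-by-byte comparison, stopping at the first difference.
  func sameContent(pathA, pathB string) (bool, error) {
      fa, err := os.Open(pathA)
      if err != nil {
          return false, err
      }
      defer fa.Close()
      fb, err := os.Open(pathB)
      if err != nil {
          return false, err
      }
      defer fb.Close()

      bufA := make([]byte, 64*1024)
      bufB := make([]byte, 64*1024)
      for {
          na, errA := io.ReadFull(fa, bufA)
          nb, errB := io.ReadFull(fb, bufB)
          if na != nb || !bytes.Equal(bufA[:na], bufB[:nb]) {
              return false, nil
          }
          if errA == io.EOF || errA == io.ErrUnexpectedEOF {
              // End of file A; the files match if B ended here as well.
              return errB == io.EOF || errB == io.ErrUnexpectedEOF, nil
          }
          if errA != nil {
              return false, errA
          }
          if errB != nil {
              return false, errB
          }
      }
  }

  func main() {
      a, b := os.Args[1], os.Args[2] // assumed to be two same-size files
      info, err := os.Stat(a)
      if err != nil {
          fmt.Fprintln(os.Stderr, err)
          os.Exit(1)
      }
      ok, err := sameMiddle(a, b, info.Size())
      if err == nil && ok {
          ok, err = sameContent(a, b)
      }
      fmt.Println(ok, err)
  }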

Also, as other commenters pointed out, you might have the same content but different metadata (videos, pictures, etc.), so that needs to be handled as well.