We've just open-sourced our library imagededup, a Python package that simplifies the task of finding exact and near duplicates in an image collection.
It includes several hashing algorithms (PHash, DHash, etc.) and convolutional neural networks, an evaluation framework to judge the quality of deduplication, easy plotting of duplicates, and a simple API.
We're really excited about this library because finding duplicate images is an important task in computer vision and machine learning. For example, severe duplication can create extreme biases in the evaluation of your ML model (check out the known duplicate problem in CIFAR-10). Please try out our library, star it on GitHub, and spread the word! We'd love to get feedback.
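For intuition, the hashing methods mentioned above all reduce an image to a short fingerprint computed from a tiny downscaled grayscale version, then compare fingerprints by Hamming distance. Here is a simplified dHash-style sketch; this is illustrative only, not the library's actual implementation (a real version would first resize the image with PIL):

```python
def dhash_bits(gray):
    """Difference hash: one bit per pixel pair, 1 where a pixel is darker
    than its right-hand neighbour. `gray` is a 2D list of grayscale values,
    in practice an image already downscaled to e.g. 8x9."""
    bits = 0
    for row in gray:
        for x in range(len(row) - 1):
            bits = (bits << 1) | (1 if row[x] < row[x + 1] else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

# Two tiny made-up "images": near-duplicates get a small Hamming distance.
h1 = dhash_bits([[1, 2, 3], [3, 2, 1]])
h2 = dhash_bits([[1, 2, 3], [1, 2, 3]])
print(hamming(h1, h2))  # -> 2
```

A small distance (below some threshold) flags the pair as near-duplicates; exact duplicates give distance 0.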
This looks really interesting. Can you give us an idea of the performance? E.g. roughly how long would it take to process 1 million 1920×1080 JPEGs, without GPU and with GPU?
What is the scaling like? E.g. what if it was 10 million?
Thanks! This is a very important problem space for us, as we do document processing, and double/triple uploads are hard to detect and create unnecessary effort.
Can you elaborate a little bit on the performance you've observed? Can it work iteratively, basically asking one image at a time "have we already seen this?"
Would it work for text documents of varying quality as well, or is that infeasible?
Did you consider using the PhotoDNA hash algo for finding duplicates? If you’ve heard of it and ruled it out, love to know why. While designed for a very different (and dark) purpose, seems like it might do well for the task.
Very nice. If I may ask for clarification on the CNN method (which seems to work quite well), you're taking a CNN that's built for classification but removing the last layer (which I believe is fully connected?) so that you only take the "internal features", is that correct? I would expect some kind of autoencoder would fit here, but very interesting that this works.
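That is the usual recipe: chop off the classification head, treat the remaining activations as an embedding, and compare embeddings with cosine similarity. A toy sketch of the comparison step (the embedding vectors here are made up; in practice they would come from a pretrained network):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Stand-ins for penultimate-layer CNN activations of two images.
emb1 = [0.9, 0.1, 0.4]
emb2 = [0.8, 0.2, 0.5]
print(cosine_similarity(emb1, emb2))  # -> ~0.98, i.e. likely near-duplicates
```

Because the features were trained for classification, near-duplicate images land close together in the embedding space even without an autoencoder-style reconstruction objective.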
This looks nice, and seems to support the most common methods for fingerprinting/hashing. It comes with some heavy dependencies, though (which is reasonable). As it will be consumed by larger applications that may have dependencies that conflict with these, the version pins should be much more liberal: https://github.com/idealo/imagededup/pull/36
For a split second I was excited and terrified because I thought this was a very similarly named Go project I wrote a while ago. It’s nowhere near as fancy but is very fast and has a very similar name.
I don’t have the background in imaging these people likely have, but mine works by breaking an image into an X-by-X map of average colors and comparing those maps. I wrote it specifically because I needed to find similar images of different aspect ratios, and at the time I couldn’t find anything: https://github.com/donatj/imgdedup
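That average-color-grid idea can be sketched in a few lines. This is a rough illustration of the approach described, not the imgdedup code itself; note it sidesteps aspect ratio because the output grid is always n×n (assumes dimensions divisible by n for simplicity):

```python
def average_map(pixels, n=4):
    """Downscale a grayscale image (2D list) to an n x n grid of block averages."""
    h, w = len(pixels), len(pixels[0])
    grid = []
    for by in range(n):
        row = []
        for bx in range(n):
            ys = range(by * h // n, (by + 1) * h // n)
            xs = range(bx * w // n, (bx + 1) * w // n)
            block = [pixels[y][x] for y in ys for x in xs]
            row.append(sum(block) / len(block))
        grid.append(row)
    return grid

def map_distance(g1, g2):
    """Mean absolute difference between two average maps; 0 means identical."""
    flat = [(a, b) for r1, r2 in zip(g1, g2) for a, b in zip(r1, r2)]
    return sum(abs(a - b) for a, b in flat) / len(flat)

img = [[10, 10, 20, 20], [10, 10, 20, 20], [30, 30, 40, 40], [30, 30, 40, 40]]
print(average_map(img, n=2))  # -> [[10.0, 20.0], [30.0, 40.0]]
```

Two images of different sizes reduce to comparable n×n grids, and a small `map_distance` flags them as similar.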
This came a bit late. I recently decided I had to sort all my photos, which I usually just dump into one big photos folder. Since I use my phone's camera a lot and also get a lot of media through WhatsApp, the collection was getting a bit big.
I made a script to calculate the hash of every file; if it found a double, it would move it to a separate duplicates folder. This worked reasonably well, but I couldn't stop thinking there should already be more than one ready-made solution for this.
Quicker than hashing the file, you might want to compare EXIF data (extract it with exiftool). I have been comparing the image date/time (to the second) as tagged by the camera, and when I find a duplicate, I keep the one with the largest image size. I've not worked out how to deal with files without EXIF tags. I understand Shotwell hashes the thumbnail to find dupes. The security-camera software Motion has some image comparison to determine whether the camera image has changed since the last one; I think it was visual in nature rather than hash-based, since webcams are "noisy".
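The timestamp approach boils down to grouping by capture second and keeping the biggest file. A toy sketch with made-up records (filenames, times, and sizes are invented; a real pipeline would populate them from exiftool output):

```python
# Hypothetical records as exiftool might report them:
# (filename, EXIF capture time, file size in bytes).
photos = [
    ("a.jpg", "2019:10:03 14:22:05", 4_100_000),
    ("b.jpg", "2019:10:03 14:22:05", 2_300_000),  # smaller re-save of a.jpg
    ("c.jpg", "2019:10:03 15:01:40", 3_000_000),
]

keep = {}
for name, taken, size in photos:
    # Same capture second -> treat as duplicates; keep the largest file.
    if taken not in keep or size > keep[taken][1]:
        keep[taken] = (name, size)

print(sorted(name for name, _ in keep.values()))  # -> ['a.jpg', 'c.jpg']
```

Anything not in `keep` would be moved aside as a duplicate; files lacking EXIF tags would need a fallback (e.g. content hashing) that this sketch does not cover.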
Here's a broad and perhaps a bit naive question on this:
Reddit, Imgur, and any other site that takes in significant amounts of images from a significant number of users... do they attempt to do this? To de-dupe images and instead create virtual links?
At face value it'd seem like a crazy amount of physical disk space savings, but maybe the processing overhead is too expensive?
I once built an image comparison feature into some webpage that had uploads. What I did was scale down all images (for comparison only) to something like 100x100 and I think I made them black and white, but I am not sure about that last detail. I'd then XOR one thumbnail with another to compare their level of similarity. I didn't come up with this myself, I put together a few pieces of information from around the web... as with about 100% of things I build ;).
Not perfect, but it worked pretty well for images that were exactly the same. Of course it isn't as advanced as Imagededup.
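For reference, that XOR comparison amounts to a normalized Hamming distance between two 1-bit thumbnails. A toy sketch with hand-made "thumbnails" flattened to bit lists (in the real feature these would come from downscaled, binarized uploads):

```python
def xor_distance(bits1, bits2):
    """Fraction of differing pixels between two equal-size 1-bit thumbnails.
    0.0 means identical, 1.0 means every pixel differs."""
    assert len(bits1) == len(bits2)
    return sum(a ^ b for a, b in zip(bits1, bits2)) / len(bits1)

thumb_a = [0, 1, 1, 0, 1, 0, 0, 1]
thumb_b = [0, 1, 1, 0, 1, 0, 1, 1]  # one pixel differs
print(xor_distance(thumb_a, thumb_b))  # -> 0.125
```

A threshold on this fraction (e.g. below 0.1) would decide whether two uploads count as the same image.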
People do deduplicate files to save space, but it's usually based on an exact byte match using MD5 or SHA-256. Some don't, due to privacy issues: https://news.ycombinator.com/item?id=2438181 (e.g., the MPAA could upload all their torrented movies and see which ones uploaded instantly, to prove that your system has their copyrighted files)
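The exact-match scheme is content-addressed storage: hash the bytes, store each blob once, and point every upload at its hash, which is exactly the "virtual link" idea. A toy sketch (the byte strings are placeholders, not real JPEG data):

```python
import hashlib

def content_key(data: bytes) -> str:
    """Exact-byte fingerprint of a file's contents."""
    return hashlib.sha256(data).hexdigest()

store = {}  # hash -> canonical blob, stored once
links = {}  # upload name -> hash (the "virtual link")

for name, blob in [("u1.jpg", b"same bytes"),
                   ("u2.jpg", b"same bytes"),
                   ("u3.jpg", b"other bytes")]:
    key = content_key(blob)
    store.setdefault(key, blob)  # an identical re-upload stores nothing new
    links[name] = key

print(len(store))  # -> 2 blobs on disk for 3 uploads
```

Note the privacy leak mentioned above: an uploader can tell whether the blob already existed by whether the upload "completes instantly".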
There's no way to make the UX work out for images that are only similar. Would be pretty wild to upload a picture of myself just to see a picture of my twin used instead.
But I do wonder if it's possible to deduplicate different resolutions of an image that only differ in upscaling/downscaling algorithm and compression level used (thereby solving the jpeg erosion problem: https://xkcd.com/1683/)
I wonder how much of it could be adapted to finding duplicate documents, e.g. homeworks, CVs, etc. Presumably, the hashing would have to be adapted slightly. But how much?
This is really nice. I have a semi-abandoned project that uses PHash to canvass my photo library and flag duplicates, and am quite likely to use this instead.
Now if only Apple hadn’t repeatedly broken PyObjC over the years...
As I pointed out in other comments, the current implementation does not focus on the scale problem. However, using the 'scores' attribute of the 'find_duplicates' function, one could obtain the Hamming distance/cosine similarity and then use that to sort. For more, please refer to the docs.
The API does not seem to support this, but it should be easy to hack this (return not only a list of duplicates but the actual distance to the target image along with it, then sort results by distance).
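Something like this, assuming the scores output maps each file to (duplicate, distance) pairs; the exact shape is an assumption modeled loosely on the docs, and the filenames are invented:

```python
# Hypothetical output shaped like find_duplicates(..., scores=True):
# filename -> list of (duplicate filename, Hamming distance) pairs.
duplicates = {
    "query.jpg": [("far.jpg", 9), ("near.jpg", 2), ("mid.jpg", 5)],
}

# Rank candidates for one image from most to least similar.
ranked = sorted(duplicates["query.jpg"], key=lambda pair: pair[1])
print([name for name, _ in ranked])  # -> ['near.jpg', 'mid.jpg', 'far.jpg']
```

For the "have we already seen this?" use case, you would keep the index of hashes around and rank each incoming image's matches this way, accepting the top hit if its distance is under a threshold.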
did you evaluate the 'imagehash' [1] library prior to working on this-- any limitations/concerns? the additional CNN seems to be the difference between the two libraries
Yes, before developing the package, we were also using this great library for hash generation. There are a bunch of differences we have compared to imagehash:
1. Added CNN as you mentioned
2. Took care of housekeeping functions like efficient retrieval (using a BK-tree, also parallelized)
3. Added plotting abilities for visualizing duplicates
4. Added the possibility to evaluate the deduplication algorithm, so the user can judge deduplication performance on a custom dataset (with classification and information-retrieval metrics)
5. Allowed the possibility to change thresholds to better capture the idea of 'duplicate' for specific use cases
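On point 2, a BK-tree is what makes retrieval efficient: it uses the triangle inequality to prune candidates, so a query hash is not compared against every stored hash. A minimal illustrative sketch, not the library's actual implementation:

```python
class BKTree:
    """Minimal BK-tree over an integer metric (here: Hamming distance)."""

    def __init__(self, dist):
        self.dist = dist
        self.root = None  # node = (item, {distance: child node})

    def add(self, item):
        if self.root is None:
            self.root = (item, {})
            return
        node = self.root
        while True:
            d = self.dist(item, node[0])
            if d in node[1]:
                node = node[1][d]
            else:
                node[1][d] = (item, {})
                return

    def query(self, item, radius):
        """All stored items within `radius`, pruning via the triangle inequality."""
        out, stack = [], [self.root] if self.root else []
        while stack:
            value, children = stack.pop()
            d = self.dist(item, value)
            if d <= radius:
                out.append((value, d))
            for child_d, child in children.items():
                if d - radius <= child_d <= d + radius:
                    stack.append(child)
        return out

hamming = lambda a, b: bin(a ^ b).count("1")
tree = BKTree(hamming)
for h in [0b0000, 0b0001, 0b1111, 0b0111]:
    tree.add(h)
print(sorted(tree.query(0b0011, 1)))  # -> [(1, 1), (7, 1)]
```

With N stored hashes and a small radius, each query touches only a fraction of the tree instead of all N entries, which is what makes threshold tuning (point 5) cheap to re-run.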
https://stackoverflow.com/questions/4196453/simple-and-fast-...
There are some interesting discussions. (Nowadays, such a question would have been closed...)
Legacy style: https://66.media.tumblr.com/tumblr_m61cvzNYF81qg0jdoo1_640.g...
New style: https://66.media.tumblr.com/76451d8fee12cd3c5971e20bb8e236e3...
[1] https://github.com/JohannesBuchner/imagehash