We've just open-sourced our library imagededup, a Python package that simplifies the task of finding exact and near duplicates in an image collection.
It includes several hashing algorithms (PHash, DHash, etc.) and convolutional neural networks, an evaluation framework to judge the quality of deduplication, easy plotting of duplicates, and a simple API.
We're really excited about this library because finding duplicate images is an important task in computer vision and machine learning. For example, severe duplication can create extreme biases in the evaluation of your ML model (check out the known duplicate problem in CIFAR-10). Please try out our library, star it on GitHub, and spread the word! We'd love to get feedback.
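For intuition, the hashing methods mentioned above all reduce an image to a short fingerprint computed from a tiny downscaled grayscale version, then compare fingerprints by Hamming distance. Here is a simplified dHash-style sketch; this is illustrative only, not the library's actual implementation (a real version would first resize the image with PIL):

```python
def dhash_bits(gray):
    """Difference hash: one bit per pixel pair, 1 where a pixel is darker
    than its right-hand neighbour. `gray` is a 2D list of grayscale values,
    in practice an image already downscaled to e.g. 8x9."""
    bits = 0
    for row in gray:
        for x in range(len(row) - 1):
            bits = (bits << 1) | (1 if row[x] < row[x + 1] else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

# Two tiny made-up "images": near-duplicates get a small Hamming distance.
h1 = dhash_bits([[1, 2, 3], [3, 2, 1]])
h2 = dhash_bits([[1, 2, 3], [1, 2, 3]])
print(hamming(h1, h2))  # -> 2
```

A small distance (below some threshold) flags the pair as near-duplicates; exact duplicates give distance 0.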
This looks really interesting. Can you give us an idea of the performance? E.g. roughly how long would it take to process 1 million 1920×1080 JPEGs, without GPU and with GPU?
What is the scaling like? E.g. what if it was 10 million?
Thanks! This is a very important problem space for us, as we do document processing, and double/triple uploads are hard to detect and create unnecessary effort.
Can you elaborate a little bit on the performance you've observed? Can it work iteratively, basically asking one image at a time "have we already seen this?"
Would it work for text documents of varying quality as well, or is that infeasible?
Did you consider using the PhotoDNA hash algo for finding duplicates? If you’ve heard of it and ruled it out, love to know why. While designed for a very different (and dark) purpose, seems like it might do well for the task.
Very nice. If I may ask for clarification on the CNN method (which seems to work quite well), you're taking a CNN that's built for classification but removing the last layer (which I believe is fully connected?) so that you only take the "internal features", is that correct? I would expect some kind of autoencoder would fit here, but very interesting that this works.
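That is the usual recipe: chop off the classification head, treat the remaining activations as an embedding, and compare embeddings with cosine similarity. A toy sketch of the comparison step (the embedding vectors here are made up; in practice they would come from a pretrained network):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Stand-ins for penultimate-layer CNN activations of two images.
emb1 = [0.9, 0.1, 0.4]
emb2 = [0.8, 0.2, 0.5]
print(cosine_similarity(emb1, emb2))  # -> ~0.98, i.e. likely near-duplicates
```

Because the features were trained for classification, near-duplicate images land close together in the embedding space even without an autoencoder-style reconstruction objective.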
This looks nice, and seems to support the most common methods for fingerprinting/hashing. It comes with some heavy dependencies, though (which is reasonable). As it will be consumed by larger applications that may have dependencies that conflict with these, the version pins should be much more liberal: https://github.com/idealo/imagededup/pull/36
For a split second I was excited and terrified because I thought this was a very similarly named Go project I wrote a while ago. It’s nowhere near as fancy but is very fast and has a very similar name.
I don’t have the background in imaging these people likely have, but mine works by breaking an image into an X-by-X map of average colors and comparing those maps. I wrote it specifically because I needed to find similar images of different aspect ratios, and at the time I couldn’t find anything: https://github.com/donatj/imgdedup
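That average-color-grid idea can be sketched in a few lines. This is a rough illustration of the approach described, not the imgdedup code itself; note it sidesteps aspect ratio because the output grid is always n×n (assumes dimensions divisible by n for simplicity):

```python
def average_map(pixels, n=4):
    """Downscale a grayscale image (2D list) to an n x n grid of block averages."""
    h, w = len(pixels), len(pixels[0])
    grid = []
    for by in range(n):
        row = []
        for bx in range(n):
            ys = range(by * h // n, (by + 1) * h // n)
            xs = range(bx * w // n, (bx + 1) * w // n)
            block = [pixels[y][x] for y in ys for x in xs]
            row.append(sum(block) / len(block))
        grid.append(row)
    return grid

def map_distance(g1, g2):
    """Mean absolute difference between two average maps; 0 means identical."""
    flat = [(a, b) for r1, r2 in zip(g1, g2) for a, b in zip(r1, r2)]
    return sum(abs(a - b) for a, b in flat) / len(flat)

img = [[10, 10, 20, 20], [10, 10, 20, 20], [30, 30, 40, 40], [30, 30, 40, 40]]
print(average_map(img, n=2))  # -> [[10.0, 20.0], [30.0, 40.0]]
```

Two images of different sizes reduce to comparable n×n grids, and a small `map_distance` flags them as similar.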
This came a bit late. I recently decided I had to sort all my photos, which I usually just dump into one big photos folder. Since I use my phone's camera a lot and also get a lot of media through WhatsApp, the collection was getting a bit big.
I made a script to calculate the hash of every file; if it found a double, it would move it to a separate duplicates folder. This worked reasonably well, but I couldn't stop thinking there should already be more than one ready-made solution for this.
Quicker than hashing the file, you might want to compare EXIF data (extract it with exiftool). I have been comparing the image date/time (to the second) as tagged by the camera, and when I find a duplicate, I keep the one with the largest image size. I've not worked out how to deal with files without EXIF tags. I understand Shotwell hashes the thumbnail to find dupes. The security-camera software Motion has some image comparison to determine whether the camera image has changed since the last one; I think it was visual in nature rather than hash-based, since webcams are "noisy".
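The timestamp approach boils down to grouping by capture second and keeping the biggest file. A toy sketch with made-up records (filenames, times, and sizes are invented; a real pipeline would populate them from exiftool output):

```python
# Hypothetical records as exiftool might report them:
# (filename, EXIF capture time, file size in bytes).
photos = [
    ("a.jpg", "2019:10:03 14:22:05", 4_100_000),
    ("b.jpg", "2019:10:03 14:22:05", 2_300_000),  # smaller re-save of a.jpg
    ("c.jpg", "2019:10:03 15:01:40", 3_000_000),
]

keep = {}
for name, taken, size in photos:
    # Same capture second -> treat as duplicates; keep the largest file.
    if taken not in keep or size > keep[taken][1]:
        keep[taken] = (name, size)

print(sorted(name for name, _ in keep.values()))  # -> ['a.jpg', 'c.jpg']
```

Anything not in `keep` would be moved aside as a duplicate; files lacking EXIF tags would need a fallback (e.g. content hashing) that this sketch does not cover.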
Here's a broad and perhaps a bit naive question on this:
Reddit, Imgur, and any other site that takes in significant amounts of images from a significant number of users... do they attempt to do this? To de-dupe images and instead create virtual links?
At face value it'd seem like a crazy amount of physical disk space savings, but maybe the processing overhead is too expensive?
I once built an image comparison feature into some webpage that had uploads. What I did was scale down all images (for comparison only) to something like 100x100 and I think I made them black and white, but I am not sure about that last detail. I'd then XOR one thumbnail with another to compare their level of similarity. I didn't come up with this myself, I put together a few pieces of information from around the web... as with about 100% of things I build ;).
Not perfect, but it worked pretty well for images that were exactly the same. Of course it isn't as advanced as Imagededup.
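For reference, that XOR comparison amounts to a normalized Hamming distance between two 1-bit thumbnails. A toy sketch with hand-made "thumbnails" flattened to bit lists (in the real feature these would come from downscaled, binarized uploads):

```python
def xor_distance(bits1, bits2):
    """Fraction of differing pixels between two equal-size 1-bit thumbnails.
    0.0 means identical, 1.0 means every pixel differs."""
    assert len(bits1) == len(bits2)
    return sum(a ^ b for a, b in zip(bits1, bits2)) / len(bits1)

thumb_a = [0, 1, 1, 0, 1, 0, 0, 1]
thumb_b = [0, 1, 1, 0, 1, 0, 1, 1]  # one pixel differs
print(xor_distance(thumb_a, thumb_b))  # -> 0.125
```

A threshold on this fraction (e.g. below 0.1) would decide whether two uploads count as the same image.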
People do deduplicate files to save space, but it's usually based on an exact byte match using MD5 or SHA-256. Some don't, due to privacy issues: https://news.ycombinator.com/item?id=2438181 (e.g., the MPAA could upload all their torrented movies and see which ones uploaded instantly, to prove that your system has their copyrighted files)
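The exact-match scheme is content-addressed storage: hash the bytes, store each blob once, and point every upload at its hash, which is exactly the "virtual link" idea. A toy sketch (the byte strings are placeholders, not real JPEG data):

```python
import hashlib

def content_key(data: bytes) -> str:
    """Exact-byte fingerprint of a file's contents."""
    return hashlib.sha256(data).hexdigest()

store = {}  # hash -> canonical blob, stored once
links = {}  # upload name -> hash (the "virtual link")

for name, blob in [("u1.jpg", b"same bytes"),
                   ("u2.jpg", b"same bytes"),
                   ("u3.jpg", b"other bytes")]:
    key = content_key(blob)
    store.setdefault(key, blob)  # an identical re-upload stores nothing new
    links[name] = key

print(len(store))  # -> 2 blobs on disk for 3 uploads
```

Note the privacy leak mentioned above: an uploader can tell whether the blob already existed by whether the upload "completes instantly".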
There's no way to make the UX work out for images that are only similar. Would be pretty wild to upload a picture of myself just to see a picture of my twin used instead.
But I do wonder if it's possible to deduplicate different resolutions of an image that only differ in upscaling/downscaling algorithm and compression level used (thereby solving the jpeg erosion problem: https://xkcd.com/1683/)
I wonder how much of it could be adapted to finding duplicate documents, e.g. homeworks, CVs, etc. Presumably, the hashing would have to be adapted slightly. But how much?
This is really nice. I have a semi-abandoned project that uses PHash to canvass my photo library and flag duplicates, and am quite likely to use this instead.
Now if only Apple hadn’t repeatedly broken PyObjC over the years...
As I pointed out in other comments, the current implementation does not focus on the scale problem. However, using the 'scores' attribute of the 'find_duplicates' function, one could obtain the Hamming distance/cosine similarity and then use that to sort. For more, please refer to the docs.
The API does not seem to support this, but it should be easy to hack this (return not only a list of duplicates but the actual distance to the target image along with it, then sort results by distance).
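Something like this, assuming the scores output maps each file to (duplicate, distance) pairs; the exact shape is an assumption modeled loosely on the docs, and the filenames are invented:

```python
# Hypothetical output shaped like find_duplicates(..., scores=True):
# filename -> list of (duplicate filename, Hamming distance) pairs.
duplicates = {
    "query.jpg": [("far.jpg", 9), ("near.jpg", 2), ("mid.jpg", 5)],
}

# Rank candidates for one image from most to least similar.
ranked = sorted(duplicates["query.jpg"], key=lambda pair: pair[1])
print([name for name, _ in ranked])  # -> ['near.jpg', 'mid.jpg', 'far.jpg']
```

For the "have we already seen this?" use case, you would keep the index of hashes around and rank each incoming image's matches this way, accepting the top hit if its distance is under a threshold.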
did you evaluate the 'imagehash' [1] library prior to working on this-- any limitations/concerns? the additional CNN seems to be the difference between the two libraries
Yes, before developing the package, we were also using this great library for hash generation. There are a bunch of differences we have compared to imagehash:
1. Added CNN as you mentioned
2. Took care of housekeeping functions like efficient retrieval (using a BK-tree, also parallelized)
3. Added plotting abilities for visualizing duplicates
4. Added the possibility to evaluate the deduplication algorithm, so the user can judge deduplication performance on a custom dataset (with classification and information-retrieval metrics)
5. Allowed the possibility to change thresholds to better capture the idea of 'duplicate' for specific use cases
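On point 2, a BK-tree is what makes retrieval efficient: it uses the triangle inequality to prune candidates, so a query hash is not compared against every stored hash. A minimal illustrative sketch, not the library's actual implementation:

```python
class BKTree:
    """Minimal BK-tree over an integer metric (here: Hamming distance)."""

    def __init__(self, dist):
        self.dist = dist
        self.root = None  # node = (item, {distance: child node})

    def add(self, item):
        if self.root is None:
            self.root = (item, {})
            return
        node = self.root
        while True:
            d = self.dist(item, node[0])
            if d in node[1]:
                node = node[1][d]
            else:
                node[1][d] = (item, {})
                return

    def query(self, item, radius):
        """All stored items within `radius`, pruning via the triangle inequality."""
        out, stack = [], [self.root] if self.root else []
        while stack:
            value, children = stack.pop()
            d = self.dist(item, value)
            if d <= radius:
                out.append((value, d))
            for child_d, child in children.items():
                if d - radius <= child_d <= d + radius:
                    stack.append(child)
        return out

hamming = lambda a, b: bin(a ^ b).count("1")
tree = BKTree(hamming)
for h in [0b0000, 0b0001, 0b1111, 0b0111]:
    tree.add(h)
print(sorted(tree.query(0b0011, 1)))  # -> [(1, 1), (7, 1)]
```

With N stored hashes and a small radius, each query touches only a fraction of the tree instead of all N entries, which is what makes threshold tuning (point 5) cheap to re-run.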
https://stackoverflow.com/questions/4196453/simple-and-fast-...
There are some interesting discussions. (Nowadays, such a question would have been closed...)
Legacy style: https://66.media.tumblr.com/tumblr_m61cvzNYF81qg0jdoo1_640.g...
New style: https://66.media.tumblr.com/76451d8fee12cd3c5971e20bb8e236e3...
[1] https://github.com/JohannesBuchner/imagehash