bittersweet | 10 years ago

I was actually looking into Tesseract yesterday, so this post coincides nicely.

I have a hobby project where I scrape Instagram photos, and I only want to end up with photos that have actual people in them. A lot of the images being posted are motivational texts and the like, which I want to filter out automatically.

So far I've built a binary that scans an image and spits out the dominant color percentage. If one color is over a certain percentage (a black background with white text, for example), I can be pretty sure it's not something I want to keep.
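A minimal sketch of that dominant-color check, not the binary described above. It assumes the pixels are already available as a sequence of (r, g, b) tuples from whatever image library you use. The function name, the `bucket` quantization step and the 0.6 threshold in the usage lines are illustrative assumptions.

```python
from collections import Counter


def dominant_color_share(pixels, bucket=32):
    """Return the fraction of pixels that fall into the most common color bucket.

    pixels: iterable of (r, g, b) tuples with 0-255 channels.
    bucket: channel quantization step (an assumption), so that near-identical
            shades caused by compression noise count as one color.
    """
    quantized = [tuple(channel // bucket for channel in px) for px in pixels]
    if not quantized:
        return 0.0
    _, top_count = Counter(quantized).most_common(1)[0]
    return top_count / len(quantized)


# Usage: a mostly black image with some white text pixels.
sample = [(0, 0, 0)] * 7 + [(7, 7, 7)] * 1 + [(255, 255, 255)] * 2
share = dominant_color_share(sample)
looks_like_text_card = share > 0.6  # threshold is a guess; tune it on real data
```

Quantizing before counting is a design choice: without it, a compressed "solid" background splits into many nearly identical colors and the share looks artificially low.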

I've also tried OpenCV face detection, but I got a lot of false positives, with faces being detected in text and random-looking objects. I've tried four of the haarcascades, all with different, but not 'perfect', end results.

OCR was the next step I wanted to check out; maybe I can combine all the steps to get something nice. I was getting weird text back from images with no text in them, so the pre-processing hints in this thread are gold and I can't wait to try them.

This thread is giving me so many ideas and actual names of algorithms to check out, I love it. But I would really appreciate it if anyone else has more thoughts about how to filter out images that do not contain people :-)

kefka|10 years ago

My solution with regards to bad facial detection in OpenCV is to do the following:

1. Use an LBP cascade on the picture. This is lower quality, with more false positives, but it uses integer math, so it is fast. It's named lbpcascade_frontalface.xml

2. Capture the regions of interest that the LBP cascade identifies as faces and put them into a vector<Mat>. This means you can capture a (potentially) arbitrary number of faces. Of course, with OpenCV you are limited to a minimum face size of 28px by 28px.

3. Run the Haar cascade for eye detection on the ROIs you saved in the vector. The ones that return eyes are good matches. Haar cascades are slower (because they use floats), but the reduction in pixels means it's relatively fast. It's named haarcascade_eye_tree_eyeglasses.xml

I can maintain 20fps with this setup at 800x600 on a slow computer.
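The three steps above could be sketched in Python roughly as follows. This is not the original C++ code: the function names are hypothetical, the cascade file paths are whatever they are on your system, and `detectMultiScale` is called with its default parameters, which the comment above does not specify. `cv2` is only imported when the cascades are loaded.

```python
def confirmed_faces(gray, face_cascade, eye_cascade):
    """Two-stage check: cheap face pass first, then eye detection per candidate.

    gray: grayscale image as a 2-D array that supports [y0:y1, x0:x1] slicing.
    face_cascade, eye_cascade: objects with a detectMultiScale(image) method,
    for example cv2.CascadeClassifier instances.
    Returns the (x, y, w, h) face rectangles in which at least one eye was found.
    """
    confirmed = []
    for (x, y, w, h) in face_cascade.detectMultiScale(gray):
        roi = gray[y:y + h, x:x + w]          # region of interest for this candidate
        eyes = eye_cascade.detectMultiScale(roi)
        if len(eyes) > 0:                     # eyes found, so treat it as a real face
            confirmed.append((int(x), int(y), int(w), int(h)))
    return confirmed


def load_cascades(lbp_face_path, haar_eye_path):
    """Load both cascades with OpenCV; the paths are supplied by the caller."""
    import cv2  # imported lazily so the filtering logic above has no hard dependency

    face_cascade = cv2.CascadeClassifier(lbp_face_path)
    eye_cascade = cv2.CascadeClassifier(haar_eye_path)
    if face_cascade.empty() or eye_cascade.empty():
        raise IOError("could not load one of the cascade files")
    return face_cascade, eye_cascade
```

For the use case in the parent comment, an image would be kept when `confirmed_faces(...)` returns a non-empty list.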

paulmd|10 years ago

For the motivational posters specifically, you might want to check out a Perceptual Hash type algorithm. Convert the image to a 64px square, low-depth (4-bit?) grayscale image and most of them should look more or less the same to PHash. Maybe you then classify them into clusters based on hash distance or something.
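To make the idea of hash distance concrete, here is a small sketch using an average hash, which is a simpler relative of pHash and not pHash itself. It assumes the image has already been downscaled to a small grayscale grid (a list of rows of 0-255 values); the function names are illustrative.

```python
def average_hash(gray_rows):
    """Build a bit string: 1 where a pixel is brighter than the image mean, else 0.

    gray_rows: list of rows of grayscale values from an already downscaled image.
    Returns the bits packed into a single int, first pixel in the highest bit.
    """
    flat = [value for row in gray_rows for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Number of bit positions in which the two hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


# Usage: two "dark top, bright bottom" images hash alike, the inverted one does not.
a = average_hash([[0, 0], [255, 255]])
b = average_hash([[10, 5], [250, 240]])
c = average_hash([[255, 255], [0, 0]])
near = hamming_distance(a, b)
far = hamming_distance(a, c)
```

Images whose hashes are within a small Hamming distance of each other could then be grouped into one cluster; the cut-off distance would have to be tuned on real data.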

I also like your dominant-color thing. If you have a couple approaches that each make sense, you can use them as an ensemble - the more classifiers that don't like something, the more likely it's junk.
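A rough sketch of that ensemble idea as simple vote counting. Everything here is hypothetical: each classifier is just a function that returns True when it thinks the image is junk, and the `min_votes` default is an arbitrary choice.

```python
def junk_votes(image, classifiers):
    """Count how many classifiers flag the image as junk."""
    return sum(1 for classify in classifiers if classify(image))


def is_junk(image, classifiers, min_votes=2):
    """Treat the image as junk once enough independent checks agree."""
    return junk_votes(image, classifiers) >= min_votes


# Usage with stand-in checks; real ones would be the dominant-color test,
# a hash-cluster lookup, an OCR hit, and so on.
checks = [
    lambda img: img["dominant_share"] > 0.6,
    lambda img: img["has_text"],
    lambda img: img["face_count"] == 0,
]
poster = {"dominant_share": 0.8, "has_text": True, "face_count": 0}
portrait = {"dominant_share": 0.2, "has_text": False, "face_count": 1}
```

Here `is_junk(poster, checks)` is true and `is_junk(portrait, checks)` is false. Raising `min_votes` trades missed junk for fewer wrongly discarded photos.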

huskyr|10 years ago

I've done basically the same thing as you for a project I did for a newspaper, where I collected 9000 selfies (see http://vk.nl/selfies).

It's a lot of manual work, but using OpenCV saves you a lot of time. I can't share the code unfortunately, but what I did was this:

* Get all Instagram photos with the '#selfie' tag

* Run them all through the haarcascade_frontalface_alt2 OpenCV cascade. I used the values 1.3 and 5 for the detectMultiScale() method.

* Check that there's only one face in the image, and make sure it's larger than 20% of the width of the image.

Even after that I still needed to go through the images manually. I guess around 10% were still false positives.
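The filtering rule from the list above could look roughly like this. It is not the original code, which was not shared. It works on the rectangles that a detector returns; reading 1.3 and 5 as `scaleFactor` and `minNeighbors` is an assumption based on the argument order of OpenCV's `detectMultiScale`, and the function name is made up.

```python
def is_single_large_face(faces, image_width, min_width_ratio=0.2):
    """Keep an image only if it has exactly one face that is wide enough.

    faces: sequence of (x, y, w, h) rectangles, for example the result of
           cascade.detectMultiScale(gray, 1.3, 5)  # assumed: scaleFactor, minNeighbors
    image_width: width of the full image in pixels.
    min_width_ratio: the 20% rule from the comment above.
    """
    if len(faces) != 1:
        return False
    _, _, face_width, _ = faces[0]
    return face_width > min_width_ratio * image_width


# Usage on a 640px wide image.
keep = is_single_large_face([(200, 100, 180, 180)], image_width=640)
too_small = is_single_large_face([(10, 10, 60, 60)], image_width=640)
group_shot = is_single_large_face([(0, 0, 200, 200), (300, 0, 200, 200)], image_width=640)
```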

extempore|10 years ago

> But I would really appreciate it if anyone else has more thoughts about how to filter out images that do not contain people :-)

Google allows image searches to be filtered by is/isn't a face. I think you could tap into that knowledge, although it isn't immediately clear what the route would be.

Here's a detailed rundown of the more obscure Google search parameters: https://stenevang.wordpress.com/2013/02/22/google-search-url...

The relevant one to your interest is "tbs=itp:face".
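As a small illustration, a search URL carrying that parameter could be built like this. Only `tbs=itp:face` comes from the comment above; the base URL, the `tbm=isch` image-search switch and the function name are my assumptions, and none of this says anything about how the results could be used for filtering your own images.

```python
from urllib.parse import urlencode


def face_search_url(query):
    """Build a Google image search URL restricted to faces (assumed URL layout)."""
    params = {
        "q": query,
        "tbm": "isch",       # assumption: selects image search
        "tbs": "itp:face",   # from the thread: restrict results to faces
    }
    return "https://www.google.com/search?" + urlencode(params)


url = face_search_url("selfie")
```

Note that `urlencode` percent-encodes the colon, so the parameter appears as `tbs=itp%3Aface` in the resulting URL.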