I used pifilter (now WeSEE:Filter, http://corp.wesee.com/products/filter/) for a production, real-time, anonymous confession site (imeveryone.com) in 2010.
It cost, IIRC, a tenth of a cent per image URL. Rather than relying on skin tone, it was built on algorithms that specifically identify labia, anuses, penises, etc. The REST API was simple: send a URL, get back yes/no/maybe, and you decided what to do with the maybes (a sketch of such a call follows my notes below).
My experience:
- Before launch, I tested it with 4chan /b/ as a feed and was able to produce a mostly clean version of /b/, with the exception of cartoon imagery.
- It caught most of the stuff people tried to post to the site. Small-breasted women (breasts being considered 'adult' in the US) were the only thing that would get through, and that wasn't a huge concern. Completely unmaintained pubic hair (about as revealing as a black bikini) would also get through.
- Since people didn't know what I was testing with, they didn't work around it (so nobody tried posting drawings or cartoons), but I imagine, e.g., a photo of a prolapse might not trigger the anus detection, as the shape would be too different.
- pifilter erred on the side of false negatives, but there was one notable false positive: a pastrami sandwich.
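A minimal sketch of what such a yes/no/maybe call might look like, assuming a hypothetical endpoint and response field (pifilter's real API isn't documented here):

    import requests

    API_URL = "https://filter.example.com/classify"  # hypothetical endpoint

    def classify_image(image_url: str) -> str:
        """Return 'yes' (porn), 'no' (clean), or 'maybe' for an image URL."""
        resp = requests.get(API_URL, params={"url": image_url}, timeout=10)
        resp.raise_for_status()
        return resp.json()["verdict"]  # hypothetical field name

    def send_to_human_review(image_url: str) -> bool:
        # Stub: queue for a moderator and reject until cleared.
        print(f"queued for review: {image_url}")
        return False

    def allow_upload(image_url: str) -> bool:
        verdict = classify_image(image_url)
        if verdict == "maybe":
            return send_to_human_review(image_url)  # you decide what to do with maybes
        return verdict == "no"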
I don't remember where I read this, but someone once recommended having a bot that posts the image to a 4chan /b/ thread and monitors the views and replies. Since porn images on /b/ usually got somewhat predictable replies, it would be a good way to augment another technique and prevent false positives. Funny idea, but I'm not really sure how viable it is.
Regarding small-breasted women getting through: that could also be developing breasts on a youth, and that would mean the image is something you very much want to block and report.
In The Olden Pre-Digital Days, porn was either in print or on a television screen. Back then (we are talking two whole decades ago), experienced broadcast engineers could instantly spot porn just by catching a glance at an oscilloscope (of which there were usually many in a machine room).
Notionally the oscilloscope was there to show that the luminance and chroma in the signal were okay (i.e. that it could be broadcast over the airwaves and look as intended at the other end - PAL/NTSC). However, porn, and anything likely to be porn, had a distinctive pattern on the oscilloscope screen. If porn was suspected, the source material would obviously be patched through to a monitor 'just in case'.
Note that the oscilloscope was analog and that the image changed 25/30 times a second. Also, back then there was less on broadcast TV to trigger false positives, e.g. the kind of pop videos that today's audience deems artful rather than porn.
If I had to solve the problem programmatically, I would find a retired broadcast engineer and start from there, with what can be learned from a 'scope.
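A crude digital analogue of the 'scope trick, as a sketch: measure how much of a frame's chroma falls in the skin-tone region of YCbCr space, much as skin reads as a distinctive cluster on a vectorscope. The Cb/Cr bounds below are the commonly cited Chai-Ngan skin ranges, and the 30% alert threshold is an arbitrary assumption:

    import numpy as np
    from PIL import Image

    def skin_chroma_fraction(path: str) -> float:
        ycbcr = np.asarray(Image.open(path).convert("YCbCr"), dtype=np.uint8)
        cb, cr = ycbcr[..., 1], ycbcr[..., 2]
        skin = (cb >= 77) & (cb <= 127) & (cr >= 133) & (cr <= 173)
        return float(skin.mean())  # fraction of skin-toned pixels

    if skin_chroma_fraction("frame.png") > 0.30:
        print("suspicious: patch this source through to a monitor")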
I have developed an algorithm to detect such images, based on several articles published by research teams all over the world (it's incredible to see how many teams have tried to solve this problem!).
I found that no single technique works well. If you want an effective algorithm, you probably have to blend different ideas and compute a "nudity score" for each image. That's at least what I do (a sketch of such a blend follows below).
I'd be happy to discuss how it works. Here are a few techniques used:
- color recognition (as discussed in other comments)
- Haar wavelets to detect specific shapes (that's what Facebook and others use to detect faces, for example)
- texture recognition (skin and wood may have the same colors but not the same texture)
- shape/contour recognition (machine learning of course)
- matching with a growing database of NSFW images
The algorithm is open for testing here: http://sightengine.com
It works OK right now, but once version 2 is out it should really be great.
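A sketch of what blending several weak detectors into one "nudity score" might look like; the component scorers are stubs, and the weights and threshold are arbitrary assumptions you would fit on labeled data:

    def skin_color_score(img) -> float:    # e.g. fraction of skin-toned pixels
        return 0.0

    def texture_score(img) -> float:       # e.g. skin-vs-wood texture classifier
        return 0.0

    def shape_score(img) -> float:         # e.g. Haar-cascade body-part hits
        return 0.0

    def known_nsfw_score(img) -> float:    # e.g. near-duplicate database match
        return 0.0

    WEIGHTS = (0.25, 0.20, 0.30, 0.25)

    def nudity_score(img) -> float:
        scores = (skin_color_score(img), texture_score(img),
                  shape_score(img), known_nsfw_score(img))
        return sum(w * s for w, s in zip(WEIGHTS, scores))

    def is_nsfw(img, threshold: float = 0.5) -> bool:
        return nudity_score(img) >= threshold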
Amazon Mechanical Turk has an adult-content qualification specifically for this purpose. Lots of people have done the paperwork to qualify for adult-content jobs, and the cost of having humans do it at scale is very low: https://requester.mturk.com/help/faq#can_explicit_offensive
Source: I helped implement a MT job to filter adult content for a large hosting company.
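A sketch of posting such a job with boto3. The adult-content qualification type ID below is quoted from memory (verify it against the MTurk docs), and the question XML is a placeholder:

    import boto3

    mturk = boto3.client("mturk", region_name="us-east-1")

    ADULT_QUAL_ID = "00000000000000000060"  # system adult-content qualification (verify)
    QUESTION_XML = "..."  # your real ExternalQuestion/HTMLQuestion XML goes here

    response = mturk.create_hit(
        Title="Flag adult content in an image",
        Description="Look at one image and mark it clean / questionable / adult.",
        Reward="0.02",
        MaxAssignments=3,                  # majority vote over three workers
        AssignmentDurationInSeconds=300,
        LifetimeInSeconds=86400,
        Question=QUESTION_XML,
        QualificationRequirements=[{
            "QualificationTypeId": ADULT_QUAL_ID,
            "Comparator": "EqualTo",
            "IntegerValues": [1],
        }],
    )
    print(response["HIT"]["HITId"])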
I did this for my bachelor's thesis for a company that shall remain unnamed. I am pretty confident that my approach works better than any of the answers posted on Stack Overflow.
I used the so-called Bag of Visual Words approach, at the time the state of the art in image recognition (now it's neural networks). You can read about it on Wikipedia. The only major change from the standard approach (SIFT + k-means + histograms + SVM + chi-squared kernel) was that I used a version of SIFT that uses color features. In addition, I used a second machine-learning classifier based on the context of the picture: Who posted it? Is it a new user? What are the words in the title? How many views does the picture have?
In combination, the two classifiers worked nearly flawlessly.
Shortly after that, Chatroulette was having its porn problem, and it was in the media that the founder was working on a porn filter. I sent an email to offer my help, but didn't get a response.
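Not the author's code, but a sketch of the standard pipeline named above (SIFT features, a k-means codebook, per-image word histograms, and an SVM with a chi-squared kernel), with illustrative sizes and parameters:

    import cv2
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics.pairwise import chi2_kernel
    from sklearn.svm import SVC

    K = 500  # codebook size
    sift = cv2.SIFT_create()

    def descriptors(path):
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(gray, None)
        return desc if desc is not None else np.empty((0, 128), np.float32)

    def histogram(path, codebook):
        desc = descriptors(path)
        hist = np.zeros(K)
        if len(desc):
            hist = np.bincount(codebook.predict(desc), minlength=K).astype(float)
        return hist / max(hist.sum(), 1.0)

    def train(paths, labels):
        codebook = KMeans(n_clusters=K).fit(np.vstack([descriptors(p) for p in paths]))
        X = np.array([histogram(p, codebook) for p in paths])
        svm = SVC(kernel="precomputed").fit(chi2_kernel(X, X), labels)
        return codebook, svm, X

    def predict(path, codebook, svm, X_train):
        h = histogram(path, codebook).reshape(1, -1)
        return svm.predict(chi2_kernel(h, X_train))[0]

The context features (poster, account age, title words, view count) would feed a second, ordinary classifier whose output you combine with this one.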
This sounds quite interesting. Is there any of the research or code that you can share? Or otherwise, any references about the standard approach that you would recommend?
This is probably going to get downvoted, but if lots of people are not overzealous puritans and want some skin, the best overall system design that maximizes happiness and profit is probably sharding into
puritanweirdos.example.com with no skin showing between toes and top of turtleneck (edited to add no pokies either)
and
normalpeople.example.com with 99% of the human race
The best solution to a problem involving computers is sometimes computer-related, but sometimes it's social. The puritans are never going to get along with the normal people anyway, so it's not like sharding them is going to hurt.
Another way to hack the system is not to hire or accept holier-than-thou puritans. Personality doesn't mesh with the team, doesn't fit the culture, etc. You have to draw the line somewhere, and weirdos on either end should get cut, so no CP or animals at one extreme, and no holy rollers at the other.
The final social hack is that it's kind of like dealing with bullies via appeasement. They're blocking reasonable stuff today; tomorrow they want to block all women not wearing burqas, or depictions of women damaging their ovaries by driving. Appeasing bullies never really works in the long run, so why bother starting. "If you claim not to like it, or at least enjoy telling everyone else repeatedly how you claim not to like it, stop looking at it so much, case closed."
Porn detection still has its uses, and you're making the mistake of assuming that only puritans are interested in it.
For example, if you've got children: given your stance on the matter, you may not agree that a filter is necessary, but how about being alerted when your children are viewing obscene content? How about being alerted when your children are engaging in sexting?
When it comes to children, maintaining their purity is only one side of the coin, a necessity with which not all people agree. The other side of the coin, which is a pretty objective fact, is that children do get the wrong ideas from what they see, and sometimes it happens with adults too, porn being the main reason men think they need big penises to satisfy women. And there's a lot of weird porn out there. With improper exposure, a child can end up growing up with certain ideas about sex, with certain complexes, and so on.
And I'm not necessarily for censoring that content, as children can find ways around the censorship should they want to, plus these filters aren't perfect anyway. But I would find it useful to have a system that alerts me when my child gets exposed to porn, so that I can take appropriate measures, like having fatherly talks about sex, explaining that what he just saw is a really bad idea in case he looked at something weird.
Plus, exposure to porn can happen completely by accident, and that's my personal problem with it. I take my monthly dose of my stockings fetish, but you know, I like to be in control of when that happens. Going to a website and clicking on something can trigger a popup with ads for poker games or porn. Sometimes they've got sound, too. Imagine hearing the sound of a woman moaning in the workplace. It's totally disrespectful to your colleagues, as it disrupts their workflow. It happened to me once while searching for something on ThePirateBay.
Harassment lawyers would find you their wet dream. The rule with pictures in any work environment is to err on the side of extreme caution.
More than likely you're not going to find out they don't mesh with the team till they quit or are fired. If you're lucky, it's not followed by a lawsuit.
See, it's not their job to not be offended; it's your job to offer a harassment-free work environment. Whom you are catering to depends on what is PC at the time; fortunately, it's pretty easy to determine whose whims to cater to and whose not to.
Your last and second-to-last points seem to contradict each other. You claim that people who are into animal sex are "weirdos" and we should draw the line there, then you claim that people who want to block something are bullies who should not be appeased.
Also, your claim that all people who want to block pornographic images are really bullies who will not stop until all women are in burqas is stupid in itself.
Develop a bot to trawl NSFW sites and hash each image (combined with the 'skin detecting' algorithms detailed previously). Then compare each user-uploaded image's hash with those in the NSFW database.
This technique relies on the assumption that NSFW images spammed onto social media sites will be images that already exist on NSFW sites (or are very similar to them). Then it simply becomes a case of pattern recognition, much like SoundHound for audio or Google Image search.
It wouldn't reliably detect 'original' NSFW material, but given enough cock shots as source material, it could probably find a common pattern over time.
edit: I've just noticed rfusca in the OP suggests a similar method
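A sketch of the matching step using a perceptual "difference hash" rather than a cryptographic one, since near-duplicates (recompressed or resized copies) then land within a small Hamming distance; dHash is one common choice here, not what any particular site uses:

    from PIL import Image

    def dhash(path: str, size: int = 8) -> int:
        # Downscale, grayscale, then encode left-to-right brightness gradients.
        img = Image.open(path).convert("L").resize((size + 1, size))
        px = list(img.getdata())
        bits = 0
        for row in range(size):
            for col in range(size):
                left = px[row * (size + 1) + col]
                right = px[row * (size + 1) + col + 1]
                bits = (bits << 1) | (left > right)
        return bits

    def hamming(a: int, b: int) -> int:
        return bin(a ^ b).count("1")

    def matches_known_nsfw(upload: str, nsfw_hashes: set, max_dist: int = 5) -> bool:
        h = dhash(upload)
        return any(hamming(h, known) <= max_dist for known in nsfw_hashes)

Note this tolerates recompression, resizing, and mild color shifts, but a single global hash is defeated by crops and flips; hashing tiles of the image, or also storing hashes of mirrored versions, are common workarounds.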
Do you have to tell it what shapes/colors to look for? Or does it do a combination of overall image similarity, localized image similarity, and portion-by-portion image comparison?
Is it possible to hash an image so that you can partially match it with subsets of that image (like cropped regions or resizes)? Or with a slight modification of that image (colors shifted, image flipped, etc.)?
Detecting all porn seems to be an almost impossible problem. Many kinds of advanced porn (BDSM, etc.) don't show much skin - often the actors are in latex, tied up, or whatever. It's obviously porn when you see it, but detecting it seems incredibly hard.
Detecting smurf-porn (1) (yes, that's a thing...) is even harder since all the actors are blue.
(1) http://pinporngifs.blogspot.dk/2012/09/smurfs-porn.html?zx=7... - obviously very NSFW, but quite funny.
It is possible to get high accuracy if you use machine learning and a sufficiently large training set. That said, even humans sometimes don't agree on whether something is porn or not.
man 1 smurf-porn ?!?!
I'm trying to find an interview with one of these people describing what it's like on the other end. It wasn't a pleasant story. These folks are employed by the likes of Facebook, Photobucket, etc. Most are outsourced, obviously, and they all have very high turnover.
Edit: No shortage of stock image reviewer jobs https://google.com/search?hl=en&q=%22image%20reviewer%22
Edit: I think it was this one: http://www.buzzfeed.com/reyhan/tech-confessional-the-googler...
Nobody has discussed i18n and l10n issues? What passes for pr0n in SF is a bit different from tx.us, and that's different from .eu and from .sa (sa being Saudi Arabia, not South Africa, although they've probably got some interesting cultural norms too).
If you're trying for "must not offend any human being on the planet", then you've got an AI problem that exceeds even human intelligence. Especially when it extends past pr0n and into stuff like satire: is that just some dude's weird self-portrait, or a satire of the prophet, and are you qualified to figure it out?
How about a picture of a woman's breasts? What about an erect penis? Sounds like porn, but you might also see these things in the context of health-related pictures or some other educational material.
The classic problem of trying to filter pornography is trying to separate it from information about human bodies. I suspect that doing this with images will be even harder than doing it with text.
Definitely true. Facebook had a dust-up when a woman posted a topless photo of herself after she had had a double mastectomy.
That said, not all sites are like Facebook and we aren't talking about filtering all the images on the internet, just ones on specific sites. One example I can think of is that a forum for a sports team might not want NSFW pictures posted as it would be irrelevant.
Seems like we had this same problem with email spam, and Bayesian learning filters revolutionized the spam-filtering landscape. Has anyone tried throwing machine learning at this problem?
We as humans can readily classify images into three vague categories: clean, questionable, and pornographic. The problem of classification is not only one of determining which bucket an image falls into but also one of determining where the boundaries between buckets are. Is a topless woman pornographic? A topless man? A painting of a topless woman created centuries ago by a well-recognized artist? A painting of a topless woman done yesterday by a relatively unknown artist? An infant being bathed? A woman breastfeeding her baby? Reasonable people may disagree on which bucket these examples fall in.
So what if I create three filter sets (restrictive, moderate, and permissive) and then categorize 1,000 sample images into one of those three buckets for each filter set (restrictive could be the same as moderate but filter questionable images as well as pornographic ones)?
Assuming that the learning algorithm was programmed to look at a sufficiently large number of image attributes, this approach should easily be capable of creating the most robust (and learning!) filter to date.
Has anyone done this?
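One way to realize this, as a sketch: train a single probabilistic classifier and cut its score at two thresholds to get the three buckets, then let each filter set decide which buckets to block. The threshold values are arbitrary assumptions you would tune:

    def bucket(p_porn: float, lo: float = 0.2, hi: float = 0.7) -> str:
        if p_porn >= hi:
            return "pornographic"
        if p_porn >= lo:
            return "questionable"
        return "clean"

    FILTER_SETS = {
        "restrictive": {"questionable", "pornographic"},  # blocks both
        "moderate":    {"pornographic"},
        "permissive":  set(),                             # blocks nothing automatically
    }

    def blocked(p_porn: float, filter_set: str) -> bool:
        return bucket(p_porn) in FILTER_SETS[filter_set]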
This was my first thought. With a good training set and a savvy algorithm, I believe machine learning can be good with images, and there's an unprecedented amount of training data out there to be scraped...
Everyone is focusing on the machine vision problem, but the OP had a good idea:
>There are already a few image based search engines as well as face recognition stuff available so I am assuming it wouldn't be rocket science and it could be done.
Just do a reverse image search for the image, see if it comes up on any porn sites or is associated with porn words.
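A sketch of that heuristic; reverse_image_search below is a hypothetical stand-in for whatever engine or API you use, and the keyword list and hit count are naive illustrations:

    PORN_WORDS = {"porn", "xxx", "nsfw", "nude"}

    def reverse_image_search(image_url: str) -> list:
        """Hypothetical: returns hits like [{'domain': ..., 'title': ...}, ...]."""
        raise NotImplementedError

    def looks_like_porn(image_url: str) -> bool:
        hits = reverse_image_search(image_url)
        flagged = sum(
            1 for h in hits
            if any(w in h["domain"] or w in h["title"].lower() for w in PORN_WORDS)
        )
        return flagged >= 3  # arbitrary: several independent porn-associated hits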
http://en.wikipedia.org/wiki/I_know_it_when_I_see_it
Basically, it's impossible to identify pornography with complete accuracy without a human actor in the mix, due to the subjectivity... and especially considering that not all nudity is pornographic.
This is a classic categorization problem for machine learning. I'm surprised so many suggestions have involved formulating some sort of clever algorithm like skin detection, colors, etc. You could certainly use one of those for a baseline, but I'd bet machine learning would out-score most human-derived algorithms. Take a look at the scores for classifying dogs vs. cats with 97% accuracy: http://www.kaggle.com/c/dogs-vs-cats/leaderboard. You could use a technique of digitizing the image pixels and feeding them to a learning algorithm, similar to http://www.primaryobjects.com/CMS/Article154.aspx.
I am aware of some nice scholarly work in this space. You may find the approach of Shih et al. of particular interest [0]. Their approach is very straightforward and based on image retrieval. They report an accuracy of 99.54% for adult image detection on their dataset.
[0] Shih, J. L., Lee, C. H., & Yang, C. S. (2007). An adult image identification system employing image retrieval technique. Pattern Recognition Letters, 28(16), 2367-2374. http://sjl.csie.chu.edu.tw/sjl/albums/userpics/10001/An_adul...
Wouldn't testing for skin colors produce far too many false positives to be useful? All those beach photos, fashion lingerie photos, even close-up portraits. And how about the half of today's music stars who seem to try never to get caught more clothed than half-naked?
Nudity != porn, and certainly half-nudity != porn.
I'd rather go for pattern recognition. There's a lot of image recognition software these days that can distinguish the Eiffel Tower from the Statue of Liberty, and it might be useful for detecting certain body parts and certain body configurations (for those shots that don't contain any private body part but do have two bodies in an unambiguous configuration).
When I was a kid, we had a firewall at school that tried to filter pornography by doing something similar with text. Doing research on breast cancer turned out to be rather tricky.
So let's say you try to detect certain body parts. Now you have someone who wants to know more about their body, but you are classifying images from medical / health articles as pornography.
"certain body configurations"
So now, instead of having trouble reading about my own body, I will have trouble looking at certain martial arts photos:
https://upload.wikimedia.org/wikipedia/commons/1/14/BostonKe...
I am not saying these are unsolvable problems, but they are certainly hard problems. Even using humans to filter images tends to result in outrageous false positives sometimes:
http://abcnews.go.com/Health/breastfeeding-advocates-hold-fa...
While I agree that programmatically eliminating porn images is a very hard problem, programmatically filtering porn websites might be easier, going beyond just a simple keyword search and whitelist.
If you assume that porn tends to cluster, rather than exist in isolation, then crawling the other images on the source pages and applying computer vision techniques should allow you to block pages that score above a threshold number of positive results (thus accounting for inaccuracy and false positives).
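A sketch of that page-level scoring; the per-image classifier is a stub for any technique in this thread, and the 40% threshold is an assumption:

    def image_is_nsfw(image_url: str) -> bool:
        return False  # plug in any per-image detector from this thread

    def block_page(image_urls: list, threshold: float = 0.4) -> bool:
        if not image_urls:
            return False
        positives = sum(image_is_nsfw(u) for u in image_urls)
        return positives / len(image_urls) >= threshold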
"score above a threshold number of positive results"
How about social scoring? A normal (or even a weird) teenage boy would spend less than a second examining my ugly old profile pic, but after ten or so of your known teen male users are detected spending five minutes at a time, a couple of times a day, closely studying a suspected profile pic, I think you can safely conclude that the pic is not a pic of me, and then flag/censor/whatever it for the next 10K users.
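A sketch of that dwell-time heuristic; the two-minute "long look" and the ten-viewer trigger are arbitrary assumptions:

    from collections import defaultdict

    LONG_LOOK_SECONDS = 120
    SUSPICIOUS_VIEWERS = 10

    long_looks = defaultdict(set)  # image_id -> user_ids who studied it

    def record_view(image_id: str, user_id: str, seconds: float) -> bool:
        """Returns True once the image should be flagged for review."""
        if seconds >= LONG_LOOK_SECONDS:
            long_looks[image_id].add(user_id)
        return len(long_looks[image_id]) >= SUSPICIOUS_VIEWERS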