I just scraped data from Reddit and other sources so I could build an NSFW classifier, and chose to open-source the data and the model for the general good.
Note that I was an engineer with one year of experience, working on this project alone in my free time, so it was basically impossible for me to review and clear out the few CSAM images among the 100,000+ images in the dataset.
Now, though, I wonder whether I should ever have open-sourced the data at all. It would have avoided a lot of these issues.
I'm the developer who actually got banned because of this dataset. I used NudeNet offline to benchmark my on-device NSFW app Punge — nothing uploaded, nothing shared.
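(For anyone curious what that kind of offline benchmarking might look like, here's a rough, hypothetical sketch, not Punge's actual code. It assumes NudeNet's current Python API, where NudeDetector().detect(path) returns a list of detections with "class"/"score" fields, and a made-up local folder of test images.)

```python
# Rough sketch of offline NSFW benchmarking with NudeNet (assumes the v3 API:
# NudeDetector().detect(path) -> list of {"class", "score", "box"} dicts).
# Everything runs locally; nothing is uploaded anywhere.
from pathlib import Path

from nudenet import NudeDetector

detector = NudeDetector()
test_dir = Path("benchmark_images")  # hypothetical local test set

total = 0
flagged = 0
for image_path in sorted(test_dir.glob("*.jpg")):
    total += 1
    detections = detector.detect(str(image_path))
    # Treat the image as NSFW if any detection is reasonably confident.
    if any(d.get("score", 0.0) >= 0.5 for d in detections):
        flagged += 1

print(f"flagged {flagged} of {total} images as NSFW")
```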
Your dataset wasn’t the problem. The real problem is that independent developers have zero access to the tools needed to detect CSAM, while Big Tech keeps those capabilities to itself.
Meanwhile, Google and other giants openly use massive datasets like LAION-5B — which also contained CSAM — without facing any consequences at all. Google even used early LAION data to train one of its own models. Nobody bans Google.
But when I touched NudeNet for legitimate testing, Google deleted 130,000+ files from my account, even though only ~700 images out of ~700,000 were actually problematic. That’s not safety — that’s a detection system wildly overfiring, with no independent oversight and no accountability.
Big Tech designed a world where they alone have the scanning tools and the immunity when those tools fail. Everyone else gets punished for their mistakes.
So yes — your dataset has done good. ANY dataset is subject to this. There need to be tools and processes for everyone.
But let’s be honest about where the harm came from: a system rigged so only Big Tech can safely build or host datasets, while indie developers get wiped out by the exact same automated systems Big Tech exempts itself from.
I in no way want to underplay the seriousness of child sexual abuse, but as a naturist I find all this paranoia around nudity and “not safe for work” to be somewhere between hilarious and bewildering. Normal is what you grew up with I guess, and I come from an FKK family. What’s so shocking about a human being? All that stuff in public speaking about “imagine your audience is naked”. Yeah, fine: so what’s Plan B?
Back when the first moat-creation gambit for AI failed (the claim that they were creating Skynet, so the government needed to block anyone else from working on Skynet, since only OpenAI could be trusted to control it and not just any rando), they moved on to the safety angle with the same idea. I recall seeing an infographic showing that all the major players (Meta, OpenAI, Microsoft, etc.) had signed onto some kind of safety pledge. Basically they didn't want anyone else training on the whole world's data because only they could be trusted to not do nefarious things with it. The infographic had a statement about not training on CSAM and revenge porn and the like but the corpospeak it was worded in made it sound like they were promising not to do it anymore, not that they never did.
I've tried to find this graphic again several times over the years, but it's either been scrubbed from the internet or I just can't remember enough details to find it. Amusingly, it only just occurred to me that maybe I should ask ChatGPT to help me find it.
> The infographic had a statement about not training on CSAM and revenge porn and the like but the corpospeak it was worded in made it sound like they were promising not to do it anymore, not that they never did.
We know they did: an earlier version of the LAION dataset was found to contain CSAM after everyone had already trained their image generation models on it.

https://www.theverge.com/2023/12/20/24009418/generative-ai-i...
As a small point of order, they did not get banned for "finding CSAM" as the outrage- and clickbait title claims. They got banned for uploading a dataset containing child porn to Google Drive. They did not find it themselves, and their later reporting of the dataset to an appropriate organization is not why they got banned.
I’m the person who got banned. And just to be clear: the only reason I have my account back is because 404 Media covered it. Nobody else would touch the story because it happened to a nobody. There are probably a lot of “nobodies” in this thread who might someday need a reporter like Emanuel Maiberg to actually step in. I’m grateful he did.
The dataset had been online for six years. In my appeal I told Google exactly where the data came from — they ignored it. I was the one who reported it to C3P, and that’s why it finally came down. Even after Google flagged my Drive, the dataset stayed up for another two months.
So this idea that Google “did a good thing” and 404 somehow did something wrong is just absurd.

Google is abusing its monopoly in all kinds of ways, including quietly wiping out independent developers: https://medium.com/@russoatlarge_93541/déjà-vu-googles-using...
> They got banned for uploading child porn to Google Drive
They uploaded the full "widely-used" training dataset, which happened to include CSAM (child sexual abuse material).
While the title of the article is not great, your wording here implies that they purposefully uploaded standalone CSAM images, which is not accurate.
Just a few days ago I was doing a low-paid (well, not so low) AI classification task, akin to Mechanical Turk ones, for a very big company, and was involuntarily shown an AI image by the platform (I guess they don't review them before showing) depicting a naked man and a naked kid, though it was more Barbie-like than anything else. I didn't really enjoy the view, tbh. I contacted them but got no answer back.
If the picture truly was of a child, the company is _required_ to report CSAM to NCMEC. It's taken very seriously. If they're not being responsive, escalate and report it yourself so you don't have legal problems. See https://report.cybertip.org/.
This raises an interesting point. Do you need to train models using CSAM so that the model can self-enforce restrictions on CSAM? If so, I wonder what moral/ethical questions this brings up.
It's a delicate subject but not an unprecedented one. Automatic detection of already known CSAM images (as opposed to heuristic detection of unknown images) has been around for much longer than AI, and for that service to exist someone has to handle the actual CSAM before it's reduced to a perceptual hash in a database.
Maybe AI-based heuristic detection is more ethically/legally fraught since you'd have to stockpile CSAM to train on, rather than hashing then destroying your copy immediately after obtaining it.
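To make the "reduce it to a perceptual hash in a database" idea concrete, here is a minimal, illustrative sketch of known-image matching. This is a toy difference hash (dHash) built with Pillow, and the dhash/matches_known helpers and the distance threshold are just assumptions for illustration; real systems like PhotoDNA use far more robust, proprietary hashes and vetted hash databases.

```python
# Illustrative sketch of hash-based matching of known images (not the actual
# PhotoDNA/industry pipeline): compute a small perceptual hash (dHash) for an
# image and compare it against a set of hashes of previously flagged images.
from PIL import Image

def dhash(path, hash_size=8):
    """Difference hash: shrink, grayscale, compare adjacent pixel brightness."""
    img = Image.open(path).convert("L").resize((hash_size + 1, hash_size))
    pixels = list(img.getdata())
    bits = 0
    for row in range(hash_size):
        for col in range(hash_size):
            left = pixels[row * (hash_size + 1) + col]
            right = pixels[row * (hash_size + 1) + col + 1]
            bits = (bits << 1) | (1 if left > right else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

def matches_known(path, known_hashes, max_distance=5):
    """True if the image is within max_distance bits of any known hash."""
    h = dhash(path)
    return any(hamming(h, k) <= max_distance for k in known_hashes)
```

With a vetted hash list, a dataset publisher could in principle screen images before release without ever handling or training on the material itself; the catch, as noted upthread, is that access to those vetted lists is tightly restricted.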
I know what porn looks like. I know what children look like. I do not need to be shown child porn in order to recognize it if I saw it. I don't think there's an ethical dilemma here; there is no need if LLMs have the capabilities we're told to expect.
Banning someone for pointing out the ”Emperor’s new clothes” is what autocrats typically do, because the worst thing for them is when anyone embarrasses them.
Technofeudalism strikes again. MAANG can ban people at any time for anything without appeal, and sometimes at the whim of any nation state. Reversal is the rare exception, not the rule, and only happens occasionally due to public pressure.
Slightly unrelated, but I wonder: if a 17-year-old sends a dirty photo of herself to an 18-year-old guy she likes, who goes to jail? Just curious how the law works when there is no "abuse" element.
Both of them, still, in some places, although she may get more lenient treatment because she's a juvenile. Other places have cleaned that up in various ways, although I think he's usually still at risk unless he actively turns her in to the Authorities(TM) for sending the picture.
And there's a subset of crusaders (not all of them, admittedly) who will say, with a straight face, that there is abuse involved. To wit, she abused herself by creating and sending the image, and he abused her either by giving her the idea, or by looking at it.

https://laws-lois.justice.gc.ca/eng/acts/c-46/section-163.1....
It comes down to prosecutorial discretion, and that can go either way.
Prosecutors have broad discretion to proceed with a matter based on whether there is a reasonable prospect of securing a conviction, whether it’s in the public interest to do so and various other factors. They don’t generally bring a lot of rigour to these considerations.
Obviously this depends on the country, but many countries have so-called "Romeo and Juliet" laws which carve out specific exclusions for situations along these lines.
The penalties for unknowingly possessing or transmitting child porn are far too harsh, both in this case and in general (far beyond just Google's corporate policies). Otherwise you could send one image to every American email account and put every American adult in prison.
Again, to avoid misunderstandings, I said unknowingly - I'm not defending anything about people who knowingly possess or traffic in child porn, other than for the few appropriate purposes like reporting it to the proper authorities when discovered.
On one hand, I would like to say this could happen to anyone. On the other hand, what the F? Why are people passing around a dataset that contains child sexual abuse material? And on yet another hand, I think this whole thing just reeks of techy bravado, and I don’t exactly blame him. If one of the inputs to your product (OpenAI, Google, Microsoft, Meta, X) is a dataset that you can’t even say for sure does not contain child pornography, that’s pretty alarming.