mrandish | 1 year ago:
I'm not in the space either, but I think the answer is an emphatic yes. Three categories come to mind:
1. Online trolls and pranksters (who already taught several different AIs to be racist in a matter of hours - just for the LOLs).
2. Nation states like China who already require models to conform to state narratives.
3. More broadly, when training on "the internet" as a whole there is a huge amount of wrong, confused information mixed in.
There's also a meta-point to make here. On a lot of culture war topics, one person's "poisonous information" is another person's "reasonable conclusion."
theendisney | 1 year ago:
I'm looking forward to protoscience/unconventional science, and perhaps even to what deserves the fringe or pseudoscience labels. The debunking there usually fails to address the topic, since it is incredibly hard to spend even a single day reading about something you "know" to be nonsense. Who has time for that?
If you take a hundred thousand such topics, the odds that they should all be dismissed without looking aren't very good.
halfadot | 1 year ago:
> Aaron clearly warns users that Nepenthes is aggressive malware. It's not to be deployed by site owners uncomfortable with trapping AI crawlers and sending them down an "infinite maze" of static files with no exit links, where they "get stuck" and "thrash around" for months, he tells users.
Because a website with lots of links is executable code. And the scrapers totally don't have any checks in them to see if they spent too much time on a single domain. And no data verification ever occurs.
Hell, why not go all the way? Just put a big warning telling everyone: "Warning, this is a cyber-nuclear weapon! Do not deploy unless you're a super rad bad dude who totally traps the evil AI robot and wins the day!"
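For what it's worth, the per-domain guard rails this comment alludes to are trivial to write. A minimal sketch (the names, limits, and logic here are my own illustration, not any real crawler's code):

```python
# Illustrative sketch of why an "infinite maze" rarely traps a production
# scraper: a per-domain page budget plus a wall-clock cap bound the time
# spent on any single site. Limits below are made-up placeholder values.
import time
from collections import defaultdict
from urllib.parse import urlparse

MAX_PAGES_PER_DOMAIN = 1000    # hypothetical budget
MAX_SECONDS_PER_DOMAIN = 600   # hypothetical wall-clock cap

pages_seen = defaultdict(int)  # domain -> pages fetched so far
first_hit = {}                 # domain -> time of first fetch

def should_fetch(url, now=None):
    """Return False once a domain has exhausted its crawl budget."""
    now = time.monotonic() if now is None else now
    domain = urlparse(url).netloc
    first_hit.setdefault(domain, now)
    if pages_seen[domain] >= MAX_PAGES_PER_DOMAIN:
        return False  # page budget spent
    if now - first_hit[domain] > MAX_SECONDS_PER_DOMAIN:
        return False  # time budget spent
    pages_seen[domain] += 1
    return True
```

A tarpit serving endless links just burns through that budget and gets skipped, regardless of how many pages it generates.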
Bad or not, depends on your POV. But certainly there are efforts to feed junk to AI web scrapers, including specialized tools: https://zadzmo.org/code/nepenthes/
And they are hilarious, because they ride on the assumption that multi-billion dollar companies are all just employing naive imbeciles who just push buttons and watch the lights on the server racks go, never checking the datasets.
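A toy illustration of the kind of dataset check being assumed here (the function, thresholds, and heuristics are entirely hypothetical, not any lab's actual pipeline):

```python
# Toy dataset-quality gate: drop documents that are exact duplicates or
# that look like low-entropy generated babble before they reach training.
# All thresholds are made-up placeholder values.
import hashlib
from collections import Counter

seen_hashes = set()

def keep_document(text, max_top_word_frac=0.2, min_unique_frac=0.3):
    """Return True only for documents that pass cheap sanity checks."""
    digest = hashlib.sha256(text.encode()).hexdigest()
    if digest in seen_hashes:
        return False              # exact duplicate
    seen_hashes.add(digest)
    words = text.lower().split()
    if len(words) < 20:
        return False              # too short to judge
    counts = Counter(words)
    if counts.most_common(1)[0][1] / len(words) > max_top_word_frac:
        return False              # one word dominates: likely spam/babble
    if len(counts) / len(words) < min_unique_frac:
        return False              # very repetitive vocabulary
    return True
```

Real pipelines are far more elaborate (near-duplicate detection, classifier-based quality scoring), but even this level of filtering catches a lot of machine-generated junk.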
I would not really classify them as "bad" actors, but there are definitely real lines of research into this. This Freakonomics podcast episode (https://freakonomics.com/podcast/how-to-poison-an-a-i-machin...) is a pretty good interview with Ben Zhao at the University of Chicago. He runs a lab that is attempting to figure out how to trip up model training when copyrighted material is being used.
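As a toy illustration of the general idea of training-data poisoning (my own made-up example, not the lab's actual technique, which targets image models): flipping labels on a slice of the training set shifts where a simple classifier draws its boundary.

```python
# Toy label-flip poisoning demo with a nearest-centroid classifier on 1-D
# data. Entirely illustrative; real poisoning attacks are far subtler.

def centroid(points):
    return sum(points) / len(points)

def train(data):
    """data: list of (x, label) pairs; returns the two class centroids."""
    a = [x for x, y in data if y == 0]
    b = [x for x, y in data if y == 1]
    return centroid(a), centroid(b)

def predict(model, x):
    ca, cb = model
    return 0 if abs(x - ca) <= abs(x - cb) else 1

# Clean data: class 0 near 0.0-0.9, class 1 near 2.0-2.9.
clean = [(i / 10, 0) for i in range(10)] + [(2 + i / 10, 1) for i in range(10)]
model = train(clean)

# Poison: relabel the class-1 points closest to the boundary as class 0.
poisoned = [(x, 0) if (y == 1 and x < 2.25) else (x, y) for x, y in clean]
model_p = train(poisoned)
```

With the clean model a point like x = 1.7 is classified as class 1; after poisoning just three labels, the same point flips to class 0.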
I deliberately pick wrong answers in reCAPTCHA sometimes. I’ve found out that the audio version accepts basically any string slightly resembling the audio, so that’s the easiest way. (Images on the other hand punish you pretty hard at times – even if you solve it correctly!)
blibble|1 year ago
More often than not, I give a thumbs up to bad Google AI answers
(but not always! can't find me that easily!)