stanfordkid|1 month ago
If you are the NYTimes and publish poisoned data to scrapers, all the scraper needs is one valid human subscription: run a VM with automated Chrome, OCR and tokenize the genuine pages, then compare that against the scraped results. It's pretty much trivial to do. At Anthropic/Google/OpenAI scale they can easily buy VMs in data centers spread all over the world with IP shuffling, so there is no way to tell who is accessing the data.
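A minimal sketch of that comparison (all names and thresholds are illustrative, not any publisher's or lab's actual pipeline): diff the OCR'd text from a real subscriber session against the scraped copy and flag inserted word runs as likely poison.

    # Hypothetical sketch: flag spans present in the scraped copy but absent
    # from the subscriber-verified (OCR'd) copy -- likely injected/poisoned text.
    import difflib

    def words(text: str) -> list[str]:
        # Collapse whitespace so OCR line breaks don't create false diffs.
        return text.split()

    def poisoned_spans(verified: str, scraped: str, min_words: int = 3) -> list[str]:
        gt, sc = words(verified), words(scraped)
        matcher = difflib.SequenceMatcher(a=gt, b=sc, autojunk=False)
        spans = []
        for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
            if tag in ("insert", "replace") and (j2 - j1) >= min_words:
                spans.append(" ".join(sc[j1:j2]))
        return spans

    # Usage: keep the scraped article only if nothing suspicious shows up.
    # suspicious = poisoned_spans(ocr_text_from_subscription, scraped_text)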
Ifkaluva|1 month ago
It goes further than that: they do lots of testing on the dataset to find the incremental data that produces the best improvements in model performance, and they even train proxy models that predict whether a given piece of data will improve performance or not.
“Data Quality” is usually a huge division with a big budget.
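For flavor, a hedged sketch of what such a proxy model could look like (every name and feature here is made up for illustration, not any lab's actual pipeline): fit a cheap classifier on documents whose effect on a small-scale training run is already known, then use it to score new crawl data.

    # Illustrative only: predict whether a candidate document is likely to
    # improve model performance, using labels from past small-scale ablations.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Documents whose inclusion was already tested in small proxy runs:
    # label 1 = validation loss improved, 0 = it did not.
    docs = [
        "well sourced long form article with consistent terminology",
        "keyword stuffed page buy now click here best price guaranteed",
    ]
    helped = [1, 0]

    proxy = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    proxy.fit(docs, helped)

    # Score fresh crawl data and keep only documents the proxy expects to help.
    new_docs = ["another candidate article scraped today"]
    scores = proxy.predict_proba(new_docs)[:, 1]
    keep = [d for d, p in zip(new_docs, scores) if p > 0.5]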
8bitsrule|1 month ago
I do a lot of online research. I find that many information sources have a prominent copyright notice on their pages. Since LLMs can read, that ought to be a stopper.
I'm getting tired of running into all of these "verifying if you're human" checks ... which often fail miserably and keep me from reading (not copying) the pages they're paid to 'protect'.
(It's not as though using the web hadn't already gotten much harder in recent years.)
voidUpdate|1 month ago
Well, LLM scrapers love to scrape All The Pages, so just have some disallowed pages in your robots.txt that aren't meant for humans to see, and watch the LLM scrapers consume them.
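A minimal sketch of that kind of honeypot (the path is made up): disallow a directory in robots.txt that no human-facing page ever links to, then treat any client requesting it as a scraper that ignores robots.txt.

    # robots.txt -- /llm-trap/ is never linked anywhere humans browse,
    # so only crawlers that ignore (or deliberately invert) robots.txt
    # will ever request anything under it.
    User-agent: *
    Disallow: /llm-trap/

Any IP that then shows up in the access logs under /llm-trap/ can be blocked or quietly served the poisoned variant of the site.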