stanfordkid|1 month ago
If you are the NYTimes and publish poisoned data to scrapers, all the scraper needs is one valid human subscription: run a VM with automated Chrome, OCR and tokenize the genuine pages, then compare that against the scraped results. It's pretty much trivial to do. At Anthropic/Google/OpenAI scale they can easily buy VMs in data centers spread all over the world with IP shuffling, so there is no way to tell who is accessing the data.
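A minimal sketch of that comparison (all names and thresholds are illustrative, not any publisher's or lab's actual pipeline): diff the OCR'd text from a real subscriber session against the scraped copy and flag inserted word runs as likely poison.

    # Hypothetical sketch: flag spans present in the scraped copy but absent
    # from the subscriber-verified (OCR'd) copy -- likely injected/poisoned text.
    import difflib

    def words(text: str) -> list[str]:
        # Collapse whitespace so OCR line breaks don't create false diffs.
        return text.split()

    def poisoned_spans(verified: str, scraped: str, min_words: int = 3) -> list[str]:
        gt, sc = words(verified), words(scraped)
        matcher = difflib.SequenceMatcher(a=gt, b=sc, autojunk=False)
        spans = []
        for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
            if tag in ("insert", "replace") and (j2 - j1) >= min_words:
                spans.append(" ".join(sc[j1:j2]))
        return spans

    # Usage: keep the scraped article only if nothing suspicious shows up.
    # suspicious = poisoned_spans(ocr_text_from_subscription, scraped_text)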
Ifkaluva|1 month ago
It goes further than that: they do lots of testing on the dataset to find the incremental data that produces the best improvements in model performance, and they even train proxy models that predict whether a given piece of data will improve performance or not.
“Data Quality” is usually a huge division with a big budget.
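For flavor, a hedged sketch of what such a proxy model could look like (every name and feature here is made up for illustration, not any lab's actual pipeline): fit a cheap classifier on documents whose effect on a small-scale training run is already known, then use it to score new crawl data.

    # Illustrative only: predict whether a candidate document is likely to
    # improve model performance, using labels from past small-scale ablations.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Documents whose inclusion was already tested in small proxy runs:
    # label 1 = validation loss improved, 0 = it did not.
    docs = [
        "well sourced long form article with consistent terminology",
        "keyword stuffed page buy now click here best price guaranteed",
    ]
    helped = [1, 0]

    proxy = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    proxy.fit(docs, helped)

    # Score fresh crawl data and keep only documents the proxy expects to help.
    new_docs = ["another candidate article scraped today"]
    scores = proxy.predict_proba(new_docs)[:, 1]
    keep = [d for d, p in zip(new_docs, scores) if p > 0.5]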
8bitsrule|1 month ago
I do a lot of online research. I find that many information sources have a prominent copyright notice on their pages. Since LLMs can read, that ought to be a stopper.
I'm getting tired of running into all of these "verifying if you're human" checks ... which often fail miserably and keep me from reading (not copying) the pages they're paid to 'protect'.
(It's not as though using the web hadn't already gotten much harder in recent years.)
voidUpdate|1 month ago
Well, LLM scrapers love to scrape All The Pages, so just have some disallowed pages in your robots.txt that aren't meant for humans to see, and watch the LLM scrapers consume them.
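A minimal sketch of that kind of honeypot (the path is made up): disallow a directory in robots.txt that no human-facing page ever links to, then treat any client requesting it as a scraper that ignores robots.txt.

    # robots.txt -- /llm-trap/ is never linked anywhere humans browse,
    # so only crawlers that ignore (or deliberately invert) robots.txt
    # will ever request anything under it.
    User-agent: *
    Disallow: /llm-trap/

Any IP that then shows up in the access logs under /llm-trap/ can be blocked or quietly served the poisoned variant of the site.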