
f33d5173|15 days ago

So instead of scraping IA once, the AI companies will use residential proxies and each scrape the site themselves, costing the news sites even more money. The only real loser is the common man who doesn't have the resources to scrape the entire web himself.

I've sometimes dreamed of a web where every resource is tied to a hash and can be rehosted by third parties, making archival transparent. This would also make it trivial to stand up a small website without worrying about it getting hug-of-deathed, since others would rehost your content for you. Shame IPFS never went anywhere.
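
Content addressing is the kernel of that dream: if the URL is the hash of the bytes, anyone can serve them and the client can verify them itself. A minimal Python sketch of the client side, assuming hypothetical mirror hosts:

    import hashlib
    import urllib.request

    # Hypothetical mirrors; in a hash-addressed web, any host would do.
    MIRRORS = ["https://mirror-a.example/", "https://mirror-b.example/"]

    def fetch_by_hash(hex_digest: str) -> bytes:
        for base in MIRRORS:
            try:
                data = urllib.request.urlopen(base + hex_digest, timeout=10).read()
            except OSError:
                continue  # mirror down or missing the object: try the next one
            if hashlib.sha256(data).hexdigest() == hex_digest:
                return data  # bytes match the address, so the host is irrelevant
        raise LookupError("no mirror served content matching the hash")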


CqtGLRGcukpy|15 days ago

The AI companies won't just scrape IA once; they keep coming back to the same pages and scraping them over and over, even if nothing has changed.

This is from my experience running a personal website: the AI companies keep coming back even when nothing has changed.
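
The galling part is that HTTP already solves this. A crawler that stored the ETag/Last-Modified validators from its first fetch could recheck a page for almost nothing; rough Python sketch (the stored validators are whatever you saved from the previous response):

    import urllib.error
    import urllib.request

    def recrawl(url: str, etag: str | None, last_modified: str | None) -> bytes | None:
        req = urllib.request.Request(url)
        if etag:
            req.add_header("If-None-Match", etag)
        if last_modified:
            req.add_header("If-Modified-Since", last_modified)
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                return resp.read()  # page actually changed
        except urllib.error.HTTPError as err:
            if err.code == 304:
                return None  # unchanged: the server sent headers only, no body
            raise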

giancarlostoro|15 days ago

Weird. Considering IA exposes most of its content in a rehostable form, I don't know why nobody is hosting an IA carbon copy that AI companies can hit endlessly, cutting IA a nice little check in the process. But I guess some of the wealthiest AI startups are very frugal about training data?

This also goes back to something I said long ago: AI companies are relearning software engineering, poorly. I can think of so many ways to speed up AI crawlers; I'm surprised someone being paid 5x my salary cannot.

zmmmmm|15 days ago

yeah, they should really have a think about how their behavior is harming their future prospects here.

Just because you have infinite money to spend on training doesn't mean you should saturate the internet with bots looking for content with no constraints - even if that is a rounding error of your cost.

We just put heavy constraints on our public sites, blocking AI access. Not because we mind AI having access - but because we can't accept the abusive way they execute that access.
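
Blocking by user agent is the blunt first line of defense; a sketch in Python with Flask (the UA list is illustrative, and this only catches bots that identify themselves honestly):

    from flask import Flask, abort, request

    app = Flask(__name__)

    # Illustrative deny list; real abusers rotate user agents, so pair
    # this with rate limiting.
    BLOCKED_UA_SUBSTRINGS = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider")

    @app.before_request
    def block_ai_crawlers():
        ua = request.headers.get("User-Agent", "")
        if any(bot in ua for bot in BLOCKED_UA_SUBSTRINGS):
            abort(403)

    @app.route("/")
    def index():
        return "hello"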

dawnerd|14 days ago

It’s actually insane how fast they re-request the same pages, even 404s. They’re so desperate for data that they’re really hurting smaller hosts. One of our clients' sites became unusable when one of the AI bots started spamming the WordPress search for terms that I’m guessing users were searching for, but which were unrelated to the site's content. Instead of building a search index, they’re just hammering sites directly. So annoying.
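
The crawler-side fix is trivial: index the pages you've already fetched once and run queries locally, instead of hitting the site's search endpoint. A toy inverted index in Python (the page data is made up):

    from collections import defaultdict

    def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
        index: dict[str, set[str]] = defaultdict(set)
        for url, text in pages.items():
            for word in text.lower().split():
                index[word].add(url)
        return index

    def search(index: dict[str, set[str]], term: str) -> set[str]:
        return index.get(term.lower(), set())  # zero requests to the origin

    idx = build_index({"/post-1": "AI crawlers hammer small sites",
                       "/post-2": "please be polite"})
    print(search(idx, "crawlers"))  # {'/post-1'}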

nickpsecurity|14 days ago

It can be 10,000 requests a day on static HTML and non-existent PHP pages. That's on my site. I'd rather they have Christ-centered and helpful content in their pretraining. So, I still let them scrape it for the public good.

It helps not to have images, etc., that would drive up bandwidth costs. Serving HTML costs just pennies a month with BunnyCDN. If I had heavier content, I might have to block them, or restrict them to specific pages once per day. Maybe just block the heavy content, like the images.

Btw, has anyone tried just blocking things like images to see if scraping bandwidth dropped to acceptable levels?

iririririr|15 days ago

> The AI companies won't just scrape IA once; they keep coming back to the same pages and scraping them over and over, even if nothing has changed.

Maybe they vibecoded the crawlers. I wish I were joking.

anonnon|15 days ago

> The AI companies won't just scrape IA once; they keep coming back to the same pages and scraping them over and over, even if nothing has changed.

Why, though? Especially if the pages are new; aren't they concerned about ingesting AI-generated content?

Operyl|15 days ago

They already are. I've been dealing with Vietnamese and Korean residential proxies destroying my systems for weeks, and I'm growing tired. I cannot survive 3500 RPS 24/7.

shark_laser|15 days ago

> I've sometimes dreamed of a web where every resource is tied to a hash and can be rehosted by third parties, making archival transparent. This would also make it trivial to stand up a small website without worrying about it getting hug-of-deathed, since others would rehost your content for you. Shame IPFS never went anywhere.

You've just described Nostr: content tied to a hash (so its origin and authenticity can be verified), hosted by third parties (or by yourself, if you want).
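
As I understand NIP-01, the event id is just the SHA-256 of a canonical serialization, so any client can verify content fetched from any relay or rehoster. A Python sketch (the field names are the standard NIP-01 ones; treat the details as my reading of the spec):

    import hashlib
    import json

    def nostr_event_id(event: dict) -> str:
        # NIP-01: id = sha256 of the compact JSON array below.
        serialized = json.dumps(
            [0, event["pubkey"], event["created_at"], event["kind"],
             event["tags"], event["content"]],
            separators=(",", ":"),
            ensure_ascii=False,
        )
        return hashlib.sha256(serialized.encode("utf-8")).hexdigest()

    def verify(event: dict) -> bool:
        # If whoever rehosted the event altered it, the id won't match.
        return nostr_event_id(event) == event["id"]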

Hendrikto|14 days ago

With Nostr you can host your content anywhere, but for it to actually be discoverable you need to declare that host. Third parties therefore cannot really solve the problem for you without your help.

demetris|15 days ago

I don’t believe residential proxies will be with us for long, at least not to the extent they are now. There is pressure, and there are strong commercial interests against the whole thing. I think the problem will partly solve itself.

Also, I always wonder about Common Crawl:

Is there something wrong with it? Is it badly designed? What is it that all the trainers cannot find there, such that they need to crawl our sites over and over again for the exact same stuff, each on their own?

ccgreg|15 days ago

Many AI projects in academia and research get all of their web data from Common Crawl -- in addition to the many non-AI uses of our dataset.

The folks who crawl more appear to mostly be folks who are doing grounding or RAG, and also AI companies who think that they can build a better foundational model by going big. We recommend that all of these folks respect robots.txt and rate limits.
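
For anyone who hasn't tried it, pulling a captured page out of Common Crawl is two requests: query the CDX index, then range-read the WARC record. Rough Python sketch (the crawl id is just an example; pick a current one from the index listing):

    import json
    import urllib.parse
    import urllib.request

    INDEX = "https://index.commoncrawl.org/CC-MAIN-2024-33-index"  # example crawl

    def fetch_capture(url: str) -> bytes:
        params = urllib.parse.urlencode({"url": url, "output": "json"})
        with urllib.request.urlopen(f"{INDEX}?{params}", timeout=30) as resp:
            records = [json.loads(line) for line in resp.read().splitlines()]
        rec = records[-1]  # take the last capture returned
        start = int(rec["offset"])
        end = start + int(rec["length"]) - 1
        req = urllib.request.Request(
            "https://data.commoncrawl.org/" + rec["filename"],
            headers={"Range": f"bytes={start}-{end}"},
        )
        return urllib.request.urlopen(req, timeout=30).read()  # gzipped WARC record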

Denatonium|14 days ago

It would be nice if IA could create a browser extension or TLS-intercepting proxy that end users run on their own computers and connections, allowing crowd-sourced scraping. It would need an allow/deny-list feature for the sites to passively crawl, and I'm not sure how you could prevent data poisoning, but it would at least get around the blocking problem.

WalterBright|14 days ago

> I've sometimes dreamed of a web where every resource is tied to a hash, which can be rehosted by third parties, making archival transparent.

I wrote a short paper on that 25 years ago, but it went nowhere. I still think it is a great idea!

j45|14 days ago

Blocking the Internet Archive sounds like non-tech leadership making decisions without understanding how ubiquitous the content is, and how moot blocking becomes when it's trivial to get the same data another way.

Kind of sucks, because the news is an important part of that kind of archive.

raincole|15 days ago

Even if the site is archived on IA, AI companies will still do the same.

pigggg|15 days ago

AI companies are _already_ funding and using residential proxies. Guess how many of those proxies are acquired through compromised devices, or by tricking people into installing apps?

golem14|15 days ago

Does anyone know if Teslas do this? I noticed Tesla cars want access to local WiFi and eat up oodles of bandwidth…

jeron|14 days ago

>The only real loser is the common man who doesn't have the resources to scrape the entire web himself.

definitely, this is going to hurt those over at /r/datahoarder

toomuchtodo|15 days ago

AI browsers will be the scrapers, shipping content back to the mothership for processing and storage as users co-browse with the agentic browser.

Aurornis|15 days ago

> So instead of scraping IA once, the AI companies will use residential proxies and each scrape the site themselves, costing the news sites even more money.

News websites aren’t like those labyrinthine cgit-hosted websites that get crushed under scrapers. If 1,000 different AI scrapers hit a news website every hour, it wouldn’t even make a blip in the traffic logs.

Also, AI companies are already scraping these websites directly with their own infrastructure. It’s how they try to stay relevant and fresh.

dawnerd|14 days ago

Hello hi, I work on a news site and we absolutely notice and it does mess up traffic logs.

terminalshort|15 days ago

But don't you have to sign a license agreement that prohibits scraping in order to purchase a subscription that allows you to bypass the paywall?

nerdponx|15 days ago

It's almost as if this isn't about scraping, and more about shutting down a "free article sharing" channel that gets abused all the time.

lxgr|15 days ago

But hey, paywalled sites might be getting 2-3 additional subscriptions out of it!

zaphirplane|15 days ago

We don’t lack the technology to limit scrapers; sure, it’s an arms race against AI companies with more money than most. But why can’t this be a legal block through the TOS?