james2doyle | 2 months ago

You call it extortion of the AI companies, but isn’t stealing/crawling/hammering a site to scrape their content to resell just as nefarious? I would say Cloudflare is giving these site owners an option to protect their content and, as a byproduct, reduce their own costs of subsidizing their thieves. They can choose to turn off the crawl protection. If they don't, that tells you that they want it, doesn’t it?

cpncrunch|2 months ago

>You call it extortion of the AI companies, but isn’t stealing/crawling/hammering a site to scrape their content to resell just as nefarious?

You can easily block ChatGPT and most other AI scrapers if you want:

https://habeasdata.neocities.org/ai-bots

james2doyle|2 months ago

This is just using robots.txt and asking "pretty please, don’t scrape me".

Here is an article (from TODAY) about the case where Perplexity is being accused of ignoring robots.txt: https://www.theverge.com/news/839006/new-york-times-perplexi...

If you think a robots.txt is the answer to stopping the billion-dollar AI machine from scraping you, I don’t know what to say.
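To make the "pretty please" point concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents and the example.com URL are assumptions for illustration; `GPTBot` is the user-agent token OpenAI publishes for its crawler. The thing to notice is that this check runs on the crawler's side, so it only has an effect if the crawler chooses to run it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration: ask GPTBot to stay out,
# allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(user_agent: str, url: str) -> bool:
    """Return what a *compliant* crawler would conclude from robots.txt."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# A crawler that identifies itself honestly and honors the file stays out.
gptbot_allowed = allowed("GPTBot", "https://example.com/page")

# The same crawler under a browser-like name is "allowed", and a crawler
# that never calls can_fetch() at all is not stopped by anything here.
browser_allowed = allowed("Mozilla/5.0", "https://example.com/page")
```

Nothing in this sketch touches the server. The file is a published request, and enforcement has to happen somewhere else.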

jacobgkau|2 months ago

I'm guessing you don't manage any production web servers?

robots.txt isn't even respected by all of the American companies. Chinese ones (which often also use what are essentially botnets in Latin America and the rest of the world to evade detection) certainly don't care about anything short of dropping their packets.

mplewis|2 months ago

No you cannot! I blocked all of the user agents on a community wiki I run, and the traffic came back hours later masquerading as Firefox and Chrome. They just fucking lie to you and continue vacuuming your CPU.
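A rough sketch of why user-agent blocking fails in the way described above. The blocklist and both header strings are illustrative assumptions, not the wiki's actual configuration; the server only ever sees whatever User-Agent string the client decides to send.

```python
# Illustrative blocklist of crawler tokens (an assumption, not a complete list).
BLOCKED_AGENT_TOKENS = ("gptbot", "ccbot", "perplexitybot")


def is_blocked(user_agent: str) -> bool:
    """Server-side check: reject requests whose User-Agent names a listed bot."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_AGENT_TOKENS)


# A crawler that announces itself is caught by the filter.
honest = "Mozilla/5.0 (compatible; GPTBot/1.0)"

# The same crawler sending a browser-style header passes untouched.
spoofed = "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0"

honest_blocked = is_blocked(honest)
spoofed_blocked = is_blocked(spoofed)
```

The filter keys on a self-reported string, so it only stops clients that tell the truth about who they are.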

Sohcahtoa82|2 months ago

How are you this naive? Do you really think scrapers give a damn about your robots.txt?

chrneu|2 months ago

this is the equivalent of asking people not to speed on your street.

literalAardvark|2 months ago

Tell me you don't run a site without telling me you don't run a site