top | item 43422645

ericholscher | 11 months ago

Yep -- our story here: https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse... (quoted in the OP) -- everyone I know has a similar story who is running large internet infrastructure -- this post does a great job of rounding a bunch of them up in 1 place.

I called it when I wrote it: they are just burning their goodwill to the ground.

I will note that one of the main startups in the space worked with us directly, refunded our costs, and fixed the bug in their crawler. Facebook never replied to our emails, the link in their User Agent led to a 404 -- an engineer at the company saw our post and reached out, giving me the right email -- which I then emailed 3x and never got a reply.

pjc50|11 months ago

> just burning their goodwill to the ground

AI firms seem to be leading from a position that goodwill is irrelevant: a $100bn pile of capital, like an 800lb gorilla, does what it wants. AI will be incorporated into all products whether you like it or not; it will absorb all data whether you like it or not.

UncleMeat|11 months ago

Yep. And it is much more far reaching than that. Look at the primary economic claim offered by AI companies: to end the need for a substantial portion of all jobs on the planet. The entire vision is to remake the world into one where the owners of these companies own everything and are completely unconstrained. All intellectual property belongs to them. All labor belongs to them. Why would they need goodwill when they own everything?

"Why should we care about open source maintainers" is just a microcosm of the much larger "why should we care about literally anybody" mindset.

yubblegum|11 months ago

They are also gutting the profession of software engineering. It's a clever scam, actually: to develop software, a company will need to pay utility fees to A"I" companies, and since their products are error-prone, voilà, use more A"I" tools to correct the errors of the other tools. Meanwhile software knowledge will atrophy and soon, à la WALL-E, we'll have software "developers" with 'soft bones' floating around on conveyor seats, slurping 'sugar water', getting fat, and not knowing even how to tie their software shoelaces.

b112|11 months ago

Yes, like the Pixel camera app, which mangles photos with AI processing, and users complain that it won't let them take pics.

One issue was a pic with text in it, like a store sign. Users were complaining that it kept asking for better focus on the text in the background, before allowing a photo. Alpha quality junk.

Which is what AI is, really.

anthk|11 months ago

AI tarpits && lim (human-curated content/mediocre AI answers -> 0) = AIs crumbling into dust by themselves.

davidmurdoch|11 months ago

We, the people, might need to come up with a few proverbial tranquilizer guns here soon

Sharlin|11 months ago

Maxim 1: "Pillage, then burn."

ferguess_k|11 months ago

That's pretty much what our future would look like -- you are irrelevant. Well, I mean, we are already pretty much irrelevant nowadays, but even more so in the "progressive" future of AI.

asveikau|11 months ago

Rules and laws are for other people. A lot of people reading this comment, having mistaken "fake it til you make it" or "better to not ask permission" for good life advice, are responsible for perpetuating these attitudes, which are fundamentally narcissistic.

slowmovintarget|11 months ago

"... you have the lawyers clean it all up later." - Eric Schmidt

kordlessagain|11 months ago

> AI will be incorporated into all products whether you like it or not

AI will be incorporated into the government, whether you like it or not.

FTFY!

huijzer|11 months ago

I think the logic is more like “we have to do everything we can to win or we will disappear”. Capitalism is ruthless and the big techs finally have some serious competition, namely: each other as well as new entrants.

Like why else can we just spam these AI endpoints and pay $0.07 at the end of the month? There is some incredible competition going on. And so far everyone except big tech is the winner so that’s nice.

lgeek|11 months ago

> One crawler downloaded 73 TB of zipped HTML files in May 2024 [...] This cost us over $5,000 in bandwidth charges

I had to do a double take here. I run (mostly using dedicated servers) infrastructure that handles a few hundred TB of traffic per month, and my traffic costs are on the order of $0.50 to $3 per TB (mostly depending on the geographical location). AWS egress costs are just nuts.
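A quick back-of-envelope check of the gap implied by those two sets of numbers (all figures taken from the comments above; the exact AWS rate card will vary by region and tier):

```python
# Per-TB cost implied by the incident quoted above: $5,000 for 73 TB.
aws_cost_per_tb = 5000 / 73          # roughly $68.5/TB
dedicated_range = (0.50, 3.00)       # $/TB on dedicated servers, as quoted

# That puts AWS egress at roughly 20x-140x the dedicated-server rate.
markup_low = aws_cost_per_tb / dedicated_range[1]
markup_high = aws_cost_per_tb / dedicated_range[0]
print(f"${aws_cost_per_tb:.2f}/TB, markup {markup_low:.0f}x-{markup_high:.0f}x")
```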

Ray20|11 months ago

I think the uncontrolled price of cloud traffic is real fraud, and a way bigger problem than some AI companies ignoring robots.txt. One time we went over the limit on Netlify or something, and they charged over a thousand dollars for a couple of TB.

Suppafly|11 months ago

>which I then emailed 3x and never got a reply.

Send a bill to their accounts payable team instead.

ldoughty|11 months ago

Detect the AI scraper and inject an in-page notice that by continuing they accept your terms of use.

The terms of use charge them per page load under some abuse clause.

Profit... By sending them invoices :-)
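A minimal sketch of that idea, assuming a hypothetical list of crawler User-Agent substrings and an illustrative fee notice (the substrings, the fee, and the function names are all made up for illustration, not taken from the thread):

```python
# Hypothetical User-Agent substrings for known AI crawlers.
AI_CRAWLER_UAS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider")

# Illustrative notice; real terms would need actual legal drafting.
TERMS_NOTICE = (
    '<div class="crawler-terms">By continuing to crawl this site you '
    'agree to a fee of $1 per page load, invoiced to your operator.</div>'
)

def is_ai_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent matches a known AI crawler substring."""
    return any(bot in user_agent for bot in AI_CRAWLER_UAS)

def serve(html_body: str, user_agent: str) -> str:
    """Prepend the terms notice for AI crawlers; serve humans unchanged."""
    if is_ai_crawler(user_agent):
        return TERMS_NOTICE + html_body
    return html_body
```

Whether such injected terms are actually enforceable is a separate (legal) question; the sketch only shows the mechanism.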

TuringNYC|11 months ago

>> which I then emailed 3x and never got a reply.

At which point does the crawling cease to be a bug/oversight and constitute a DDOS?

ferguess_k|11 months ago

Maybe just feed them dynamically generated garbage information? More fun than no information.

gnz11|11 months ago

OP’s linked blog post mentioned they got hit with a large spike in bandwidth charges. Sending them garbage information costs money.

InfamousRece|11 months ago

It does not even have to be dynamically generated. Just pre-generate a few thousand static pages of AI slop and serve that. Probably cheaper than dynamic generation.
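One way to sketch the pre-generation approach (the word list, page count, and URL scheme are arbitrary placeholders; a seeded RNG keeps generation deterministic and cheap):

```python
import random

# Arbitrary filler vocabulary; any word list works.
WORDS = ["framework", "synergy", "tensor", "pipeline", "quantum",
         "blockchain", "paradigm", "scalable", "leverage", "holistic"]

def slop_page(seed: int, n_words: int = 200, n_links: int = 5) -> str:
    """Deterministically generate one page of filler text with cross-links,
    so crawlers keep finding 'new' URLs inside the slop."""
    rng = random.Random(seed)
    body = " ".join(rng.choice(WORDS) for _ in range(n_words))
    links = "".join(f'<a href="/slop/{rng.randrange(10_000)}.html">next</a>'
                    for _ in range(n_links))
    return f"<html><body><p>{body}</p>{links}</body></html>"

# Generate once, write to disk, and let the web server serve them statically.
pages = {f"/slop/{i}.html": slop_page(i) for i in range(1000)}
```

Serving the pre-rendered files afterwards costs only bandwidth, which is the point being made above.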

m463|11 months ago

I kind of suspect some of these companies have more horsepower and bandwidth in one crawler than a lot of these projects have in their entire infrastructure.

spenczar5|11 months ago

Thanks for writing about this. Is it clear that this is from crawlers, as opposed to dynamic requests triggered by LLM tools, like Claude Code fetching docs on the fly?

Freebytes|11 months ago

Along with having block lists, perhaps you could poison your results with randomly generated bad code that will not work, visible only to bots (display: none when rendered) -- the bots will ingest it, but a human never would.
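A rough sketch of that wrapper; it hides the poisoned snippet from rendered view while leaving it in the HTML that naive tag-stripping scrapers ingest (the function name and sample snippet are made up for illustration):

```python
import html

def poison_block(bad_code: str) -> str:
    """Wrap deliberately broken code in a container humans never see.
    display:none hides it in browsers; scrapers that strip tags without
    evaluating CSS still pick up the text."""
    return ('<div style="display:none" aria-hidden="true">'
            f'<pre><code>{html.escape(bad_code)}</code></pre>'
            '</div>')

# Usage: append the hidden block to an otherwise normal page.
page = ("<article>real docs here</article>"
        + poison_block("def sort(xs): return xs[::2]  # subtly wrong"))
```

Crawlers that do evaluate CSS (or render pages headlessly) would not be fooled, so this only raises the bar rather than closing the door.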

ATechGuy|11 months ago

Wondering if you tried stopping such bots with a Captcha?