bonaldi | 6 months ago

Not sure the emotive language is warranted. Message appears to be “if you use robots.txt AND archive sites honor it AND you are dumb enough to delete your data without a backup THEN you won’t have a way to recover and you’ll be sorry”.

It also presumes that dealing with automated traffic is a solved problem, which with the volumes of LLM scraping going on, is simply not true for more hobbyist setups.

QuercusMax | 6 months ago

I just plain don't understand what they mean by "suicide note" in this case, and it doesn't seem to be explained in the text.

A better analogy would be "Robots.txt is a note saying your backdoor might be unlocked".

stickfigure | 6 months ago

The meaning is reasonably clear to me: Robots.txt says "Don't archive this data. When the website dies, all the information dies with it." It's a kind of death pact.
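
For what it's worth, that reading has a concrete form: the Internet Archive's crawler has historically identified itself as ia_archiver, so a robots.txt like this (a minimal sketch, not quoted from the article) is effectively a "do not preserve me" instruction:

    User-agent: ia_archiver
    Disallow: /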

chao- | 6 months ago

I also cannot figure out from context what part of this is "suicide".

I don't even think it's a note saying your back door is unlocked. As others and I shared in a sibling comment thread, we have worked at places that implemented robots.txt specifically to keep bots out of nearly-infinite tarpits of links that lead to nearly-identical pages.
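
To make that concrete, that kind of defensive robots.txt looks roughly like this (a minimal sketch with hypothetical paths, not any real site's file):

    User-agent: *
    # Endless "next month" pagination
    Disallow: /calendar/
    # Combinatorial "compare any two products" pages
    Disallow: /compare/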

paulddraper | 6 months ago

> volumes of LLM scraping

FWIW, I have not seen a reputable report on how much web-scraping traffic has actually grown in the past 3 years.

(Wikipedia being a notable exception... but I would guess Wikipedia has seen a far larger increase than anything else.)

esseph | 6 months ago

It's hard to measure because of attribution, but it is absolutely happening at very high volume. I got an alert from our monitoring tools this morning, right when I woke up, that some external sites were being scraped. That happens multiple times a day.

A lot of it is coming through botnets of compromised residential endpoints.

tolmasky | 6 months ago

Wikipedia says their traffic from AI bots increased roughly 50% [1], which is a lot, sure, but nowhere near the point where you'd have to rearchitect your site. And this checks out: if it were actually debilitating, you'd notice Wikipedia's performance degrade. It hasn't. You'd see them taking additional steps to combat it. They haven't. Their CDN handles it just fine. They don't even bother telling AI bots to just download the tarballs they make available specifically for this exact use case.

More importantly, Wikipedia almost certainly represents the ceiling of traffic increase. But luckily, we don't have to work with such a coarse estimate, because according to Cloudflare, the total increase from combined search and AI bots over the last year (May 2024 - May 2025) has been just... 18% [2].

The way you hear people talk about it, though, you'd think servers were now receiving DDoS levels of traffic. For the life of me, I have not been able to find a single verifiable case of this. Which, if you think about it, makes sense: it's hard to generate that sort of traffic; that's one of the reasons people pay for botnets. You don't bring a site to its knees merely by accidentally "not making your scraper efficient". The only other possible explanation would be a large number of scrapers simultaneously but independently hitting sites, but that doesn't check out either: there aren't thousands of different AI scrapers out there that in aggregate produce huge traffic spikes [2]. Again, the total combined increase is 18%.

The more you look into this accepted idea that we are in some sort of AI-scraping traffic apocalypse, the less anything makes sense. Then you look at Anubis, the "AI scraping mitigator", and... I dunno. The author contends that one of its tricks is that it not only uses JavaScript, but "modern JavaScript like ES6 modules," and that this is one of the ways it detects/prevents AI scrapers [3]. No one is rolling their own JS engine for a scraper such that they'd be blocked by an inability to keep up with the latest ECMAScript spec. You just use an existing JS engine, all of which support these features. It would actually be a challenge to find an old JS engine these days.

The entire thing seems to be built on the misconception that the "common" way to build a scraper is something curl-esque. That idea is based entirely on the Google scraper, which itself doesn't even work that way anymore, and only ever did because it was written in the 90s. Everyone who rolls their own scraper these days just uses Puppeteer. It is completely unrealistic to build a scraper that doesn't run JavaScript and wait for the page to "settle down", because so many pages, even blogs, are just entirely client-side-rendered SPAs. If I were to write a quick and dirty scraper today, I would trivially make it through Anubis' protections... by doing literally nothing, without even realizing Anubis exists, just by using standard scraping practices with Puppeteer (see the sketch below). Meanwhile, Anubis is absolutely blocking plenty of real humans, with the author, for example, telling people to turn on cookies so that Anubis can do its job [4]. I don't think Anubis is blocking anything other than humans and Message's link preview generator.
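
To illustrate, here's roughly what that quick-and-dirty scraper looks like (a minimal sketch against Puppeteer's public API; the URL and the scrape helper are placeholders of mine, nothing Anubis-specific):

    import puppeteer from "puppeteer";

    async function scrape(url: string): Promise<string> {
      // Full headless Chromium: executes all of the page's JavaScript,
      // ES6 modules included, exactly like a real visitor's browser.
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      // Cookies are on by default; "networkidle0" waits until the page
      // has stopped making network requests, i.e. it has "settled down".
      await page.goto(url, { waitUntil: "networkidle0" });
      const html = await page.content();
      await browser.close();
      return html;
    }

    scrape("https://example.com/").then((html) => console.log(html.length));

Note there is nothing here that "passes" a JavaScript check; running all the JavaScript is simply the default behavior.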

I'm investigating further; I think this entire thing may have started due to some confusion, but I want to see if I can actually confirm that before speculating.

1. https://www.techspot.com/news/107407-wikipedia-servers-strug... (notice the clickbait title vs. the actual contents)

2. https://blog.cloudflare.com/from-googlebot-to-gptbot-whos-cr...

3. https://codeberg.org/forgejo/discussions/issues/319#issuecom...

4. https://github.com/TecharoHQ/anubis/issues/964#issuecomment-...

bigbuppo | 6 months ago

Or major web properties for that matter.