top | item 44944318

(no title)

tolmasky | 6 months ago

Wikipedia says their traffic increased roughly 50% [1] from AI bots, which is a lot, sure, but nowhere near the amount where you'd have to rearchitect your site or something. And this checks out, if it was actually debilitating, you'd notice Wikipedia's performance degrade. It hasn't. You'd see them taking some additional steps to combat this. They haven't. Their CDN handles it just fine. They don't even both telling AI bots to just download the tarballs they specifically make available for this exact use case.

More importantly, Wikipedia almost certainly represents the ceiling of traffic increase. But luckily, we don't have to work with such coarse estimation, because according to Cloudflare, the total increase from combined search and AI bots in the last year (May 2024 - May 2025), has just been... 18% [2].

The way you hear people talk about it though, you'd think that servers are now receiving DDOS-levels of traffic or something. For the life of me I have not been able to find a single verifiable case of this. Which if you think about it makes sense... It's hard to generate that sort of traffic, that's one of the reasons people pay for botnets. You don't bring a site to its knees merely by accidentally "not making your scraper efficient". So the only other possible explanation would be such a larger number of scrapers simultaneously but independently hitting sites. But this also doesn't check out. There aren't thousands of different AI scrapers out there that in aggregate are resulting in huge traffic spikes [2]. Again, the total combined increase is 18%.

The more you look into this accepted idea that we are in some sort of AI scraping traffic apocalypse, the less anything makes sense. You then look at this Anubis "AI scraping mitigator" and... I dunno. The author contends that one if its tricks is that it not only uses JavaScript, but "modern JavaScript like ES6 modules," and that this is one of the ways it detects/prevents AI scrapers [3]. No one is rolling their own JS engine for a scraper such that they are being blocked from their inability to keep up with the latest ECMAScript spec. You are just using an existing JS engine, all of which support all these features. It would actually be a challenge to find an old JS engine these days.

The entire things seems to be built on the misconception that the "common" way to build a scraper is doing something curl-esque. This idea is entirely based on the google scraper which itself doesn't even work that way anymore, and only ever did because it was written in the 90s. Everyone that rolls their own scraper these days just uses Puppeteer. It is completely unrealistic to make a scraper that doesn't run JavaScript and wait for the page to "settle down" because so many pages, even blogs, are just entirely client-side rendered SPAs. If I were to write a quick and dirty scraper today I would trivially make it through Anubis' protections... by doing literally nothing and without even realizing Anubis exists. Just using standard scraping practices with Puppeteer. Meanwhile Anubis is absolutely blocking plenty of real humans, with the author for example telling people to turn on cookies so that Anubis can do its job [4]. I don't think Anubis is blocking anything other than humans and Message's link preview generator.

I'm investigating further, but I think this entire thing may have started due to some confusion, but want to see if I can actually confirm this before speculating further.

1. https://www.techspot.com/news/107407-wikipedia-servers-strug... (notice the clickbait title vs. the actual contents)

2. https://blog.cloudflare.com/from-googlebot-to-gptbot-whos-cr...

3. https://codeberg.org/forgejo/discussions/issues/319#issuecom...

4. https://github.com/TecharoHQ/anubis/issues/964#issuecomment-...

discuss

zzo38computer|6 months ago

> It is completely unrealistic to make a scraper that doesn't run JavaScript and wait for the page to "settle down" because so many pages, even blogs, are just entirely client-side rendered SPAs.

I specifically want a search engine that does not run JavaScript, so that it only finds documents that do not require JavaScripts to display the text being searched. (This is not the same as excluding everything that has JavaScripts; some web pages use JavaScripts but can still display the text even without it.)

> Meanwhile Anubis is absolutely blocking plenty of real humans, with the author for example telling people to turn on cookies so that Anubis can do its job [4]. I don't think Anubis is blocking anything other than humans and Message's link preview generator.

These are some of the legitimiate problems with Anubis (and this is not the only way that you can be blocked by Anubis). Cloudflare can have similar problems, although its working is a bit different so it is not exactly the same working.

tolmasky|6 months ago

> I specifically want a search engine that does not run JavaScript, so that it only finds documents that do not require JavaScripts to display the text being searched. (This is not the same as excluding everything that has JavaScripts; some web pages use JavaScripts but can still display the text even without it.)

Sure... but off-topic, right? AI companies are desperate for high quality data, and unlike search scrapers, are actually not supremely time sensitive. That is to say, they don't benefit from picking up on changes seconds after they are published. They essentially take a "snapshot" and then do a training run. There is no "real-time updating" of an AI model. So they have all the time in the world to wait for a page to reach an ideal state, as well as all the incentive in the world to wait for that too. Since the data effectively gets "baked into the model" and then is static for the entire lifetime of the model, you over-index on getting the data, not getting fast, or cheap, or whatever.

xena|6 months ago

Hi, main author of Anubis here. How am I meant to store state like "user passed a check" without cookies? Please advise.

tolmasky|6 months ago

If the rest of my post is accurate, that's not the actual concern, right? Since I'm not sure if the check itself is meaningful. From what is described in the documentation [1], I think the practical effect of this system is to block users running old mobile browsers or running browsers like Opera Mini in third world countries where data usage is still prohibitively expensive. Again, the off-the-shelf scraping tools [2] will be unaffected by any of this, since they're all built on top of Puppeteer, and additionally are designed to deal with the modern SPA web which is (depressingly) more or less isomorphic to a "proof-of-work".

If you are open to jumping on a call in the next week or two I'd love to discuss directly. Without going into a ton of detail, I originally started looking into this because the group I'm working with is exploring potentially funding a free CDN service for open source projects. Then this AI scraper stuff started popping up, and all of a sudden it looked like if these reports were true it might make such a project no longer economically realistic. So we started trying to collect data and concretely nail down what we'd be dealing with and what this "post-AI" traffic looks like.

As such, I think we're 100% aligned on our goals. I'm just trying to understand what's going on here since none of the second-order effects you'd expect from this sort of phenomenon seem to be present, and none of the places where we actually have direct data seem to show this taking place (and again, Cloudflare's data seems to also agree with this). But unless you already own a CDN, it's very hard to get a good sense of what's going on globally. So I am totally willing to believe this is happening, and am very incentivized to help if so.

EDIT: My email is my HN username at gmail.com if you want to schedule something.

1. https://anubis.techaro.lol/docs/design/how-anubis-works

2. https://apify.com/apify/puppeteer-scraper

rafram|6 months ago

Cloudflare Turnstile doesn't require cookies. It stores per-request "user passed a check" state using a query parameter. So disabling cookies will just cause you to get a challenge on every request, which is annoying but ultimately fair IMO.

jddj|6 months ago

Doesn't Wikipedia offer full tarballs?

This would imaginably put some downward pressure on scraper volume.

tolmasky|6 months ago

From the first paragraph in my comment:

> You'd see them taking some additional steps to combat this. They haven't. Their CDN handles it just fine. They don't even both telling AI bots to just download the tarballs they specifically make available for this exact use case.

Yes, they do. But they aren't in a rush to tell AI companies this, because again, this is not actually a super meaningful amount of traffic increase for them.

imtringued|6 months ago

I don't think you understand the purpose of Anubis. If you did then you'd realize that running a web browser with JS enabled doesn't bypass anything.

tolmasky|6 months ago

By bypass I mean "successfully pass the challenge". Yes, I also have to sit through the Anubis interstitial pages, so I promise I know it's not being "bypassed". (I'll update the post to remove future confusion).

Do you disagree that a trivial usage of an off-the-shelf puppeteer scraper[1] has no problem doing the proof-of-work? As I mentioned in this comment [2], AI scrapers are not on some time crunch, they are happy to wait a second or two for the final content to load (there are plenty of normal pages that take longer than the Anubis proof of work does to complete), and also are unfazed by redirects. Again, these are issues you deal with normal everyday scraping. And also, do you disagree with the traffic statics from Cloudflare's site? If we're seeing anything close to that 18% increase then it would not seem to merit user-visible levels of mitigation. Even if it was 180% you wouldn't need to do this. nginx is not constantly on the verge of failing from a double digit "traffic spike".

As I mentioned in my response to the Anubis author here [3], I don't want this to be misinterpreted as a "defense of AI scrapers" or something. Our goals are aligned. The response there goes into detail that my motivation is that a project I am working on will potentially not be possible if I am wrong and this AI scraper phenomenon is as described. I have every incentive in the world to just want to get to the bottom of this. Perhaps you're right, and I still don't understand the purpose of Anubis. I want to! Because currently neither the numbers nor the mitigations seem to line up.

BTW, my same request extends to you, if you have direct experience with this issue, I'd love to jump on a call to wrap my head around this.

My email is my HN username at gmail.com if you want to reach out, I'd greatly appreciate it!

1. https://news.ycombinator.com/item?id=44944761

2. https://apify.com/apify/puppeteer-scraper

3. https://news.ycombinator.com/item?id=44944886