top | item 46836344

megous | 1 month ago

I'd still like the ability to just block a crawler by its IP range, but these days nope.

1 Hz is 86400 hits per day, or 600k hits per week. That's just one crawler.
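The arithmetic holds up; a quick sanity check:

```python
# One request per second (1 Hz), sustained:
hits_per_day = 60 * 60 * 24        # 86,400 hits/day
hits_per_week = hits_per_day * 7   # 604,800 hits/week, i.e. roughly 600k
print(hits_per_day, hits_per_week)
```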

Just checked my access log... 958k hits in a week from 622k unique addresses.

95% of it is fetching random links from the u-boot repository that I host. I blocked all of the GCP/AWS/Alibaba and of course Azure cloud IP ranges.

It's now almost all coming from "residential" and "mobile" IP address space in completely random places all around the world. I'm pretty sure my u-boot fork is not that popular. :-D

Every request is a new IP address, and available IP space of the crawler(s) is millions of addresses.

I don't host a popular repo. I host a bot attraction.

edg5000 | 29 days ago

In addition to a rate limit, a page limit per IP is needed, specifically for things like source code repos (with massive commit histories), mailing list archives, etc.

A whitelist would be needed for sites where getting all the pages makes sense. And on top of the 1 Hz rate limit, a cap of around 1k requests per IP per day would probably be needed.
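A minimal sketch of the two limits described here (a per-IP rate limit plus a per-IP daily quota, with a whitelist bypass). Thresholds, names, and the whitelist entry are illustrative, not taken from any particular server:

```python
from collections import defaultdict

RATE_HZ = 1.0                 # at most one request per second per IP
DAILY_QUOTA = 1000            # at most 1k pages per IP per day
WHITELIST = {"203.0.113.7"}   # hypothetical IP allowed to fetch everything

last_seen = {}                  # ip -> timestamp of last allowed request
daily_count = defaultdict(int)  # ip -> requests served so far today

def allow(ip: str, now: float) -> bool:
    """Return True if this request should be served."""
    if ip in WHITELIST:
        return True
    if daily_count[ip] >= DAILY_QUOTA:
        return False                          # daily quota exhausted
    prev = last_seen.get(ip)
    if prev is not None and now - prev < 1.0 / RATE_HZ:
        return False                          # faster than the allowed rate
    last_seen[ip] = now
    daily_count[ip] += 1
    return True
```

As megous points out above, per-IP limits like this are exactly what a crawler rotating through millions of residential addresses defeats: each request arrives from a fresh IP, so no single address ever trips the limit.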

I can see now why Google doesn't have much solid competition (Yandex/Baidu arguably don't compete due to network segmentation).

Scraping reliably is hard, and the chance of kicking Google off their throne may be even further reduced due to AI crawler abuse.

PS: 958k hits is a lot! Even if your pages were a tiny 7.8k each (HN front page minus assets), that would be about 7G of data (about 4.6 Bee Movies in 720p h265).
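The back-of-the-envelope figure checks out (reading 7.8k as KiB):

```python
hits = 958_000
page_bytes = 7.8 * 1024           # ~7.8 KiB per page
total = hits * page_bytes         # ~7.65e9 bytes
print(total / 2**30)              # roughly 7 GiB
```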

kstrauser | 1 month ago

I’ve been enduring that exact same traffic pattern.

I used Anubis and a cookie redirect to cut the load on my Forgejo server by around 3 orders of magnitude: https://honeypot.net/2025/12/22/i-read-yann-espositos-blog.h...

plagiarist | 1 month ago

Aha, that's where the anime girl is from. What sort of traffic was getting past that but still thwarted by the cookie tactic?

I guess the bots are all spoofing consumer browser UAs and just the slightest friction outside of well-known tooling will deter them completely.