item 46970870


vachina | 19 days ago

Scrapers are relentless, but not at DDoS levels in my experience.

Make sure your caches are warm and responses take no more than 5ms to construct.
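A minimal sketch of the idea, assuming a Python handler where `render_page` stands in for an expensive (hypothetical) page render: memoize finished responses so repeat hits are served from memory, well under the 5ms budget.

```python
import time
from functools import lru_cache

def render_page(path):
    # Stand-in for an expensive render (templates, git tree walk, etc.)
    time.sleep(0.05)  # simulate 50ms of work
    return f"<html>page for {path}</html>"

@lru_cache(maxsize=4096)
def cached_render(path):
    return render_page(path)

# Warm the cache ahead of traffic for the hottest paths.
for p in ("/", "/log/", "/tree/"):
    cached_render(p)

start = time.perf_counter()
body = cached_render("/")  # warm hit: no render, just a dict lookup
elapsed_ms = (time.perf_counter() - start) * 1000
assert elapsed_ms < 5  # within the 5ms budget
```

In a real deployment this role is usually played by a front cache (nginx, Varnish) rather than in-process memoization, but the principle is the same: the hot paths should never reach the expensive code.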


mzajc | 18 days ago

I'm also dealing with a scraper flood on a cgit instance. These conclusions come from just under 4M lines of logs collected in a 24h period.

- Caching helps, but is nowhere near a complete solution. Of the 4M requests, I've observed 1.5M unique paths, which still overloads my server.

- Limiting request time might work, but is more likely to just cause issues for legitimate visitors. 5ms is not a lot for cgit, but with a higher limit you are unlikely to keep up with the flood of requests.

- IP ratelimiting is useless. I've observed 2M unique IPs, and the top one from the botnet only made 400 well-spaced-out requests.

- GeoIP blocking does wonders - just 5 countries (VN, US, BR, BD, IN) are responsible for 50% of all requests. Unfortunately, this also causes problems for legitimate users.

- User-Agent blocking can catch some odd requests, but I haven't been able to make much use of it besides adding a few static rules. Maybe it could do more with TLS request fingerprinting, but that doesn't seem trivial to set up on nginx.
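Numbers like the ones above come straight out of log analysis. A rough sketch of how you might reproduce them, assuming nginx's "combined" log format (the regex and sample lines below are illustrative, not from the actual logs):

```python
import re
from collections import Counter

# Loose parser for nginx "combined" log format (an assumption;
# adjust the regex to your actual log_format).
LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?:\S+) (?P<path>\S+) [^"]*" \d+ \d+ "[^"]*" "(?P<ua>[^"]*)"'
)

def tally(lines):
    ips, paths, agents = Counter(), Counter(), Counter()
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue
        ips[m["ip"]] += 1
        paths[m["path"]] += 1
        agents[m["ua"]] += 1
    return ips, paths, agents

sample = [
    '203.0.113.7 - - [01/Jan/2025:00:00:00 +0000] "GET /cgit/repo/log/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '198.51.100.9 - - [01/Jan/2025:00:00:01 +0000] "GET /cgit/repo/tree/ HTTP/1.1" 200 2048 "-" "SomeBot/1.0"',
]
ips, paths, agents = tally(sample)
print(len(ips), "unique IPs;", len(paths), "unique paths")
```

`ips.most_common(10)` is what shows you that per-IP rate limiting is hopeless when the top offender only made 400 requests out of millions.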

Imustaskforhelp | 18 days ago

Quick question: the bots you mention are from a 24h period, but how long will this "attack" continue for?

Because this is something that's happening continuously, & I have observed so many HN posts like these (Anubis iirc was created by its creator out of such frustration too): Git servers being scraped to the point that it's effectively a DDoS.

watermelon0 | 18 days ago

Great, now we need caching for something that's seldom (relatively speaking) used by people.

Let's not forget that scrapers can be quite stupid. For example, if you have phpBB installed, which by default puts the session ID in a query parameter if cookies are disabled, many scrapers will scrape every URL numerous times, each with a different session ID. Caching also doesn't help you here, since URLs are unique per visitor.
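One mitigation sketch for that failure mode (the helper names here are assumptions; only `sid` as phpBB's session parameter comes from the forum software itself): normalize the URL into a cache key by dropping per-visitor query parameters before lookup, so every session-decorated variant maps to one entry.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

# Query parameters that vary per visitor and must never be part of
# the cache key. "sid" is phpBB's session ID parameter.
PER_VISITOR = {"sid"}

def cache_key(url):
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in PER_VISITOR]
    return parts.path + ("?" + urlencode(kept) if kept else "")

# Two visitors, same page, different session IDs -> one cache entry.
a = cache_key("/viewtopic.php?t=42&sid=abc123")
b = cache_key("/viewtopic.php?t=42&sid=def456")
assert a == b == "/viewtopic.php?t=42"
```

The same idea exists in front caches as "ignore these query parameters when building the cache key"; the catch is that the stripped `sid` must not leak a logged-in user's page into the shared cache, so it only applies to anonymous traffic.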

kimos | 18 days ago

You’re describing changing the base assumption for software reachable on the internet. “Assume all possible unauthenticated URLs will be hit basically constantly”. Bots used to exist but they were rare traffic spikes that would usually behave well and could mostly be ignored. No longer.

bombcar | 17 days ago

The biggest problem is “dynamic” content that really isn’t - we had a tag view that allowed combinations of tags in search and the AI bots would get tangled up in there and never leave.

And each hit was server-heavy. We blocked that entire “feature”.
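The trap is combinatorial: with n tags there are 2^n - 1 possible combination pages, so a crawler never runs out of "new" URLs. A hedged sketch of the arithmetic and of a server-side guard (the URL scheme and function names are assumptions, not the actual app):

```python
from math import comb

# With just 20 tags, the number of distinct multi-tag views explodes.
n_tags = 20
views = sum(comb(n_tags, k) for k in range(1, n_tags + 1))
print(views)  # 2**20 - 1 = 1048575 crawlable combinations

def allow_tag_view(tags, max_tags=1):
    # Serve only single-tag pages; answer combinations with 404/410
    # so crawlers have nothing left to enumerate.
    return len(tags) <= max_tags

assert allow_tag_view(["linux"])
assert not allow_tag_view(["linux", "nginx", "cgit"])
```

Blocking the whole feature, as above, is the blunt version of capping `max_tags`; a `Disallow` rule in robots.txt helps only with crawlers that honor it, which these bots evidently don't.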