A year or two ago I personally encountered scraping bots that were scraping every possible resultant page from a given starting point. So if they scraped a search results page, they would also scrape every single distinct combination of facets on that search (including nonsensical combinations, e.g. products matching the filter "weight<2lbs AND weight>2lbs").
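The combinatorial blowup is easy to underestimate. A rough sketch (the facet names and values below are made up for illustration) of how one search page fans out into dozens of distinct crawlable URLs:

```python
# Hypothetical illustration: the number of distinct facet combinations a
# crawler can request grows multiplicatively with each filter dimension.
from itertools import product

# Example facet values (invented for this sketch); None = facet not applied.
facets = {
    "color": ["red", "blue", "green", None],
    "size": ["S", "M", "L", None],
    "weight": ["<2lbs", ">2lbs", None],
}

# Every combination of applied/unapplied facet values is a distinct URL.
combinations = list(product(*facets.values()))
print(len(combinations))  # 4 * 4 * 3 = 48 distinct pages from one search
```

Add a few more facets and the count runs into the thousands, most of them empty or nonsensical result sets.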
We ended up having to block entire ASNs and several subnets (lots from Facebook IPs, interestingly)
If you have a lot of pages, AI bots will scrape every single one on a loop - wikis generally don't have anywhere near as many pages as a site with an auto-incremented entity primary ID. I have a few million pages on a tiny website and it gets hammered by AI bots all day long. I can handle it, but it's a nuisance and they're basically just scraping garbage (statistics pages of historical matches, or user pages with essentially no content).
Many of them don't even self-identify and end up scraping with shrouded user-agents or via bot-farms. I've had to block entire ASNs just to tone it down. It also hurts good-faith actors who genuinely want to build on top of our APIs because I have to block some cloud providers.
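The subnet-level blocking described above can be sketched with nothing but the standard library; the CIDR ranges here are documentation/test prefixes, not real bot networks:

```python
# Minimal sketch of CIDR-based request blocking, stdlib only.
# The ranges below are reserved example prefixes (TEST-NET), not real ASNs.
import ipaddress

BLOCKED_SUBNETS = [
    ipaddress.ip_network("203.0.113.0/24"),   # TEST-NET-3 (example only)
    ipaddress.ip_network("198.51.100.0/24"),  # TEST-NET-2 (example only)
]

def is_blocked(client_ip: str) -> bool:
    """Return True if the client IP falls inside any blocked subnet."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in BLOCKED_SUBNETS)

print(is_blocked("203.0.113.57"))  # inside a blocked /24
print(is_blocked("192.0.2.1"))     # outside every blocked range
```

In practice you'd do this at the firewall or load balancer rather than in application code, and blocking a whole ASN means expanding it into its announced prefixes first.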
I would guess that I'm getting anywhere from 10-25 AI bot requests (maybe more) per real user request - and at scale that ends up being quite a lot. I route bot traffic to separate pods just so it doesn't hinder my real users' experience[0]. Keep in mind that they're hitting deeply cold links so caching doesn't do a whole lot here.
[0] this was more of a fun experiment than anything explicitly necessary, but it's proven useful in ways I didn't anticipate
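The pod-routing idea boils down to classifying each request and picking a backend pool. A toy sketch, where the marker strings and pool names are assumptions for illustration:

```python
# Toy sketch: route suspected bots to a dedicated backend pool so they
# can't degrade latency for real users. Markers and pool names are
# illustrative assumptions, not a real bot-detection scheme.
BOT_MARKERS = ("bot", "crawler", "spider")

def pick_backend(user_agent: str) -> str:
    """Return the backend pool a request should be routed to."""
    ua = user_agent.lower()
    if any(marker in ua for marker in BOT_MARKERS):
        return "bot-pool"
    return "user-pool"

print(pick_backend("Mozilla/5.0 (compatible; ExampleBot/1.0)"))  # bot-pool
print(pick_backend("Mozilla/5.0 (Windows NT 10.0)"))             # user-pool
```

Real deployments would do this at the ingress/load-balancer layer, and user-agent matching only catches the bots that self-identify; the shrouded ones need IP- or behavior-based signals.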
How many requests per second do you get? I also see a lot of bot traffic, but nowhere near enough to hit the servers significantly, and I render most stuff on the server directly.
There are a lot of factors: how well your content lends itself to being cached by a CDN, the tech you (or your predecessors) chose to build it with, and how many unique pages you have. Even with pretty aggressive caching, having a couple million pages indexed adds up real fast. Especially if you weren't fortunate enough to inherit a project using a framework that makes server-side rendering easy.
It's not "written too slow" if you e.g. only get 50 users a week, though. If bots add so much load that you need to go optimise your website for them, then that's a bot problem not a website problem.
Yes, yes - definitely people don't know what they're doing, and it's not that they're operating at a scale or on a problem you are not. MetaBrainz cannot cache all of these links, as most of them are hardly ever hit. Try to assume good intent.
The worst thing is calendars/schedules. Many crawlers try to load every single day, with day view, week view, and month view.
Those pages are dynamically generated and virtually limitless
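A quick sketch of why the calendar URL space is effectively unbounded (the URL scheme here is a made-up example): every date multiplied by every view is a distinct dynamically generated page.

```python
# Illustrative sketch: a calendar with day/week/month views yields a new
# URL for every date, so the crawlable page space has no natural bound.
from datetime import date, timedelta

def calendar_urls(start: date, days: int) -> list[str]:
    """Enumerate the distinct view URLs a crawler could request."""
    urls = []
    for offset in range(days):
        d = start + timedelta(days=offset)
        for view in ("day", "week", "month"):
            urls.append(f"/calendar/{view}/{d.isoformat()}")
    return urls

# One year of dates already produces over a thousand distinct pages.
print(len(calendar_urls(date(2024, 1, 1), 365)))  # 365 * 3 = 1095
```

Since a crawler can keep incrementing the date forever, no amount of caching or pre-rendering covers the full space.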