A year or two ago I personally encountered scraping bots that were scraping every possible resultant page from a given starting point. So if they scraped a search results page, they would also scrape every single distinct combination of facets on that search (including nonsensical combinations, e.g. products matching the filter "weight<2lbs AND weight>2lbs").
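The combinatorial blowup is easy to underestimate. A rough sketch (the facet names and values below are made up for illustration) of how one search page fans out into dozens of distinct crawlable URLs:

```python
# Hypothetical illustration: the number of distinct facet combinations a
# crawler can request grows multiplicatively with each filter dimension.
from itertools import product

# Example facet values (invented for this sketch); None = facet not applied.
facets = {
    "color": ["red", "blue", "green", None],
    "size": ["S", "M", "L", None],
    "weight": ["<2lbs", ">2lbs", None],
}

# Every combination of applied/unapplied facet values is a distinct URL.
combinations = list(product(*facets.values()))
print(len(combinations))  # 4 * 4 * 3 = 48 distinct pages from one search
```

Add a few more facets and the count runs into the thousands, most of them empty or nonsensical result sets.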
We ended up having to block entire ASNs and several subnets (lots from Facebook IPs, interestingly)
If you have a lot of pages, AI bots will scrape every single one on a loop - wikis generally don't have anywhere near as many pages as a site with an auto-incremented entity primary ID. I have a few million pages on a tiny website and it gets hammered by AI bots all day long. I can handle it, but it's a nuisance and they're basically just scraping garbage (statistics pages of historical matches, or user pages with essentially no content).
Many of them don't even self-identify and end up scraping with shrouded user-agents or via bot-farms. I've had to block entire ASNs just to tone it down. It also hurts good-faith actors who genuinely want to build on top of our APIs because I have to block some cloud providers.
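The subnet-level blocking described above can be sketched with nothing but the standard library; the CIDR ranges here are documentation/test prefixes, not real bot networks:

```python
# Minimal sketch of CIDR-based request blocking, stdlib only.
# The ranges below are reserved example prefixes (TEST-NET), not real ASNs.
import ipaddress

BLOCKED_SUBNETS = [
    ipaddress.ip_network("203.0.113.0/24"),   # TEST-NET-3 (example only)
    ipaddress.ip_network("198.51.100.0/24"),  # TEST-NET-2 (example only)
]

def is_blocked(client_ip: str) -> bool:
    """Return True if the client IP falls inside any blocked subnet."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in BLOCKED_SUBNETS)

print(is_blocked("203.0.113.57"))  # inside a blocked /24
print(is_blocked("192.0.2.1"))     # outside every blocked range
```

In practice you'd do this at the firewall or load balancer rather than in application code, and blocking a whole ASN means expanding it into its announced prefixes first.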
I would guess that I'm getting anywhere from 10-25 AI bot requests (maybe more) per real user request - and at scale that ends up being quite a lot. I route bot traffic to separate pods just so it doesn't hinder my real users' experience[0]. Keep in mind that they're hitting deeply cold links so caching doesn't do a whole lot here.
[0] this was more of a fun experiment than anything explicitly necessary, but it's proven useful in ways I didn't anticipate
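The pod-routing idea boils down to classifying each request and picking a backend pool. A toy sketch, where the marker strings and pool names are assumptions for illustration:

```python
# Toy sketch: route suspected bots to a dedicated backend pool so they
# can't degrade latency for real users. Markers and pool names are
# illustrative assumptions, not a real bot-detection scheme.
BOT_MARKERS = ("bot", "crawler", "spider")

def pick_backend(user_agent: str) -> str:
    """Return the backend pool a request should be routed to."""
    ua = user_agent.lower()
    if any(marker in ua for marker in BOT_MARKERS):
        return "bot-pool"
    return "user-pool"

print(pick_backend("Mozilla/5.0 (compatible; ExampleBot/1.0)"))  # bot-pool
print(pick_backend("Mozilla/5.0 (Windows NT 10.0)"))             # user-pool
```

Real deployments would do this at the ingress/load-balancer layer, and user-agent matching only catches the bots that self-identify; the shrouded ones need IP- or behavior-based signals.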
How many requests per second do you get? I also see a lot of bot traffic, but nowhere near enough to hit the servers significantly, and I render most stuff on the server directly.
There are a lot of factors: how well your content lends itself to being cached by a CDN, the tech you (or your predecessors) chose to build it with, and how many unique pages you have. Even with pretty aggressive caching, having a couple million pages indexed adds up real fast. Especially if you weren't fortunate enough to inherit a project using a framework that makes server-side rendering easy.
It's not "written too slow" if you e.g. only get 50 users a week, though. If bots add so much load that you need to go optimise your website for them, then that's a bot problem not a website problem.
Yes, yes - definitely people don't know what they're doing, and it's not that they're operating at a scale or on a problem you are not. MetaBrainz cannot cache all of these links, as most of them are hardly ever hit. Try to assume good intent.
The worst thing is calendars/schedules. Many crawlers try to load every single day, with day view, week view, and month view.
Those pages are dynamically generated and virtually limitless
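A quick sketch of why the calendar URL space is effectively unbounded (the URL scheme here is a made-up example): every date multiplied by every view is a distinct dynamically generated page.

```python
# Illustrative sketch: a calendar with day/week/month views yields a new
# URL for every date, so the crawlable page space has no natural bound.
from datetime import date, timedelta

def calendar_urls(start: date, days: int) -> list[str]:
    """Enumerate the distinct view URLs a crawler could request."""
    urls = []
    for offset in range(days):
        d = start + timedelta(days=offset)
        for view in ("day", "week", "month"):
            urls.append(f"/calendar/{view}/{d.isoformat()}")
    return urls

# One year of dates already produces over a thousand distinct pages.
print(len(calendar_urls(date(2024, 1, 1), 365)))  # 365 * 3 = 1095
```

Since a crawler can keep incrementing the date forever, no amount of caching or pre-rendering covers the full space.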