FieryMechanic | 2 months ago

The way most scrapers work (I've written plenty of them) is that you basically get the page, extract all the links, and drill down.
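The drill-down described above is essentially a breadth-first crawl: fetch a page, collect its links, queue anything unseen, repeat. A minimal sketch using only the standard library; `LinkExtractor`, `crawl`, and the injected `fetch` callable are illustrative names, not part of any particular scraper:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch, max_pages=100):
    """Breadth-first drill-down: fetch a page, queue its links, repeat.
    `fetch` is any callable mapping a URL to its HTML body."""
    seen, queue = {start_url}, [start_url]
    while queue and len(seen) <= max_pages:
        url = queue.pop(0)
        parser = LinkExtractor()
        parser.feed(fetch(url))
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen
```

Passing `fetch` in as a parameter keeps the traversal logic separate from the HTTP layer, so the same loop works with `requests`, `urllib`, or a headless browser underneath.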

discuss

order

conartist6|2 months ago

So the easiest strategy to hamper them if you know you're serving a page to an AI bot is simply to take all the hyperlinks off the page...?

That doesn't even sound all that bad if you happen to catch a human. You could even tell them explicitly with a banner that they were browsing the site in no-links mode for AI bots. Put one link to an FAQ page in the banner, since that at least is easily cached.
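Serving a no-links version could be as simple as unwrapping every anchor tag so the text survives but the link does not, then prepending the banner. A regex is enough to show the idea (a real implementation would use an HTML parser); `strip_links` and the `/faq` path are hypothetical:

```python
import re

# Unwrap every <a ...>...</a>, keeping only the inner text.
A_TAG = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)

def strip_links(html, faq_href="/faq"):
    """Return the page with all hyperlinks removed, plus a banner
    explaining no-links mode with a single link to a cached FAQ."""
    body = A_TAG.sub(r"\1", html)
    banner = ('<div class="banner">You are browsing in no-links mode. '
              f'<a href="{faq_href}">Why?</a></div>')
    return banner + body
```

A drill-down scraper hitting this page finds only the FAQ link, so the crawl dead-ends after one hop.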

FieryMechanic|2 months ago

When I used to build these scrapers for people, I would usually pretend to be a browser. This normally meant changing the UA and making the headers look like a real browser's. Obviously this would fail against more advanced bot-detection techniques.
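Pretending to be a browser mostly comes down to replacing the default library headers (e.g. `Python-urllib/3.x`) with ones copied from a real session. A stdlib sketch; the header values are illustrative, and any serious detection looks well beyond headers:

```python
import urllib.request

# Headers mimicking a Chrome session (values illustrative).
BROWSER_HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0.0.0 Safari/537.36"),
    "Accept": ("text/html,application/xhtml+xml,application/xml;"
               "q=0.9,*/*;q=0.8"),
    "Accept-Language": "en-GB,en;q=0.9",
}

def browser_request(url):
    """Build a request that presents itself as a desktop browser."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)

# Usage: urllib.request.urlopen(browser_request("https://example.com"))
```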

Failing that, I would use Chrome, PhantomJS, or similar to browse the page in a real headless browser.

tigranbs|2 months ago

And obviously, you need things fast, so you parallelize a bunch!
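The usual way to parallelize this in Python is to fire off many fetches concurrently but cap them with a semaphore, so neither the target nor your own machine gets flooded. A minimal asyncio sketch; `fetch_all` and the injected `fetch` coroutine are illustrative names:

```python
import asyncio

async def fetch_all(urls, fetch, max_concurrency=10):
    """Fetch many URLs concurrently, capped by a semaphore.
    `fetch` is any coroutine mapping a URL to its body
    (e.g. a thin wrapper around an async HTTP client)."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with sem:  # at most max_concurrency in flight
            return await fetch(url)

    # gather preserves input order in its results
    return await asyncio.gather(*(bounded(u) for u in urls))
```

The semaphore is the knob that matters: uncapped `gather` over thousands of URLs is exactly the kind of thing that makes a scrape mysteriously slow or gets your IP blocked.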

FieryMechanic|2 months ago

I was collecting UK bank account sort code numbers (buying a database at the time cost a huge amount of money). I had spent a bunch of time using asyncio to speed up the scraping and wondered why it was still going so slow; it turned out I had left Fiddler profiling in the background.