I'm sure they have, but not all of them have the means (and time) to bypass some of the measures; defeating single-source-IP[1] bots and script-kiddies can cut down on the number of bots.
sangnoir|4 years ago
It's an arms race that I've fought on both sides of: deploying anti-abuse measures and writing a very well-behaved scraper. Unfortunately, some sites block all scrapers regardless of their behavior, so you have to figure out empirically how the detection is done and enjoy your short-lived victories while they last. On the defensive side, you continually have to update your detection heuristics, and your victories are just as short-lived.
1. I also have a read-only Twitter archiving bot that doesn't hide its nature (the handle is "${NOUN}bot"). It runs from a single VPS at regular cron intervals. It'd be trivial for Twitter to detect this and make it not worth my while.