holstvoogd | 4 years ago
They follow links that are explicitly marked as do-not-follow, they do not even try to limit their request rate, they spoof their user agent strings, and so on. These bots cause real problems and cost real money. I do not think that kind of misuse is ethical. In fact, using this tool to circumvent protections can turn your scraping into a DDoS attack, which I do not feel is ethical.
If your bot behaves itself though, public information is public imo. Just don't take down websites, respect rate limits and do not follow 'no-follow' links.
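For what "respect rate limits" could look like on the scraper side, here is a minimal sketch of a per-host throttle that enforces a minimum gap between requests. The class name `PoliteThrottle` and its parameters are made up for illustration, and the clock and sleep functions are injectable only so the behaviour can be checked without real waiting.

```python
import time


class PoliteThrottle:
    """Enforce a minimum delay between consecutive requests to one host.

    Hypothetical helper for illustration; not taken from any scraping tool.
    """

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._last_request_at = None  # no request made yet

    def wait(self):
        """Sleep until min_interval_s has passed since the last call.

        Returns the number of seconds slept (0.0 if no wait was needed).
        """
        now = self._clock()
        delay = 0.0
        if self._last_request_at is not None:
            elapsed = now - self._last_request_at
            delay = max(0.0, self.min_interval_s - elapsed)
        if delay > 0:
            self._sleep(delay)
        # Record the moment the request is actually allowed to go out.
        self._last_request_at = now + delay
        return delay
```

Calling `throttle.wait()` before every request with `PoliteThrottle(12.0)` caps the bot at roughly 5 requests per minute against that host. The right interval is a judgment call and depends on the site.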
To give an idea of the size of the issue, we have websites for customers that have maybe 5 hits per minute from actual users. Then _suddenly_ you go to 500 hits/minute for a couple of hours, because some bot is trying to scrape a calendar and is now looking for events in 1850 or whatever. (Not the greatest software that these links are still there tbh, but that is out of my control.)
Or another situation, not entirely related, but interesting I think: a few years back, for days on end, 80% of our total traffic came from random IPs across China, and requests could be traced through HTTP referrers where the 'user' had apparently opened a page in one province, then traveled to the other side of China and clicked a link 2 hours later.
All these things are relatively easy to mitigate, but that doesn't make it ethical.
thekyle|4 years ago
The only thing that rel="nofollow" does is tell search engines not to use that link in their PageRank computation.
If you do want to block well-behaved crawlers from crawling parts of your site, the proper way to do that is to use robots.txt rules.
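As a rough sketch of how a well-behaved crawler might honour such rules, the snippet below uses Python's standard `urllib.robotparser` module. The robots.txt content, the `/calendar/` path, the `ExampleBot` user agent, and the `example.com` URLs are all invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the path is illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /calendar/
"""


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, user_agent, url):
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
print(is_allowed(parser, "ExampleBot", "https://example.com/about"))
print(is_allowed(parser, "ExampleBot", "https://example.com/calendar/1850/01"))
```

A real crawler would fetch the site's actual robots.txt (for example with `RobotFileParser.set_url` and `read`) and run this check before every request. Note that this only helps against bots that choose to obey it.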
wu_187|4 years ago
The issue with bots hosted on AWS or any cloud for that matter is that as a web host you can't just block the IPs because legitimate traffic comes from them in the form of CMS plugins, backups, etc.
judge2020|4 years ago
twothamendment|4 years ago
Been there, done that - at least on the side of fixing it. Anyone who implements a calendar: don't make previous and next links that let someone travel through time forever.
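One way to bound the calendar, sketched under assumptions: only emit a previous or next link while the target month stays inside a fixed window around today. The function names and the default window of 12 months back and 24 months ahead are arbitrary choices for illustration, not anything from the thread.

```python
from datetime import date


def shift_month(year, month, delta):
    """Return the (year, month) that lies delta months away."""
    index = year * 12 + (month - 1) + delta
    return index // 12, index % 12 + 1


def month_nav_links(year, month, today, months_back=12, months_ahead=24):
    """Return (prev, next) as (year, month) tuples.

    A side is None when its link should be omitted. If the requested month
    is itself outside the window, both sides are None.
    """
    current = today.year * 12 + (today.month - 1)
    shown = year * 12 + (month - 1)
    earliest = current - months_back
    latest = current + months_ahead

    if shown < earliest or shown > latest:
        return None, None

    prev_link = shift_month(year, month, -1) if shown - 1 >= earliest else None
    next_link = shift_month(year, month, 1) if shown + 1 <= latest else None
    return prev_link, next_link
```

With this in place a crawler that follows every link runs out of calendar pages after a few dozen requests. A request for a month outside the window, such as January 1850, gets no navigation links at all, and the server could reasonably answer it with a 404.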
I've always wondered how much bot traffic costs us - but never actually tried to figure it out. It is a good portion of our traffic - even when we block a lot of it.