top | item 44114066

captainmuon | 9 months ago

As somebody who does some scraping / crawling for legitimate uses, I'm really unhappy with this development. I understand people have valid reasons why they don't want their content scraped. Maybe they want to sell it - I can understand that, although I don't like it. Maybe they are opposed to it for fundamental reasons. I, for one, would like my content to be spread as widely as possible. I want my arguments to be incorporated into AIs, so I can reach more people. But of course that is just how I feel about certain content I'd write; others have different goals.

It gets annoying when you have the right to scrape something - either because the owner of the data gave you the OK or because it is openly licensed. But then the webmaster can't be bothered to relax the rate limiter for you, and nobody can give you a nice API. Now people are putting their Open Educational Resources, their open source software, even their freaking essays about openness that they want the world to read behind Anubis. It makes me shake my head.

I understand perfectly that it is annoying when badly written bots hammer your site. But then maybe HTTP and those bots are the problem. Maybe we should make it easier for site owners to push their content somewhere where we can scrape it more easily?

Analemma_|9 months ago

If you scrape at a reasonable rate and don't clear session cookies, your scraper can solve the Anubis proof-of-work (PoW) challenge the same as a user and you're fine. Anubis is for distributed scrapers which make requests at absurd rates.
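
A minimal sketch of the "reasonable rate, keep your cookies" part, using only the Python standard library. `PoliteSession`, `min_interval` and the 2-second default are made-up names and values for illustration, not anything Anubis prescribes. It also does not solve the proof-of-work challenge itself; as far as I understand, that takes a client that runs the challenge page's JavaScript. The point is only that one long-lived cookie jar plus a delay between requests keeps whatever session cookie you are given.

```python
import time
import urllib.request
from http.cookiejar import CookieJar


class PoliteSession:
    """Hypothetical helper: one shared cookie jar, one minimum delay between requests."""

    def __init__(self, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests (assumed value)
        self._clock = clock               # injectable so the throttle can be tested
        self._sleep = sleep
        self._last = None                 # time of the previous request, if any
        # The jar lives as long as the session, so cookies set by the site
        # are sent back on every later request instead of being discarded.
        self.jar = CookieJar()
        self.opener = urllib.request.build_opener(
            urllib.request.HTTPCookieProcessor(self.jar)
        )

    def wait_turn(self):
        """Block until at least min_interval has passed since the last request."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now

    def get(self, url, timeout=30):
        """Fetch a URL through the throttle, reusing the session cookies."""
        self.wait_turn()
        with self.opener.open(url, timeout=timeout) as resp:
            return resp.read()
```

Usage would be something like `s = PoliteSession(min_interval=5.0)` followed by repeated `s.get(url)` calls from a single process, rather than a new client (and a new empty cookie jar) per request.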

berkes|9 months ago

Sounds like something IPFS could be a nice solution for.

yladiz|9 months ago

> I understand people have valid cases why they don't want their content scraped. Maybe they want to sell it - I can understand that, although I don't like it.

To be frank: it’s not your content, it’s theirs, and it doesn’t matter whether you like it or not. They can decide what they want to do with it; you’re not entitled to it. Yes, there are some cases where you personally have permission to scrape, or the license explicitly permits it, but this isn’t the norm.

The bigger issue isn’t that people don’t want their content to be read; it’s that, in most cases, they want it to be read and consumed by a human, and they want their server resources (network bandwidth, CPU, etc.) to be used in a manageable way. If these bots were written to be respectful, then maybe we wouldn’t be in this situation. These bots poisoned the well, and their actions affect respectful bots too.
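
One concrete reading of "respectful" is checking robots.txt before fetching and honoring any crawl delay it states. A rough sketch with Python's standard `urllib.robotparser`; the robots.txt text, the `example-bot` user agent and the `check` helper are invented for illustration, and a real crawler would load the site's actual file rather than an inline string.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, inlined so the example runs without network access.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def check(robots_text, user_agent, url):
    """Return (allowed, crawl_delay) for this user agent and URL."""
    rp = RobotFileParser()
    rp.parse(robots_text.splitlines())
    return rp.can_fetch(user_agent, url), rp.crawl_delay(user_agent)


print(check(ROBOTS_TXT, "example-bot", "https://example.org/essay.html"))
print(check(ROBOTS_TXT, "example-bot", "https://example.org/private/notes"))
```

The first call reports the URL as allowed and the second as disallowed, both with a crawl delay of 5 seconds that the bot is expected to wait between requests.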