I will be using Pydoll for the following legitimate use case: a franchisee is given access to their data as controlled by the franchise through a web site. The franchisee uses browser automation to retrieve its data but now the franchise has deployed a WAF that blocks Chrome webdriver. This is not a public web site and the data is not public so it frustrates the franchisee because it just wants its data which is paid for by its franchisee fees.
Well it can be abused of course, but capthas are used abusively as well, so I would say it's fair game.
Lots of use cases for scraping are not DoS or information stealing, but mere automation.
Proof of work should be used in these cases, it deters massive scraping abuse by making it too expensive at scale, while allowing legitimate small scale automation.
I am also wondering about this, and in case you have a chef's knife in your kitchen, I would also like to hear if you have any comment on how that may be abused.
Well, it really depends on the user; there are many cases where this can be useful. Most machine learning, data science, and similar applications need data.
You know that the captcha is there to prevent you from doing e.g. automated data mining, depends on the site obviously. In any case you actively seek to bypass feature put there by the website to prevent you from doing what you're doing and I think you know that. Does that not give you any moral concerns?
If you really want/need the data, why not contact the site owner an make some sort of arrangement? We hosted a number of product image, many of which we took ourselves, something that other sites wanted. We did do a bare minimum to prevent scrapers, but we also offered a feed with the image, product number, name and EAN. We charged a small fee, but you then got either an XML feed or a CSV and you could just pick out the new additions and download those.
>Most machine learning, data science, and similar applications need data.
So. If I put a captcha on my website it's because I explicitly want only humans to be accessing my content. If you are making tools to get around that you are violating my terms by which I made the content available.
No one should need a captcha. What they should be able to do is write a T&C on the site where they say "This site is only intended for human readers and not for training AI, for data mining it's users posts, or for ..... and if you do use it for any of these you agree to pay me $100,000,000,000." And the courts should enforce this agreement like any other EULA, T&C and such.
voidmain0001|8 months ago
Galanwe|8 months ago
Lots of use cases for scraping are not DoS or information stealing, but mere automation.
Proof of work should be used in these cases, it deters massive scraping abuse by making it too expensive at scale, while allowing legitimate small scale automation.
mannyv|8 months ago
e9a8a0b3aded|8 months ago
bobajeff|8 months ago
mrweasel|8 months ago
wesselbindt|8 months ago
nhinck2|8 months ago
thalissonvs|8 months ago
mrweasel|8 months ago
If you really want/need the data, why not contact the site owner an make some sort of arrangement? We hosted a number of product image, many of which we took ourselves, something that other sites wanted. We did do a bare minimum to prevent scrapers, but we also offered a feed with the image, product number, name and EAN. We charged a small fee, but you then got either an XML feed or a CSV and you could just pick out the new additions and download those.
wang_li|8 months ago
So. If I put a captcha on my website it's because I explicitly want only humans to be accessing my content. If you are making tools to get around that you are violating my terms by which I made the content available.
No one should need a captcha. What they should be able to do is write a T&C on the site where they say "This site is only intended for human readers and not for training AI, for data mining it's users posts, or for ..... and if you do use it for any of these you agree to pay me $100,000,000,000." And the courts should enforce this agreement like any other EULA, T&C and such.
xxxthrowawayxxx|8 months ago
[deleted]