I will be using Pydoll for the following legitimate use case: a franchisee is given access to its data, as controlled by the franchise, through a web site. The franchisee uses browser automation to retrieve its data, but the franchise has now deployed a WAF that blocks Chrome WebDriver. This is not a public web site and the data is not public, so it frustrates the franchisee, which just wants the data it already pays for through its franchise fees.
Well, it can be abused, of course, but captchas are used abusively as well, so I would say it's fair game.
Lots of use cases for scraping are not DoS or information stealing, but mere automation.
Proof of work should be used in these cases: it deters massive scraping abuse by making it too expensive at scale, while allowing legitimate small-scale automation.
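To make the idea concrete, here is a minimal hashcash-style sketch (the function names and difficulty scheme are my own illustration, not any specific WAF's protocol): the server issues a challenge, the client spends CPU finding a nonce whose hash meets a difficulty target, and verification costs the server a single hash. Raising the difficulty makes bulk scraping expensive while a single small-scale client barely notices.

```python
import hashlib
import itertools

def solve_pow(challenge: str, difficulty: int) -> int:
    """Search for a nonce such that sha256(challenge + nonce) starts
    with `difficulty` hex zeros. Cost grows ~16x per difficulty step."""
    target = "0" * difficulty
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce

def verify_pow(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server-side check: one hash, regardless of how hard solving was."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)
```

At difficulty 4–5 a solve takes a fraction of a second; scaled across millions of requests it becomes a real cost for an abusive scraper.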
I am also wondering about this, and if you have a chef's knife in your kitchen, I would also like to hear any comments on how that may be abused.
Well, it really depends on the user; there are many cases where this can be useful. Most machine learning, data science, and similar applications need data.
It's been a bit, but I'm pretty sure use of CDP can be detected. Has anything changed on that front, or are you aware and you're just bypassing with automated captcha handling?
CDP itself is not detectable. It turns out that other libraries like Puppeteer and Playwright often leave obvious traces, like creating contexts with common prefixes or defining attributes on the navigator object.
I did a clean implementation on top of CDP, without leaving many signals to track. I added realistic interactions, among other measures.
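For context on what a thin CDP layer looks like, here is a rough sketch (my own illustration, not Pydoll's actual code): raw protocol commands are plain JSON sent over a WebSocket, and "realistic interactions" can mean synthesizing intermediate, slightly jittered cursor positions for `Input.dispatchMouseEvent` instead of teleporting the mouse to the target in one step.

```python
import json
import random

def cdp_command(msg_id: int, method: str, **params) -> str:
    """Serialize one raw Chrome DevTools Protocol command
    (the string would be sent over the browser's WebSocket)."""
    return json.dumps({"id": msg_id, "method": method, "params": params})

def human_mouse_path(x0, y0, x1, y1, steps=20):
    """Points along the move with small random jitter, so the cursor
    doesn't jump from start to target in a single frame."""
    path = []
    for i in range(1, steps + 1):
        t = i / steps
        path.append((x0 + (x1 - x0) * t + random.uniform(-2.0, 2.0),
                     y0 + (y1 - y0) * t + random.uniform(-2.0, 2.0)))
    path[-1] = (float(x1), float(y1))  # land exactly on the target
    return path

# Each intermediate point becomes one Input.dispatchMouseEvent command:
moves = [cdp_command(i, "Input.dispatchMouseEvent", type="mouseMoved", x=x, y=y)
         for i, (x, y) in enumerate(human_mouse_path(0, 0, 200, 120))]
```

The point is that nothing here injects JavaScript or touches `navigator`, which is where the obvious Puppeteer/Playwright-style traces tend to come from.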
> Say goodbye to webdriver compatibility nightmares
That's cool, but Chrome is the only browser I've had these issues with. We have a cron process that uses Selenium, initially with Chrome, and every time there was a Chrome browser update we had to update the WebDriver. I switched it to Firefox and haven't had to update the WebDriver since.
I like the async portion of this but this seems like MechanicalSoup?
*EDIT* MechanicalSoup doesn't necessarily have async, AFAIK.
I don't think it's similar. The library has many features that Selenium doesn't. It has few dependencies, which makes installation faster; it allows scraping multiple tabs simultaneously because it's async; and it has much simpler syntax and element searching, without all the verbosity of Selenium. Even for cases that don't involve captchas, I still believe it's definitely worth using.
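The multi-tab benefit is the standard asyncio fan-out pattern: one coroutine per tab, gathered concurrently. A generic sketch with plain asyncio (illustrating the pattern only, not Pydoll's actual API):

```python
import asyncio

async def scrape_tab(url: str) -> str:
    """Placeholder for per-tab work: navigate, wait for elements, extract."""
    await asyncio.sleep(0.01)  # stands in for network/render time
    return f"data from {url}"

async def scrape_all(urls):
    # All tabs progress concurrently; total wall time tracks the
    # slowest tab rather than the sum of all of them.
    return await asyncio.gather(*(scrape_tab(u) for u in urls))

results = asyncio.run(scrape_all(["https://a.example", "https://b.example"]))
```

A synchronous driver like classic Selenium would process those URLs one after another in a single thread; the async model is what lets one process keep many tabs busy at once.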
As someone who uses ISPs and browser configurations that seem to frustrate CloudFlare/reCaptcha to the point of frequently having to solve them during day-to-day browsing, it would be interesting to develop a proxy server that could automatically/transparently solve captchas for me.
renegat0x0|8 months ago
This was very useful for me: no need to set up Selenium for the nth time. I just use one crawling server for all my projects.
Link:
https://github.com/rumca-js/crawler-buddy
pokemyiout|8 months ago
However, I did find this for their CF Turnstile bypass [2]:
[1] https://autoscrape-labs.github.io/pydoll/deep-dive/
[2] https://github.com/autoscrape-labs/pydoll/blob/5fd638d68dd66...