
Show HN: PyDoll – Async Python scraping engine with native CAPTCHA bypass

136 points | thalissonvs | 8 months ago | github.com

39 comments


renegat0x0|8 months ago

I think I will add this to my AIO package. My project lets you crawl pages. It provides a barebones page, and scraping results are passed as JSON.

This was very useful for me, because I don't have to set up Selenium for the nth time. I just use one crawling server for my projects.

Link:

https://github.com/rumca-js/crawler-buddy

jdnier|8 months ago

Hi, just wondering what you're thinking about how your tool might be abused.

voidmain0001|8 months ago

I will be using Pydoll for the following legitimate use case: a franchisee is given access to their data as controlled by the franchise through a web site. The franchisee uses browser automation to retrieve its data but now the franchise has deployed a WAF that blocks Chrome webdriver. This is not a public web site and the data is not public so it frustrates the franchisee because it just wants its data which is paid for by its franchisee fees.

Galanwe|8 months ago

Well, it can be abused of course, but captchas are used abusively as well, so I would say it's fair game.

Lots of use cases for scraping are not DoS or information stealing, but mere automation.

Proof of work should be used in these cases, it deters massive scraping abuse by making it too expensive at scale, while allowing legitimate small scale automation.

mannyv|8 months ago

Gee, I have this computer thing. How can it be abused?

bobajeff|8 months ago

Hi, as a non-webdev I want to know: wouldn't rate limiting make this a non-concern?

wesselbindt|8 months ago

I am also wondering about this, and in case you have a chef's knife in your kitchen, I would also like to hear if you have any comment on how that may be abused.

thalissonvs|8 months ago

Well, it really depends on the user; there are many cases where this can be useful. Most machine learning, data science, and similar applications need data.

mfrye0|8 months ago

Checking it out and I see you're using CDP.

It's been a bit, but I'm pretty sure use of CDP can be detected. Has anything changed on that front, or are you aware and you're just bypassing with automated captcha handling?

thalissonvs|8 months ago

CDP itself is not detectable. It turns out that other libraries like Puppeteer and Playwright often leave obvious traces, such as creating contexts with common prefixes or defining attributes on the navigator object.

I did a clean implementation on top of CDP, without many signals for tracking. I added realistic interactions, among other measures.

hk1337|8 months ago

> Say goodbye to webdriver compatibility nightmares

That's cool, but Chrome is the only browser I have had these issues with. We have a cron process that uses Selenium, initially with Chrome, and every time there was a Chrome update we had to update the web driver. I switched it to Firefox and haven't had to update the web driver since.

I like the async portion of this but this seems like MechanicalSoup?

*EDIT* MechanicalSoup doesn't necessarily have async, AFAIK.

thalissonvs|8 months ago

I don't think it's similar. The library has many other features that Selenium doesn't have. It has few dependencies, which makes installation faster, allows scraping multiple tabs simultaneously because it’s async, and has a much simpler syntax and element searching, without all the verbosity of Selenium. Even for cases that don’t involve captchas, I still believe it’s definitely worth using.
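
A rough illustration of the "multiple tabs simultaneously" point, using only the standard library. This is not Pydoll's API: `scrape_tab` is a hypothetical stand-in for "open a tab, load the page, extract data", simulated here with a sleep.

```python
import asyncio


async def scrape_tab(url: str, delay: float = 0.1) -> str:
    # Hypothetical placeholder for real browser work; the sleep simulates
    # waiting on the network while other tabs make progress.
    await asyncio.sleep(delay)
    return f"scraped {url}"


async def scrape_all(urls: list[str]) -> list[str]:
    # gather() runs the coroutines concurrently and returns the results
    # in the same order as the inputs.
    return await asyncio.gather(*(scrape_tab(url) for url in urls))


if __name__ == "__main__":
    pages = ["https://example.com/a", "https://example.com/b"]
    print(asyncio.run(scrape_all(pages)))
```

Because the waits overlap, the total time stays close to the slowest single tab rather than the sum of all of them.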

VladVladikoff|8 months ago

I had the same problem and just added a few lines of code which check the version and update it if required.
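
A sketch of what such a check might look like, assuming the browser and driver report versions in the form "name major.minor.build.patch" and that a differing major version is the signal to update. The function names and the sample strings are illustrative, not taken from the commenter's code.

```python
import re


def major_version(version_output: str) -> int:
    # Pull the leading number out of text such as "Google Chrome 125.0.6422.60".
    match = re.search(r"(\d+)\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version found in {version_output!r}")
    return int(match.group(1))


def driver_needs_update(browser_output: str, driver_output: str) -> bool:
    # Assumption: the driver must match the browser's major version.
    return major_version(browser_output) != major_version(driver_output)
```

In a cron job, the two strings would come from running the browser and the driver with their version flags, and a `True` result would trigger the download of a matching driver.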

at0mic22|8 months ago

This one is not using WebDriver, but the raw Chrome DevTools Protocol.

nickspacek|8 months ago

As someone who uses ISPs and browser configurations that seem to frustrate Cloudflare/reCAPTCHA to the point of frequently having to solve them during day-to-day browsing, it would be interesting to develop a proxy server that could automatically/transparently solve captchas for me.

at0mic22|8 months ago

The Cloudflare captcha can be easily passed with a browser extension, not much different from the suggested bypass.

whall6|8 months ago

The web scraping arms race continues.

bobbyraduloff|8 months ago

Is there a write-up on how you deal with the captchas?

pokemyiout|8 months ago

I was also interested in this and couldn't find more information in the docs, even in the deep dive [1].

However, I did find this for their CF Turnstile bypass [2]:

    async def _bypass_cloudflare(
        self,
        event: dict,
        custom_selector: Optional[tuple[By, str]] = None,
        time_before_click: int = 2,
        time_to_wait_captcha: int = 5,
    ):
        """Attempt to bypass Cloudflare Turnstile captcha when detected."""
        try:
            selector = custom_selector or (By.CLASS_NAME, 'cf-turnstile')
            element = await self.find_or_wait_element(
                *selector, timeout=time_to_wait_captcha, raise_exc=False
            )
            element = cast(WebElement, element)
            if element:
                # adjust the external div size to shadow root width (usually 300px)
                await self.execute_script('argument.style="width: 300px"', element)
                await asyncio.sleep(time_before_click)
                await element.click()
        except Exception as exc:
            logger.error(f'Error in cloudflare bypass: {exc}')

[1] https://autoscrape-labs.github.io/pydoll/deep-dive/

[2] https://github.com/autoscrape-labs/pydoll/blob/5fd638d68dd66...

thalissonvs|8 months ago

You can check the official documentation; there's a 'Deep Dive' section.