

I've been working with scrapers quite a lot. I started with Python requests, then moved to Scrapy, then Selenium, then Selenium via undetected_chromedriver, and once that started being detected during a Chrome update about a year ago, I switched over to SeleniumBase. It got by undetected, but to get it working with pre-downloaded drivers, I had to look into the code. I have never, and I mean never, in all my Python years, seen such a horrible mess of code. We are talking 1000-line-long methods with 20-30 different flags and branches. Just horrible. I have since switched to Playwright, which also seems to be undetected and offers a much saner interface.


seleniumbase|1 year ago

SeleniumBase modifies the webdriver so that it doesn't get detected when used alongside the CDP stealth mode and methods. It'll download chromedriver for you. Not sure what you mean by the multiple branches, as there's just the primary one. What 1000-line methods are you referring to? By "flags", do you mean the different command-line options available? As for Playwright, it isn't undetected. See https://github.com/microsoft/playwright/issues/23884#issueco... - "Playwright is an end-to-end testing framework, where we expect you test on your own environments. Bypassing any form of bot protection is not something we can act on. Thanks for your understanding." By contrast, SeleniumBase is OK with bypassing bot detection: https://github.com/seleniumbase/SeleniumBase/blob/master/exa...

cyanmagenta|1 year ago

Not the commenter, but “multiple branches” in this context is referring to if/else statements in the code, not source-control branches. Similarly, “flags” is referring to function arguments like a boolean “is_original.” More generally, they are just saying that the code has long, complicated, bug-prone functions.
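
To make the distinction concrete, here is a minimal sketch of what "flags" and "branches" mean in this sense. The function `build_browser_args`, its parameter names, and the proxy address and window size are all hypothetical and written only for illustration; none of this is taken from the SeleniumBase source.

```python
from itertools import product


# Hypothetical helper for illustration only; not taken from SeleniumBase.
def build_browser_args(headless=False, use_proxy=False, is_mobile=False):
    """Return Chrome command-line switches chosen by three boolean flags."""
    args = []
    if headless:  # branch driven by the first flag
        args.append("--headless")
    else:
        args.append("--start-maximized")
    if use_proxy:  # branch driven by the second flag (placeholder address)
        args.append("--proxy-server=127.0.0.1:8080")
    if is_mobile:  # branch driven by the third flag (placeholder size)
        args.append("--window-size=390,844")
    return args


def count_flag_combinations(num_flags):
    """Count the True/False combinations of num_flags independent flags."""
    return sum(1 for _ in product([False, True], repeat=num_flags))
```

Three independent boolean flags already give 2**3 = 8 combinations of inputs to reason about and test. If the flags in a 20-flag function were all independent, the upper bound would be 2**20 combinations, which is the kind of growth people have in mind when they call such functions bug-prone.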

That said, I just spent a few minutes browsing the SeleniumBase repo, and honestly it didn’t seem that unusual to me. Would be interested in seeing a specific example of what the commenter had in mind.

mdaniel|1 year ago

Rather than a point-by-point rebuttal as the sibling requests, I think this sums up the coding style pretty well: https://github.com/seleniumbase/SeleniumBase/blob/v4.33.11/s...

harrall|1 year ago

That's not amazing code but that's not that bad. In the grand scheme of things, that's not code debt that would ever seriously make my life any harder.

seleniumbase|1 year ago

That method came from code that I accepted in a PR from December 31, 2019: https://github.com/seleniumbase/SeleniumBase/pull/459 Not a true representation of most of the code today.

bryanrasmussen|1 year ago

Maybe I am just a cynic, but I would expect Playwright to be detected when using Chrome. I mean, I would expect it to be to Google's benefit to make that happen, for the sake of making reCAPTCHA detect bots better.

That's actually why I've been scrapping my Playwright automation (because I expect I will encounter problems even if it hasn't happened yet; cynical and paranoid) and moving towards writing a browser extension to automate Firefox.

Basically, my use case is automating tedious things for myself, not running bots at scale, so it is imperative not to get caught being "not human", because then I risk account problems.

robertlagrant|1 year ago

How can Google make that happen? Playwright's made by Microsoft. It can use Firefox as a browser as well as Chrome.

pryelluw|1 year ago

Enterprise Python code. Somehow ends up being worse than Java enterprise code. I’m too used to it at this point.

seleniumbase|1 year ago

The "Python vs Java" debate is probably one for a different Hacker News post. :)

edm0nd|1 year ago

Not sure if you have explored rolling captcha solving services into your code. It's easy as fuck and you can do it in a few lines of code. Check out DeathByCaptcha or AntiCaptcha. It's like $2.99 per 1,000 successfully solved captchas.

I guess my point is, you don't always have to be undetected or write 1000 lines of code to scrape or do whatever you need to do. It saved me a ton of headaches and time when captchas were involved.

mintzworld|1 year ago

SeleniumBase is free, open-source, can bypass CAPTCHAs with a few lines of code, and it works from the free tier of GitHub Actions.