welanes | 9 months ago
2. Outsourcing the task to one of the many CAPTCHA-solving services (2Captcha etc) – better
3. Using a pool of reliable IP addresses so you don't encounter checkboxes or turnstiles – best
I run a web scraping startup (https://simplescraper.io) and this is usually the approach[0]. It has become more difficult, and I think a lot of the AI crawlers are peeing in the pool with aggressive scraping, which is making the web a little bit worse for everyone.
[0] Worth mentioning that once you're "in" past the captcha, a smart scraper will try to use fetch to access more pages on the same domain so you only need to solve a fraction of possible captchas.
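Option 3 above (rotating through a pool of reliable IPs) could be sketched roughly as follows. This is a minimal illustration, not anyone's production setup: the proxy addresses are placeholders, and a real pool would also track bans and response health per address.

```javascript
// Sketch of a proxy pool: rotate each request through a different
// reliable IP so no single address trips rate-based bot heuristics.
class ProxyPool {
  constructor(proxies) {
    this.proxies = proxies;
    this.i = 0;
  }
  // Simple round-robin selection; real pools also retire banned
  // or slow proxies and weight by recent success rate.
  next() {
    const proxy = this.proxies[this.i % this.proxies.length];
    this.i += 1;
    return proxy;
  }
}

// Placeholder addresses; hand the selected proxy to your HTTP client
// (e.g. an agent option) or a headless browser's proxy setting.
const pool = new ProxyPool(["http://10.0.0.1:8080", "http://10.0.0.2:8080"]);
const proxyForThisRequest = pool.next();
```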
nomilk | 9 months ago
First time hearing of the fetch() approach! If I understand correctly: regular browser automation typically makes a separate GET request (a full page navigation) for each page, whereas the fetch() strategy makes one GET for the first page, and then, after satisfying Cloudflare, uses fetch(<url>) from within that page to retrieve the rest of the pages you're after.
This approach is less noisy, has less impact on the server, and is therefore less likely to be flagged by bot detection.
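The strategy described above could be sketched like this. The loop is written as a pure helper (the fetch function is passed in) so the pattern is clear without a live site; in a real scrape it would run inside the already-validated page, where same-origin fetch() calls reuse the session cookies set after the first challenge solve. Names and paths are illustrative.

```javascript
// Sketch of the fetch() strategy: one real navigation solves the
// challenge, then subsequent same-domain pages are pulled via fetch()
// from inside that page, reusing the session instead of re-navigating.
async function fetchPages(fetchFn, paths) {
  const results = {};
  for (const path of paths) {
    // credentials: "include" sends the cookies the site set after
    // the initial challenge was passed
    const res = await fetchFn(path, { credentials: "include" });
    results[path] = await res.text();
  }
  return results;
}

// In a real run this executes in the page context, e.g. via a headless
// browser's evaluate call:
//   await page.evaluate(() => fetch("/page/2").then(r => r.text()));
```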
This is fascinating stuff. (I'd previously used very little javascript in scrapes, preferring ruby, R, or python, but this may tilt my tooling preferences toward using more js.)
therein | 9 months ago
Tokumei-no-hito | 9 months ago