This guide (and most other guides) is missing a massive tip: separate the crawling step (finding URLs and fetching the HTML content) from the scraping step (extracting structured data out of the HTML).
More than once, I've written a scraper that did both of these steps together. Only later did I realize that I had forgotten to extract some information I needed, and had to do the costly work of re-crawling and re-scraping everything.
If you do this in two steps, you can always go back, change the scraper and quickly rerun it on historical data instead of re-crawling everything from scratch.
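A minimal sketch of that two-step split (the file layout and function names here are illustrative, not from the commenter's setup): step 1 persists raw HTML to disk, and step 2 only ever reads from disk, so extraction can be re-run for free.

```python
import pathlib

RAW = pathlib.Path("raw-html")   # step 1 writes here; step 2 only reads

def crawl(urls, fetch):
    """Step 1: fetch every page exactly once and persist the raw HTML."""
    RAW.mkdir(exist_ok=True)
    for i, url in enumerate(urls):
        out = RAW / f"{i:06d}.html"
        if not out.exists():          # already crawled: skip the network
            out.write_text(fetch(url))

def scrape(extract):
    """Step 2: run (and cheaply re-run) extraction over the saved pages."""
    return [extract(p.read_text()) for p in sorted(RAW.glob("*.html"))]
```

If you later realize you forgot a field, you change `extract` and call `scrape()` again; `crawl()` never has to re-run.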
What I find most effective is to wrap `get` with a local cache, and this is the first thing I write when I start a web crawling project. That way, from the very beginning, even when I'm just exploring and experimenting, every page gets downloaded to my machine only once. I don't end up accidentally bothering the server too much, and I don't have to re-crawl if I make a mistake in my code.
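A minimal version of that cache wrapper, stdlib only (the hashing scheme and directory name are my own choices, not the commenter's):

```python
import hashlib
import pathlib
from urllib.request import urlopen

CACHE = pathlib.Path("page-cache")
CACHE.mkdir(exist_ok=True)

def cached_get(url: str) -> str:
    """Fetch a URL, but hit the network at most once per URL."""
    path = CACHE / hashlib.sha256(url.encode()).hexdigest()
    if path.exists():                            # cache hit: no request sent
        return path.read_text()
    html = urlopen(url).read().decode("utf-8", "replace")
    path.write_text(html)                        # cache miss: fetch once, keep
    return html
```

Every call site uses `cached_get` instead of a raw fetch, so exploratory re-runs never touch the server.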
I send the URLs I want scraped to Urlbox[0]; it renders the pages and saves the HTML (plus a screenshot and metadata) to my S3 bucket[1]. I get a webhook[2] when it's ready for me to process.
I prefer to use Ruby, so Nokogiri[3] is the tool I use for the scraping step.
This has been particularly useful when I've wanted to scrape some pages live from a web app and don't want to manage running Puppeteer or Playwright in production.
Disclosure: I work on Urlbox now but I also did this in the five years I was a customer before joining the team.
[0]: https://urlbox.com [1]: https://urlbox.com/s3 [2]: https://urlbox.com/webhooks [3]: https://nokogiri.org
I've found this to be good practice for ETL in general. Separate the steps, and save the raw data from "E" if you can, because it makes testing and verifying "T" later much easier.
An easy way to do this that I've used is to cache web requests. That way, I can run the data-fetching part of the code again with, say, a modification to grab data from additional URLs, without unnecessarily re-requesting my existing URLs. With this method, I don't need to modify existing code either; best of both worlds.
For this I've used the requests-cache lib.
My Clojure scraping framework [0] facilitates that kind of workflow, and I’ve been using it to scrape/restructure massive sites (millions of pages). I guess I’m going to write a blog post about scraping with it at scale. Although it doesn’t really scale much above that – it’s meant for single-machine loads at the moment – it could be enhanced to support that kind of workflow rather easily.
[0]: https://github.com/nathell/skyscraper
I've found this approach works really well using JavaScript and puppeteer for the first stage, and then Python for the second stage (the re module for regular expressions is nice here IMO).
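For instance, the second-stage pass can be as simple as running regexes over the HTML the first stage saved. The price pattern below is a made-up illustration, not from the commenter:

```python
import re

# Match dollar amounts like "$5", "$1,234" or "$1,299.00" in raw HTML.
PRICE_RE = re.compile(r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?")

def extract_prices(html: str) -> list[str]:
    """Return every dollar amount found in the markup, in document order."""
    return PRICE_RE.findall(html)
```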
JS/Puppeteer seems a bit easier for things like rotating user agents. From the article:
> "Websites often block scrapers via blocked IP ranges or blocking characteristic bot activity through heuristics. Solutions: Slow down requests, properly mimic browsers, rotate user agents and proxies."
Can confirm. A few discrete scripts each focused on one part of the process can make the whole thing run seamlessly async, and you naturally end up storing the pages for processing by subsequent scripts. Especially if you write a dedicated downloader - then you can really go nuts optimizing and randomizing the download parameters for each individual link in the queue. "Do one thing and do it well" FTW.
Although in general I like the idea of a queue for a scraper to access separately, another option - assuming you have the storage and bandwidth - is to capture and store every requested page, which lets you replay the extraction step later.
If you're using requests in Python, requests-cache does exactly this for you, saving the data to an SQLite DB, and it's compatible with your existing code that uses requests.
I strongly recommend adding Playwright to your set of tools for Python web scraping. It's by far the most powerful and best designed browser automation tool I've ever worked with.
I use it for my shot-scraper CLI tool: https://shot-scraper.datasette.io/ - which lets you scrape web pages directly from the command line by running JavaScript against pages to extract JSON data: https://shot-scraper.datasette.io/en/stable/javascript.html
I understand that using Playwright in tests is probably the most common use case (it's even in their tagline), but ultimately the introduction section of a library's docs should be about the library itself, not a scenario involving a third-party library like `pytest`. Especially when that can cause side effects. I wasn't exactly "bitten" by it, but it was certainly confusing: when I was learning Playwright, I created test_example.py as instructed, in a folder that turned out to be a minefield of other test_xxxx.py files. Running `pytest` then ran all of them and gave confusing output. That wasn't obvious to me at all, since I had never used pytest before, and this isn't pytest documentation, so no additional context was given.
I actually use your shot-scraper tool (coupled with Mozilla's Readability) to extract the main text of a site (to convert to audio and listen via a podcast player). I love it!
Some caveats though:
- It does fail on some sites. I think the value of Scrapy is that you get more fine-grained control. Although I guess since you can run arbitrary JS with shot-scraper, you could also get that fine-grained control.
- It's slow and uses up a lot of CPU (because Playwright is slow and uses up a lot of CPU). I recently used shot-scraper to extract the text of about 90K sites (long story). Ran it on 22 cores, and the room got very hot. I suspect Scrapy would use an order of magnitude less power.
On the plus side, of course, is the fact that it actually executes JS, so you can get past a lot of JS walls.
Agree that Playwright is great. It's super easy to run on Modal.[2]
1. https://modal.com/docs/guide/workspaces#dashboard
2. https://modal.com/docs/examples/web-scraper#a-simple-web-scr...
Can anyone recommend a good methodology for writing tests against a Playwright scraping project?
I have a relatively sophisticated scraping operation going, but I haven’t found a great way to test methods that are dependent on JavaScript interaction behind a login.
I’ve used Playwright’s HAR recording to great effect for writing tests that don’t require login, but I’ve found that HAR recording doesn’t get me there for post-login, because the HAR playback keeps serving the content from pre-login (even though it includes the relevant assets from both pre- and post-login).
1. <domain>/robots.txt can sometimes have useful info for scraping a website. It will often include links to sitemaps that let you enumerate all pages on a site. This is a useful library for fetching/parsing a sitemap (https://github.com/mediacloud/ultimate-sitemap-parser)
2. Instead of parsing HTML tags, sometimes you can extract the data you need through structured metadata. This is a useful library for extracting it into JSON (https://github.com/scrapinghub/extruct)
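If you'd rather not pull in a library for simple cases, a plain `<urlset>` sitemap is just XML and parses fine with the stdlib. This sketch doesn't handle sitemap index files or gzipped sitemaps, which ultimate-sitemap-parser does:

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace; <loc> elements live inside it.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_text: str) -> list[str]:
    """Return every <loc> URL from a standard <urlset> sitemap."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(NS + "loc")]
```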
This. A lot of modern sites can be really easy to scrape. Lots of machine-readable data.
APIs (for SPAs), OpenGraph/LD+JSON data in <head>, and data- attributes with proper data in them (e.g. a timestamp vs "just now" in the text for the human).
Scraping is a lot easier than it used to be.
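As an illustration of how little parsing that machine-readable data can take, here is a stdlib-only sketch that pulls LD+JSON blocks out of a page (extruct, linked above, is the robust way to do this):

```python
import json
from html.parser import HTMLParser

class JSONLDParser(HTMLParser):
    """Collect the contents of <script type="application/ld+json"> blocks."""
    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        # Only script tags explicitly marked as ld+json count.
        self.in_jsonld = (tag == "script" and
                          dict(attrs).get("type") == "application/ld+json")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_jsonld = False

    def handle_data(self, data):
        if self.in_jsonld and data.strip():
            self.items.append(json.loads(data))

def extract_jsonld(html: str) -> list:
    parser = JSONLDParser()
    parser.feed(html)
    return parser.items
```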
Always funny seeing SaaS companies pitch their own product in blog posts. I understand it's just how marketing works, but pitching your own product as a solution to a problem (that you yourself are introducing, perhaps the first time to a novice reader) never fails to amuse me.
I'm not sure why Python web scraping is so popular compared to Node.js web scraping. npm has some very well made packages for DOM parsing, and since it's in Javascript we have more native feeling DOM features (e.g. node-html-parser using querySelector instead of select - it just feels a lot more intuitive). It's super easy to scrape with Puppeteer or just regular html parsers on a Lambda.
PLEASE PLEASE PLEASE establish and use a consistent useragent string.
This lets us load balance and steer traffic appropriately.
Thank you.
I would prefer if people would not submit sites to HN that use anti-scraping tactics such as DataDome, which block ordinary web users making a single HTTP request from a non-popular, smaller, simpler client.
One example is www.reuters.com. It makes no sense, because the site works fine without Javascript, but Javascript is required as a result of the use of DataDome. See below for an example demonstration.
For anyone doing the scraping that causes these websites to use hacks like DataDome: does your scraping solution get blocked by DataDome? I suspect many will answer no, indicating to me that DataDome is not effective at anything more than blocking non-popular clients. To be more specific, there seems to be a blurring of the line between blocking non-popular clients and preventing "scraping". If scraping can be accomplished with the gigantic, complex, popular clients, then why block the smaller, simpler, non-popular clients that make a single HTTP request?
To browse www.reuters.com text-only in a gigantic, complex, popular browser:
1. Clear all cookies
2. Allow Javascript in Settings for the site ct.captcha-delivery.com
3. Block Javascript for the site www.reuters.com
4. Block images for the site www.reuters.com
First try browsing www.reuters.com with these settings. Two cookies will be stored. One from reuters.com. This one is the DataDome cookie. And another one from www.reuters.com. This second cookie can be deleted with no effect on browsing.
NB. No ad blocker is needed.
Then clear the cookies, remove the above settings and try browsing www.reuters.com with Javascript enabled for all sites and again without an ad blocker. This is what DataDome and Reuters ask web users to do:
"Please enable JS and disable any ad blocker."
Following this instruction from some anonymous web developer totally locks up the computer I am using. The user experience is unbearable.
Whereas with the above settings I used for the demonstration, browsing and reading is fast.
I thought scraping was kind of dead, given all the CAPTCHAs and auth walls everywhere. The article does mention proxies and rate limiting, but could anyone with recent practical experience elaborate on dealing with such challenges?
The CAPTCHAs and walls are more of a desperate, doomed retreat.
If you have a decent GPU (16GB+ VRAM) and are using Linux, then this tool I wrote a few days ago might do the trick (at least for Google's reCAPTCHA). For now, you have to call main.py every time you see a captcha on a site, and you need the GUI, since I'm only using vision via screenshots, no HTML or similar. (Sorry it's not yet well optimized. I'm currently very busy with lots of other things, but next week I should have time to improve it further. It should still work for basic scraping, though.) https://github.com/notune/captcha-solver/
1. Use mobile phone proxies. Because of how mobile phone networks do NAT, thousands of people share the same IPs, so you're much less likely to get blocked.
2. Reverse engineer APIs if the data you want is returned in an AJAX call.
3. Use a captcha-solving service to defeat captchas. There are many, and they're cheap.
4. Use an actual phone, or get really good at convincing the server you are a mobile phone.
5. Buy thousands of fake emails to simulate multiple accounts.
6. Experiment. Experiment. Experiment. Get some burner accounts. Figure out if they have per-minute/hour/day request throttling. See what behavior triggers a Cloudflare captcha. Check whether different variables matter, such as email domain, user agent, or VoIP vs. non-VoIP SMS-based 2FA. Your goal is to simulate a human, so if you sequentially enumerate through every document, that might be what gets you flagged.
Best of luck and happy scraping!
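A sketch of what the throttling and rotation experiments in step 6 might converge on: jittered delays plus rotating identities. The agent strings and proxy addresses below are placeholders, and the 2-8 second window is an arbitrary starting point, not a known-safe value:

```python
import itertools
import random

USER_AGENTS = [  # placeholders: substitute real, current browser UA strings
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) placeholder",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) placeholder",
]
PROXIES = ["http://proxy-a.example:8080", "http://proxy-b.example:8080"]

_agents = itertools.cycle(USER_AGENTS)

def request_plan(n: int):
    """Yield (delay_seconds, user_agent, proxy) tuples with jittered pacing,
    so the traffic pattern never looks metronomic."""
    for _ in range(n):
        yield random.uniform(2.0, 8.0), next(_agents), random.choice(PROXIES)
```

The crawler sleeps for `delay_seconds` before each request and sets the yielded user agent and proxy; tighten or widen the delay window based on what the experiments show the site tolerates.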
Flyscrape[0] eliminates a lot of the boilerplate code that is otherwise necessary when building a scraper from scratch, while still giving you the flexibility to extract data that perfectly fits your needs.
It comes as a single binary executable and runs small JavaScript files without having to deal with npm or node (or python).
You can have a collection of small and isolated scraping scripts, rather than full on node (or python) projects.
[0]: https://github.com/philippta/flyscrape
Any modern web scraping setup is going to require browser agents. You will probably have to build your own tools to get anything from a major social media platform, or even NYT articles.
I've had to do a lot of scraping recently and something that really helps is https://pypi.org/project/requests-cache/ . It's a drop in replacement for the requests library but it caches all the responses to a sqlite database.
Really helps if you need to tweak your script and you're being rate limited by the sites you're scraping.
I recently tried this for the first time in 10 years, and it's really become a miserable chore. There are so many countermeasures deployed against web scraping. The best path forward I could imagine is using LLMs: taking screenshots and having the AI tell me what it sees on the page. But even gathering links is difficult. XML sitemaps for the win.
Only step you missed was embeddings to avoid all the privacy pages, and a cookie banner blocker (which arguably the AI could navigate if I cared).
Are scrapers written on a per-website basis? Are there techniques to separate content from menus / ads / filler / additional information, etc? How do people deal with design changes - is it by rewriting the scraper whenever this happens? Thanks!
Check out the cloudscraper library if you are having speed/CPU issues with sites that require JS or have Cloudflare defending them. That plus a proxy list plus threading allows me to make 300 requests a minute across 32 different proxies. I recently implemented it for a project: https://github.com/rezaisrad/discogs/tree/main/src/managers
I'm convinced there is a gold mine sitting right in front of us, ready to be picked by someone who can intelligently combine web scraping knowledge with LLMs, e.g. scrape data and feed it into LLMs to get insights in an automated fashion. I don't know exactly what the final manifestation looks like, but it's there, and it will be super obvious when someone does it.
I feel the more immediate and impactful opportunity, which people are already pursuing, is that instead of scraping to get/understand content, LLM agents can interactively navigate websites and perform actions. Parsing/scraping can be brittle in the face of changes, but an LLM agent performing an action can just follow steps to search, click on results, and navigate like a human would.
I've been writing rudimentary Python scripts to scrape online recipe websites for my hobby cooking purposes, and I wish there was some general software that could do this more simply. One of the websites has started making their images unclickable, so measures like that make me think it might become harder to automatically fetch such content.
    import pandas as pd
    tables = pd.read_html('https://commons.wikimedia.org/wiki/List_of_dog_breeds', extract_links="all")
    tables[-1]
There was a similar guide on HN titled something like "how to scrape like the big boys", which dug into a setup using mobile IPs, racks of burner phones, and so on.
It's been lost to a bad bookmark setup of mine, and if anyone has a lead on that resource, please link, thank you and unlimited e-karma heading your way.