>In recent years, the web has gotten very hostile to the lowly web scraper. It's a result of the natural progression of web technologies away from statically rendered pages to dynamic apps built with frameworks like React and CSS-in-JS.
Dunno, a lot of the time it actually makes scraping easier, because the content that's not in the original source tends to be served up as structured data via XHR (usually JSON). You just need to take a look at the data you're interested in: if it's not in 'view-source', it's coming from somewhere else.
Browser-based scraping makes sense when that data is heavily mangled or obfuscated, or laden with captchas and other anti-scraping measures. Or if you're interested in whether text is hidden, where it's positioned on the page, etc.
Agreed! Multiple times I've wasted hours figuring out which selectors to use, then remembered that I can just look at the network tab and get perfectly structured JSON data.
> the content that's not in the original source tends to be served up as structured data via XHR (usually JSON)
Yes, you can override fetch and log everything that goes in or out of the page you're looking at. I do that in Tampermonkey, but one could probably inject the same kind of script with Puppeteer.
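A minimal sketch of that fetch override, as it might run in a userscript or be injected with Puppeteer's page.evaluateOnNewDocument() (the logging format is just illustrative):

```javascript
// Wrap the page's fetch so every response body is logged before the
// app sees it. Cloning the response keeps the original stream readable.
const originalFetch = globalThis.fetch;
globalThis.fetch = async (...args) => {
  const response = await originalFetch(...args);
  const body = await response.clone().text();
  console.log('[fetch]', String(args[0]), '->', body.slice(0, 200));
  return response;
};
```

The same idea works for XMLHttpRequest by patching open/send on its prototype.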
I'm grateful that GraphQL proliferated, because I don't even have to scrape such resources - I just query.
A while ago, when I was looking for an apartment, I noticed that only the mobile app for a certain service allowed drawing the area of interest; the web version could only search the area currently visible on the screen.
Or did it? Turns out it was the same GraphQL query with the area described as a GeoJSON object.
GeoJSON allows for disjoint areas, which was particularly useful in my case, because I had three of those.
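Once you've seen such a request in the network tab, its shape is easy to replicate. Everything below (endpoint, query, field names, coordinates) is hypothetical; the point is that a single GeoJSON MultiPolygon can carry several disjoint search areas in one query:

```javascript
// Two separate rectangles packed into one MultiPolygon (coordinates made up).
const searchArea = {
  type: 'MultiPolygon',
  coordinates: [
    [[[21.00, 52.23], [21.05, 52.23], [21.05, 52.26], [21.00, 52.26], [21.00, 52.23]]],
    [[[20.90, 52.20], [20.95, 52.20], [20.95, 52.22], [20.90, 52.22], [20.90, 52.20]]],
  ],
};

// The same kind of GraphQL request the app itself sends.
const body = JSON.stringify({
  query: 'query Listings($area: GeoJSON!) { listings(area: $area) { id price } }',
  variables: { area: searchArea },
});

// await fetch('https://example.com/graphql', {
//   method: 'POST',
//   headers: { 'Content-Type': 'application/json' },
//   body,
// });
```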
I am still using CasperJS with PhantomJS. Old tech, but it works perfectly. Some scripts have been running for 10 years on the same sites without ever needing a change.
A bit of a tangent, but a long time ago I was kicked out of a Facebook group for what I considered to be completely made up reasons -- and what really got to me was that by being banned from it, I couldn't even point to the posts that had been actively misunderstood and distorted. I couldn't find anything in the cache files, so I saved a process dump of the still-running Firefox and stitched the posts together from that. I stopped caring as soon as I had my proof, but was still sheepishly proud that I managed to get it.
I used a technique like this a few years back in a production product ... We had an integration partner (who we had permission to integrate with) that offered a different api for integration partners than was used for their website but which was horribly broken and regularly gave out the wrong data. The api was broken but the data displayed on their web page was fine so someone on the team wrote a browser automation (using ruby and selenium!) to drive the browser through the series of pages needed to retrieve all the information required. Needless to say, this broke all the time as the page/css changed etc.
At some point I got pulled in and ran screaming away from selenium to puppeteer -- and quickly discovered the joy that is scripting the browser via natively supported api's and the chrome debugger protocol.
The partners web page happened to be implemented with the apollo graphql client and I came across the puppeteer api for scanning the javascript heap -- I realized that if I could find the apollo client instance in memory (buried as a local variable inside some function closure referenced within the web app) -- I could just use it myself to get the data I needed ... coded it up in an hour or so and it just worked ... super fun and effective way to write a "scraper"!
OnDocumentReady -> scan the heap for the needed object -> use it directly to get the data you need
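Puppeteer's page.queryObjects() can find live objects by prototype; as a simpler illustration of the "scan for the needed object" step, here's a plain object-graph walk you could run inside page.evaluate() once the document is ready. The Apollo-ish predicate in the test is illustrative, not the real client's shape:

```javascript
// Breadth-first walk of an object graph, looking for a value matching a
// predicate - e.g. "looks like an Apollo client" because it has a query()
// method and a cache. In a real page the root would be `window`.
function findInGraph(root, matches, maxVisits = 100000) {
  const seen = new Set();
  const queue = [root];
  while (queue.length > 0 && seen.size < maxVisits) {
    const obj = queue.shift();
    if (obj === null || (typeof obj !== 'object' && typeof obj !== 'function')) continue;
    if (seen.has(obj)) continue;
    seen.add(obj);
    if (matches(obj)) return obj;
    for (const key of Object.getOwnPropertyNames(obj)) {
      try {
        queue.push(obj[key]);
      } catch (_) {
        // Some getters throw when touched; skip them.
      }
    }
  }
  return null;
}
```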
The main/only difference here is that Puppeteer only supports Chromium, while Playwright supports multiple browsers. CDP is the Chrome DevTools Protocol. Otherwise, as long as you're using Chrome in both, you get the same base protocol with a different API.
If you're interested in seeing puppeteer in action I started doing streams last month where I talk through my method. I’ll be posting a lot more since it's been very fun.
Overall Puppeteer is great because you can easily inject JS scripts through a nice API. Selenium is great too, but its web scraping interface isn't as developed, imo. Puppeteer's headless Chrome is also very well optimized, which is a given. What really matters is routing through a VPN/proxy and storing your cookies during auth routines, which I can get into if you have any questions about that.
Thanks for posting this again! It's a year later and I still haven't had to touch the web scraper in production, which is great to reflect on. Running the YouTube command in the post still produces the exact same data, too.
Did you ever make another blog post about how to choose properties working backward from the visible data on the web page to the data structure containing said data?
Searching the heap manually is not working very well. The data I want is in a (very) long list of irrelevant values within a "strings" key. It might have something to do with the data on the page that I want to scrape being rendered by JavaScript.
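That matches how the .heapsnapshot format works: every string in the heap appears once in a flat `strings` array, and nodes/edges reference it by index, so reading `strings` directly gives you a pile of unrelated values. A heavily simplified sketch of resolving a property edge's name, assuming you've already JSON-parsed a snapshot:

```javascript
// Look up the name of edge `edgeIndex` in a parsed heap snapshot.
// Edge records are flat integers; snapshot.meta.edge_fields describes the
// layout (typically ['type', 'name_or_index', 'to_node']).
function edgeName(snapshot, edgeIndex) {
  const fields = snapshot.snapshot.meta.edge_fields;
  const offset = edgeIndex * fields.length;
  const nameOrIndex = snapshot.edges[offset + fields.indexOf('name_or_index')];
  // Only meaningful for named edge types (e.g. 'property'); element edges
  // store an array index here instead of a string index.
  return snapshot.strings[nameOrIndex];
}
```

Tools like adriancooney's puppeteer-heap-snapshot do this resolution for you, which is why searching by property name works better than grepping the raw file.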
I'm not a legal scholar and this isn't my area of expertise, but the final note links out to a TechCrunch article about LinkedIn vs. hiQ Labs, which alludes to web scraping being legal. The case wasn't decided for a few more months after that, though, and the court ultimately sided with LinkedIn. What's the final verdict on web scraping vs. creating fake accounts to get user information (which the case focused on)?
Although you'd imagine screenshots would be easy to OCR reliably, it's not guaranteed to get everything correct.
It's not like you can rely on a dictionary to confirm you've correctly OCRed a post by "@4EyedJediO" - who knows if that's an O or a 0 at the end?
And if you're OCRing the title and view count of a youtube video, for example, you've got to take the page layout into account because there's a recommendations sidebar full of other titles with different view counts.
Most/all minifiers won't actually mangle object property names, because renaming them often has observable side effects. Say you want to grab all the keys of an object and do something different depending on the name of each key: you can no longer do that if the minifier has mangled the object keys. Not to mention it would be significantly harder to track all references to object keys across an application (as opposed to just local variables).
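A tiny illustration of why key mangling is unsafe: a minifier can rename the local variables below and nothing observable changes, but renaming the keys would break this kind of reflection (names here are made up):

```javascript
// The string key names are observable via Object.entries at runtime,
// so a minifier must leave them intact even when it renames `payload`.
const payload = { videoId: 'abc123', viewCount: 1234, title: 'demo' };

function pick(obj, wanted) {
  // Branches on the *string* name of each key.
  return Object.fromEntries(
    Object.entries(obj).filter(([key]) => wanted.includes(key))
  );
}

// pick(payload, ['title', 'viewCount']) relies on the literal key names.
```

This is exactly what makes heap scraping by property name viable: the keys you see in DevTools survive minification.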
It's the JSON data payload that has unminified keys. YouTube is one of the few Google sites that still uses JSON; most use protocol buffers, which generate JS interfaces that would indeed be mangled by minifiers.
> These properties were chosen manually by working backward from the visible data on the web page to the data structure containing said data (I'll dive into that process in another blog post)
That would seem to be the actually interesting/challenging part.
Anyone know of any research on generating HTML differentials against updated webpages with automatic healing of wrappers/selectors, or on using LLMs for web scraping and how to reduce token usage while retaining context?
Cool! I use Selenium to do phishing detection at my company, and I use JavaScript-declared variables as a source of data to analyse. It's especially useful for links that are obfuscated by concatenating two variables into another one.
I think one of the most challenging parts of web scraping is dealing with a website's anti-scraping measures, such as required sign-in, 403 Forbidden errors, and reCAPTCHA.
Does anyone have more experience in handling that?
It's still a game of cat and mouse. Next step is for the website to store multiple instances of similarly structured data so you scrape the dummy unknowingly.
paulddraper | 2 years ago:
Every modern browser has native support for the WebDriver (i.e. Selenium) APIs.

The advantage of Puppeteer is that the Chrome DevTools API is just a better API.
simonw | 2 years ago:
EDIT: https://github.com/adriancooney/puppeteer-heap-snapshot/blob... is the code that captures the snapshot, and it uses createCDPSession() - it looks like Playwright has an equivalent for that Puppeteer API, documented here: https://playwright.dev/docs/api/class-cdpsession
jawerty | 2 years ago:
I have a live coding stream I did the other day scraping Facebook for comments https://www.youtube.com/live/03oTYPm12y8?feature=share
bnchrch | 2 years ago:
As I understand it, this only works for SPAs or other heavy JS frontends, and would not work on plain HTML.

I think that's fine.

What I'm really excited about is this combined with traditional markup scanning plus (incoming buzzword) AI.

Scraping is slowly becoming unstoppable, and that's a good thing.
simonw | 2 years ago:
Or access the accessibility tree of the page using https://shot-scraper.datasette.io/en/stable/accessibility.ht...
Output here: https://gist.github.com/simonw/5174380dcd8c979af02e3dd74051a...
c0balt | 2 years ago:
However, this is a nice hack around "modern" page structures, and kudos to the author for making a proper tool out of it.
ricklamers | 2 years ago:
https://developer.chrome.com/docs/devtools/console/utilities...
simonw | 2 years ago:
I have a strip-tags CLI tool which I can pipe HTML through on its way to an LLM, described here: https://simonwillison.net/2023/May/18/cli-tools-for-llms/

I also do things like this:

Output here: https://gist.github.com/simonw/3fbfa44f83e12f9451b58b5954514...

That's using https://shot-scraper.datasette.io/ to get just the document.body.innerText as a raw string, then piping that to gpt-3.5-turbo with a system prompt.

In terms of retaining context, I added a feature to my strip-tags tool where you can ask it to NOT strip specific tags - e.g.:

That strips all HTML tags except for h1, h2 and h3 - output here: https://gist.github.com/simonw/fefb92c6aba79f247dd4f8d5ecd88...

Full documentation here: https://github.com/simonw/strip-tags/blob/main/README.md