Scraping things that don't want to be scraped is one of my favorite things to do. At work this is usually an interface for some sort of "network appliance." Though with the push for REST APIs over the last 6 years or so, I don't have a need to do it all that often. Plus, with things like Selenium it's too easy to just run the page as is, and I can't justify spending the time to figure out the undocumented API.
My favorite one implemented CSRF protection by polling an endpoint, then adding a hash of that endpoint's data plus a timestamp to every request.
When I hear a junior dev give up on something because the API doesn't provide the functionality of the UI, it makes me very sad that they're missing out.
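A scheme like that can usually be replicated once you've watched the traffic. A rough sketch — the hash construction, parameter names, and endpoint here are guesses for illustration, not any actual site's scheme:

```python
import hashlib

def sign_request(seed: str, timestamp: int) -> str:
    # Combine the polled seed with a timestamp, the way the site's JS would.
    # SHA-256 and the "seed:timestamp" layout are assumptions for this sketch.
    return hashlib.sha256(f"{seed}:{timestamp}".encode()).hexdigest()

# In a real scraper you'd first poll the token endpoint, e.g.:
#   seed = session.get("https://example.com/api/csrf-seed").text
seed, ts = "abc123", 1609459200
params = {"ts": ts, "sig": sign_request(seed, ts)}
```

The point is that the "protection" is entirely reproducible client logic: whatever the page's JavaScript computes, your scraper can compute too.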
To be fair, Selenium-style scraping can take a lot of time to set up if you aren't already familiar with the tooling, and the browser rendering APIs are unintuitive and sometimes flat out broken.
I remember a workmate having to deal with some difficult-to-scrape data at a previous job - the page randomly rendered with different mark-up (but the same appearance) to mitigate pulling out data using selectors. I think he got to the bottom of it eventually, but it made testing his work a pain.
> It looks like the author of the article just googled some libraries for each language and didn't research the topic
Yep, this seemed like an aggregate Google results page.
I was initially intrigued by the article and then realized it was a list of libraries the author found via Google. There were notable omissions from this list and a bunch of weird stuff that feels unnecessary. I don't think the author has actually scraped a page before.
I agree with your conclusion, but in any discussion about web scraping it's probably a good idea to mention BeautifulSoup given how popular it is (virtually a builtin in terms of how much it's used). Given all the documentation available for it, it's a good starting point if perf is not going to be a concern.
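For reference, a minimal BeautifulSoup extraction is only a few lines (the markup here is made up):

```python
from bs4 import BeautifulSoup

html = """
<div class="product"><h2>Widget</h2><span class="price">$9.99</span></div>
<div class="product"><h2>Gadget</h2><span class="price">$19.99</span></div>
"""

soup = BeautifulSoup(html, "html.parser")
# CSS selectors plus tag attribute access cover most extraction jobs.
products = [
    (div.h2.get_text(), div.select_one(".price").get_text())
    for div in soup.select("div.product")
]
```

Swapping `"html.parser"` for `"lxml"` is the usual first move when parsing speed starts to matter.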
requests-html is faster than bs4 using lxml. It's a wrapper over lxml. I built something similar years ago using a similar method, it was much faster than bs4, too.
If you're familiar with Go, there's Colly too [1]. I liked its simplicity and approach and even wrote a little wrapper around it to run it via Docker and a config file:
I used this library to get familiar with Go. It is indeed very powerful and really easy to create a scraper.
My main concerns though were about testing. What if you want to create tests to check if your scraper still gets the data we want? Colly allows nested scraping and it's easy to implement, but you have all your logic in one big function, making it harder to test.
Did you find a solution to this? I'm considering switching to net/http + GoQuery only to have more freedom.
I have been scraping radio broadcast pages for a decade now. Started with (ruby) scrapy, then Nokogiri, then moved on to Go and its html package.
Currently sport a mix of curl + grep + xsltproc + lambdasoup (OCaml) and am happy with it. Sounds like a mess but is shallow, inspectable, changeable and concise. http://purl.mro.name/recorder
Last year I needed some quick scraping, so I used a headless Chromium to render webpages and print the HTML, then analyzed it with C#.
I don't remember exactly, but I think it was around 100 or 200 loc, so not exactly something that took long to write.
In fact, the most difficult thing was figuring out how to pass the right args to Chromium.
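For anyone hitting the same wall: the flags that usually do the trick (exact names vary a bit between Chromium versions) can be wrapped in a few lines, sketched here in Python:

```python
import subprocess

def chromium_args(url: str, binary: str = "chromium") -> list:
    # --dump-dom prints the serialized DOM after rendering;
    # --disable-gpu used to be required for headless mode on some platforms.
    return [binary, "--headless", "--disable-gpu", "--dump-dom", url]

def dump_dom(url: str) -> str:
    # Returns the rendered HTML ready for whatever parser you prefer.
    return subprocess.run(
        chromium_args(url), capture_output=True, text=True, check=True, timeout=60
    ).stdout
```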
HTTP requests, HTML parsing, crawling, data extraction, wrapping complex browser APIs etc. Nothing you couldn't do yourself, but like most frameworks, they abstract the messy details so you can get a scraper working quickly without having to cobble together a bunch of libraries or re-invent the wheel.
Just for one example, when you have to get a form, and then submit the form, with the CSRF protection that was in the form... of course you COULD write that yourself by printing HTML and then analyzing it with C# (which triggers more requests to chromium I guess), but you're probably going to wonder why you are reinventing the wheel when you want to be getting on to the domain-specific stuff.
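A sketch of that get-form-then-submit dance with requests + BeautifulSoup — the URLs and field names are hypothetical:

```python
import requests
from bs4 import BeautifulSoup

def hidden_fields(form_html: str) -> dict:
    """Collect hidden inputs (the CSRF token among them) from a page's form."""
    soup = BeautifulSoup(form_html, "html.parser")
    return {
        i["name"]: i.get("value", "")
        for i in soup.select("input[type=hidden]")
        if i.has_attr("name")
    }

def login(base_url: str, user: str, password: str) -> requests.Session:
    s = requests.Session()  # keeps cookies across the GET and the POST
    data = hidden_fields(s.get(f"{base_url}/login").text)
    data.update({"username": user, "password": password})
    s.post(f"{base_url}/login", data=data)
    return s
```

This is exactly the kind of plumbing a framework hands you for free, which is the commenter's point.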
Throttling is a prime example. If you start loading multitudes of sites asynchronously, you'll have to add some delay, otherwise you run the risk of choking the servers of misconfigured sites. I've DDoSed sites accidentally this way. You can of course build a framework of your own, and that's pretty much what every scraper does eventually, but it takes time and a lot of effort.
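A minimal way to keep concurrency and pacing under control with asyncio — the `fetch` stub stands in for a real aiohttp/httpx call:

```python
import asyncio

async def fetch(url: str) -> str:
    # Stand-in for a real async HTTP request.
    await asyncio.sleep(0)
    return url

async def polite_gather(urls, max_concurrent: int = 5, delay: float = 1.0):
    sem = asyncio.Semaphore(max_concurrent)  # cap simultaneous requests
    results = []

    async def one(url):
        async with sem:
            results.append(await fetch(url))
            await asyncio.sleep(delay)  # pause before releasing the slot

    await asyncio.gather(*(one(u) for u in urls))
    return results
```

A production version would also track delays per host rather than globally, so one slow site doesn't stall the others.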
I think another technique that should be talked about is intercepting network responses as they happen. The web in 2021 still has a whole lot of client-side rendering. For those sites, data is often loaded on the fly with separate network calls (usually with some sort of nonce or contextual key). Much of the hassle in web scraping can be avoided by listening for that specific response instead of parsing an artifact of the JSON->JS->HTML process.
I put together a toy site [0] recently that uses this approach for JIT price comparisons of events. When you click on an event, the backend navigates to requested ticket provider pages through a pool of Puppeteer instances and waits for JSON responses with pricing data.
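With Playwright the listening part is only a few lines; the URL pattern below is hypothetical — in practice you find it once in the browser's Network tab:

```python
def wants_pricing(url: str, pattern: str = "/api/pricing") -> bool:
    # Match the XHR/fetch call that carries the data, not the HTML shell.
    return pattern in url

def capture_pricing(event_url: str) -> list:
    from playwright.sync_api import sync_playwright  # heavy dep, imported here
    payloads = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Collect every matching JSON response the page triggers while rendering.
        page.on(
            "response",
            lambda r: payloads.append(r.json()) if wants_pricing(r.url) else None,
        )
        page.goto(event_url, wait_until="networkidle")
        browser.close()
    return payloads
```

The upshot: you get clean JSON straight from the site's own data call instead of re-parsing its rendered HTML.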
Slight aside: The most recent Cloudflare HCaptchas ask you to classify AI generated images. They don’t even look like a proper bike/truck/whatever (I don’t have an example handy).
I categorically refuse to do this when I'm browsing websites that use it. I find this new captcha utterly unacceptable.
It’s no “protection” at this point anymore. Websites are using it as an excuse to become even more user hostile. I am worried for the future of the web.
I've been working on a scraping project in Scrapy over the last month, using Selenium as well. My Python skills are mediocre (mostly a Java/Kotlin dev).
Not only has it been a blast to try out, but it was also surprisingly easy to set up.
I now have around 11 domains being scraped 4 times a day through a well-defined pipeline + ETL, which then pipes the data to Firebase Firestore for consumption.
My SaaS requires some technical knowledge to use (call a web api) which I suppose is why it's not ever in these lists.
Some of my customers are *very* large businesses. If you are looking at evading bot countermeasures, my product probably isn't the best for you. But for TCO nothing beats it.
> Crawl at off-peak traffic times. If a news service has most of its users present between 9 am and 10 pm – then it might be good to crawl around 11 pm or in the wee hours of the morning.
For sites where there is a peak usage time, it's probably obvious what that peak usage time is. A news service (their example) presumably primarily serves a country or a region - then off-peak traffic times are likely at night.
The Internet has no time zone, but its human users all do.
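So the practical check is against the site's local clock, not yours. A small sketch — the 9 am to 10 pm window is the article's example:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def is_off_peak(now_utc: datetime, site_tz: str, peak=(9, 22)) -> bool:
    """True when the site's local time falls outside its busy window."""
    local = now_utc.astimezone(ZoneInfo(site_tz))
    start, end = peak
    return not (start <= local.hour < end)
```

A scheduler can then poll this before each crawl batch and defer work until the target's quiet hours.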
I moved from Selenium to Playwright. It has a pleasant API and things just work out of the box. I ran into odd problems with Selenium before, especially when waiting for a specific element: Selenium didn't register it, but I could see it load.
It was uncharacteristic of me, because I tend to use boring, older technologies. But this gamble paid off for me.
No, completely different use case. Selenium is browser automation. MechanicalSoup/Mechanize/RoboBrowser are not actual web browsers, and they have no JavaScript support either. They're Python libraries that simulate a web browser by doing GET requests, storing cookies across requests, filling HTTP POST forms, etc.
The downside is that they don't work with websites which rely on JavaScript to load content. But if you're scraping a website like that, then it might be easier and way way faster to analyze web requests using dev tools or mitmproxy, then automating those API calls instead of automating a browser.
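Once you've lifted the real endpoint from dev tools ("copy as cURL" helps), the browser disappears from the loop entirely. A sketch with a made-up endpoint:

```python
import requests

def fetch_page(page: int) -> list:
    # Hypothetical JSON endpoint discovered in the Network tab.
    r = requests.get(
        "https://example.com/api/v1/items",
        params={"page": page, "per_page": 50},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    r.raise_for_status()
    return r.json()

def all_items(fetch, max_pages: int = 100) -> list:
    """Page through a JSON API until it returns an empty batch."""
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch(page)
        if not batch:
            break
        items.extend(batch)
    return items
```

Keeping the pagination loop separate from the HTTP call also makes it trivial to unit-test with a fake fetcher.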
Nowadays it's more and more common for websites to have some kind of rate limiting middleware, such as Rack::Attack for Ruby. It would be interesting to explore strategies to deal with it.
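One basic strategy: retry on HTTP 429 with exponential backoff, honouring the `Retry-After` header when the server sends one. A sketch:

```python
import time
import requests

def backoff_delay(attempt: int, retry_after=None) -> float:
    # Prefer the server's own hint; otherwise back off exponentially.
    return float(retry_after) if retry_after else float(2 ** attempt)

def get_with_backoff(url: str, max_tries: int = 5) -> requests.Response:
    with requests.Session() as s:
        for attempt in range(max_tries):
            r = s.get(url, timeout=30)
            if r.status_code != 429:
                return r
            time.sleep(backoff_delay(attempt, r.headers.get("Retry-After")))
    r.raise_for_status()
    return r
```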
I tried Python/BeautifulSoup and Node/Puppeteer recently. It may be because my Python is poor, but Puppeteer seemed more natural to me. Injecting functionality into a properly formed web page felt quite powerful and started me thinking about what you could do with it.
self promotion: I launched my no-code scraping cloud software on ProductHunt last month after a year of testing with beta users: https://www.producthunt.com/posts/browse-ai
[+] [-] dec0dedab0de|4 years ago|reply
[+] [-] tmpz22|4 years ago|reply
[+] [-] eastendguy|4 years ago|reply
If all else fails, no website can withstand OCR-based screen scraping. It is slow(er), but fast enough for many use cases.
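The core of an OCR pipeline is short, assuming a screenshot is already on disk — this sketch uses Pillow + pytesseract, which also needs the tesseract binary installed:

```python
def ocr_text(screenshot_path: str) -> str:
    # Imported here so the normalizer below works without tesseract installed.
    from PIL import Image
    import pytesseract
    return pytesseract.image_to_string(Image.open(screenshot_path))

def normalize(text: str) -> str:
    """OCR output is noisy on whitespace; collapse it before matching."""
    return " ".join(text.split())
```

The slow part in practice is producing the screenshots (typically via a headless browser), not the recognition itself.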
[+] [-] jamesfinlayson|4 years ago|reply
[+] [-] f311a|4 years ago|reply
Also, I think very few people use MechanicalSoup nowadays. There are libraries that allow you to use headless Chrome, e.g. Playwright.
It looks like the author of the article just googled some libraries for each language and didn't research the topic.
[+] [-] jacurtis|4 years ago|reply
[+] [-] xnyan|4 years ago|reply
[+] [-] mdaniel|4 years ago|reply
although I don't follow the need to have what appears to be two completely separate HTML parsing C libraries as dependencies; seeing this in the readme for Modest gives me the shivers because lxml has _seen some shit_
> Modest is a fast HTML renderer implemented as a pure C99 library with no outside dependencies.
although its other dep seems much more cognizant about the HTML5 standard, for whatever that's worth: https://github.com/lexbor/lexbor#lexbor
---
> It looks like the author of the article just googled some libraries for each language and didn't research the topic
Heh, oh, new to the Internet, are you? :-D
[+] [-] heavyset_go|4 years ago|reply
[+] [-] m_ke|4 years ago|reply
Here's one: https://chrome.google.com/webstore/detail/headless-recorder/...
[+] [-] sidharthv|4 years ago|reply
npx playwright codegen wikipedia.org
https://playwright.dev/docs/next/codegen
[+] [-] jrochkind1|4 years ago|reply
When I google, I see it advertised as a "testing" tool.
Can I also use it for scraping? Where would I learn more about doing so?
[+] [-] benzible|4 years ago|reply
$30 / month for 300K requests, rotating residential proxies, uses headless Chromium, etc.
[+] [-] colinramsay|4 years ago|reply
https://gotripod.com/insights/super-simple-site-crawling-and...
[1] http://go-colly.org/
[+] [-] hivacruz|4 years ago|reply
[+] [-] IceWreck|4 years ago|reply
[deleted]
[+] [-] mro_name|4 years ago|reply
[+] [-] jmnicolas|4 years ago|reply
I wonder what a scraping framework offers?
[+] [-] Veen|4 years ago|reply
[+] [-] jrochkind1|4 years ago|reply
[+] [-] elorant|4 years ago|reply
[+] [-] krakengerry|4 years ago|reply
[0] https://www.wyzetickets.com
[+] [-] marban|4 years ago|reply
[+] [-] omneity|4 years ago|reply
[+] [-] password4321|4 years ago|reply
> Cloudflare's bot protection mostly makes use of TLS fingerprinting, and thus pretty easy to bypass.
https://news.ycombinator.com/item?id=28251700 -> https://github.com/refraction-networking/utls
Disclaimer: haven't tried it.
[+] [-] heavyset_go|4 years ago|reply
[+] [-] alphabet9000|4 years ago|reply
[+] [-] MrDresden|4 years ago|reply
Next step is to write the page on top of it.
[+] [-] heavyset_go|4 years ago|reply
[+] [-] novaleaf|4 years ago|reply
https://PhantomJsCloud.com
[+] [-] toastal|4 years ago|reply
[+] [-] amelius|4 years ago|reply
How do you know this if it is not your website?
Also, the internet has no time zone.
[+] [-] chucky|4 years ago|reply
[+] [-] numeralls|4 years ago|reply
[+] [-] gcatalfamo|4 years ago|reply
[+] [-] duckmysick|4 years ago|reply
https://playwright.dev/
[+] [-] IceWreck|4 years ago|reply
[+] [-] nicoburns|4 years ago|reply
[+] [-] juanse|4 years ago|reply
[+] [-] beardyw|4 years ago|reply
[+] [-] Jenkins2000|4 years ago|reply
[+] [-] anjingchi|4 years ago|reply
[+] [-] say_it_as_it_is|4 years ago|reply
[+] [-] marvram|4 years ago|reply
[+] [-] ardalann|4 years ago|reply
Here are a few comparisons if you're curious:
- https://www.browse.ai/vs/hexomatic
- https://www.browse.ai/vs/import-io
- https://www.browse.ai/vs/octoparse
- https://www.browse.ai/vs/oxylabs
- https://www.browse.ai/vs/parsehub
- https://www.browse.ai/vs/simplescraper
- https://www.browse.ai/vs/webscraper
- https://www.browse.ai/vs/zyte
[+] [-] abzug|4 years ago|reply
[+] [-] marvram|4 years ago|reply