item 28827509

The State of Web Scraping in 2021

281 points | marvram | 4 years ago | mihaisplace.blog | reply

125 comments

[+] dec0dedab0de|4 years ago|reply
Scraping things that don't want to be scraped is one of my favorite things to do. At work this is usually an interface for some sort of "network appliance." Though with the push for REST APIs over the last 6 years or so, I don't have a need to do it at all too often. Plus with things like selenium it's too easy to just run the page as is, and I can't justify spending the time to figure out the undocumented API.

My favorite one implemented CSRF protections by polling an endpoint, and adding in the hashed data from that endpoint and a timestamp on every request.

When I hear a junior dev give up on something because the API doesn't provide the functionality of the UI, it makes me very sad that they're missing out.
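The polled-token scheme described above can be sketched roughly like this; the endpoint path and field names are invented for illustration, not the real appliance API, and the `session` argument is assumed to be something like a `requests.Session`:

```python
import time

def build_payload(data, csrf_hash, timestamp=None):
    # Every request carries the hash from the polled endpoint plus a timestamp.
    return dict(data, csrf_hash=csrf_hash,
                timestamp=timestamp if timestamp is not None else int(time.time()))

def api_call(session, base_url, path, data):
    # The page polled this endpoint before each request; mirror that here.
    # (Endpoint path and response field are hypothetical.)
    token = session.get(f"{base_url}/api/csrf-token").json()["hash"]
    return session.post(f"{base_url}{path}", json=build_payload(data, token))
```

Keeping the payload construction in a pure helper makes the signing logic easy to test separately from the HTTP calls.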

[+] tmpz22|4 years ago|reply
To be fair, selenium-style scraping can take a lot of time to set up if you aren't already familiar with the tooling, and the browser rendering APIs are unintuitive and sometimes flat-out broken.
[+] eastendguy|4 years ago|reply
> Scraping things that don't want to be scraped

If all else fails, no website can withstand OCR-based screen scraping. It is slow(er), but fast enough for many use cases.
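As a rough illustration of that OCR fallback (a sketch assuming Playwright and pytesseract are installed, not the commenter's actual setup):

```python
def ocr_scrape(url, screenshot_path="page.png"):
    # Render the page in headless Chromium, screenshot it, then OCR the image.
    # Imports are local so the sketch stays importable without these packages.
    from playwright.sync_api import sync_playwright
    from PIL import Image
    import pytesseract

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        page.screenshot(path=screenshot_path, full_page=True)
        browser.close()
    return pytesseract.image_to_string(Image.open(screenshot_path))
```

Slow, as noted, but it sidesteps markup obfuscation entirely because it only ever sees pixels.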

[+] jamesfinlayson|4 years ago|reply
I remember a workmate having to deal with some difficult to scrape data at a previous job - the page randomly rendered with different mark-up (but the same appearance) to mitigate pulling out data using selectors. I think he got to the bottom of it eventually but it made testing his work a pain.
[+] f311a|4 years ago|reply
For Python, instead of BeautifulSoup I prefer to use selectolax, which is 3-5 times faster.
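For reference, selectolax usage looks like this (a minimal sketch; the selector and HTML are placeholders, and the library must be installed):

```python
def extract_headlines(html):
    # Parse with selectolax's HTMLParser and pull text via a CSS selector.
    # Import is local so the sketch stays importable without selectolax.
    from selectolax.parser import HTMLParser
    tree = HTMLParser(html)
    return [node.text(strip=True) for node in tree.css("h2.headline")]
```

The API is close enough to BeautifulSoup's `select` that porting hot paths is usually mechanical.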

Also, I think very few people use MechanicalSoup nowadays. There are libraries that allow you to use headless Chrome, e.g. Playwright.

It looks like the author of the article just googled some libraries for each language and didn't research the topic.

[+] jacurtis|4 years ago|reply
> It looks like the author of the article just googled some libraries for each language and didn't research the topic

Yep, this seemed like an aggregate Google results page.

I was initially intrigued by the article and then realized it was a list of libraries the author found via Google. There were notable omissions from this list and a bunch of weird stuff that feels unnecessary. I don't think the author has actually scraped a page before.

[+] xnyan|4 years ago|reply
I agree with your conclusion, but in any discussion about web scraping it's probably a good idea to mention BeautifulSoup, given how popular it is (virtually a builtin in terms of how much it's used) and all the documentation available for it. It's a good starting point if perf is not going to be a concern.
[+] mdaniel|4 years ago|reply
Lazyweb link: https://github.com/rushter/selectolax

although I don't follow the need to have what appears to be two completely separate HTML parsing C libraries as dependencies; seeing this in the readme for Modest gives me the shivers because lxml has _seen some shit_

> Modest is a fast HTML renderer implemented as a pure C99 library with no outside dependencies.

although its other dep seems much more cognizant about the HTML5 standard, for whatever that's worth: https://github.com/lexbor/lexbor#lexbor

---

> It looks like the author of the article just googled some libraries for each language and didn't research the topic

Heh, oh, new to the Internet, are you? :-D

[+] heavyset_go|4 years ago|reply
requests-html is faster than bs4 using lxml. It's a wrapper over lxml. I built something similar years ago using a similar method, it was much faster than bs4, too.
[+] m_ke|4 years ago|reply
Another tip, there are a few browser extensions that can record your interactions and generate a playwright script.

Here's one: https://chrome.google.com/webstore/detail/headless-recorder/...

[+] jrochkind1|4 years ago|reply
I'm not familiar with "playwright", it doesn't seem to be mentioned in OP either.

When I google, I see it advertised as a "testing" tool.

Can I also use it for scraping? Where would I learn more about doing so?

[+] colinramsay|4 years ago|reply
If you're familiar with Go, there's Colly too [1]. I liked its simplicity and approach and even wrote a little wrapper around it to run it via Docker and a config file:

https://gotripod.com/insights/super-simple-site-crawling-and...

[1] http://go-colly.org/

[+] hivacruz|4 years ago|reply
I used this library to get familiar with Go. It is indeed very powerful and really easy to create a scraper.

My main concerns though were about testing. What if you want to create tests to check that your scraper still gets the data you want? Colly allows nested scraping and it's easy to implement, but you end up with all your logic in one big function, making it harder to test.

Did you find a solution to this? I'm considering switching to net/http + GoQuery only to have more freedom.

[+] mro_name|4 years ago|reply
I have been scraping radio broadcast pages for a decade now. Started with (ruby) scrapy, then nokogiri, then moved on to go and its html package.

Currently sport a mix of curl + grep + xsltproc + lambdasoup (OCaml) and am happy with it. Sounds like a mess but is shallow, inspectable, changeable and concise. http://purl.mro.name/recorder

[+] jmnicolas|4 years ago|reply
Last year I needed some quick scraping and I used a headless Chromium to render webpages and print the HTML then analyze it with C#.

I don't remember exactly, but I think it was around 100 or 200 loc, so not exactly something that took long to write. In fact the most difficult thing was to figure how to pass the right args to Chromium.

I wonder what does a scraping framework offer?

[+] Veen|4 years ago|reply
> I wonder what does a scraping framework offer?

HTTP requests, HTML parsing, crawling, data extraction, wrapping complex browser APIs etc. Nothing you couldn't do yourself, but like most frameworks, they abstract the messy details so you can get a scraper working quickly without having to cobble together a bunch of libraries or re-invent the wheel.

[+] jrochkind1|4 years ago|reply
Just for one example, when you have to get a form, and then submit the form, with the CSRF protection that was in the form... of course you COULD write that yourself by printing HTML and then analyzing it with C# (which triggers more requests to chromium I guess), but you're probably going to wonder why you are reinventing the wheel when you want to be getting on to the domain-specific stuff.
[+] elorant|4 years ago|reply
Throttling is a prime example. If you start loading multitudes of sites in asynchronous fashion you'll have to add some delay, otherwise you run the risk of choking the server on misconfigured sites. I've DDoSed sites accidentally this way. You can of course build a framework on your own, and that's pretty much what every scraper does eventually, but it takes time and a lot of effort.
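The kind of per-domain delay a framework bundles can be sketched in a few lines (a minimal illustration, not any particular framework's implementation):

```python
import time

class Throttle:
    """Enforce a minimum delay between requests to the same domain."""

    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay      # seconds between hits to one domain
        self.clock = clock
        self.sleep = sleep
        self.last = {}          # domain -> time of last request

    def wait(self, domain):
        # Sleep just long enough that consecutive requests to the same
        # domain are at least `delay` seconds apart.
        now = self.clock()
        prev = self.last.get(domain)
        if prev is not None:
            remaining = self.delay - (now - prev)
            if remaining > 0:
                self.sleep(remaining)
        self.last[domain] = self.clock()
```

Injecting `clock` and `sleep` keeps the throttle unit-testable without real waiting.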
[+] krakengerry|4 years ago|reply
I think another technique that should be talked about is intercepting network responses as they happen. The web in 2021 still has a whole lot of client-side rendering. For those sites, data is often loaded on the fly with separate network calls (usually with some sort of nonce or contextual key). Much of the hassle in web scraping can be avoided by listening for that specific response instead of parsing an artifact of the JSON->JS->HTML process.

I put together a toy site [0] recently that uses this approach for JIT price comparisons of events. When you click on an event, the backend navigates to requested ticket provider pages through a pool of Puppeteer instances and waits for JSON responses with pricing data.

[0] https://www.wyzetickets.com
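The interception approach described above looks roughly like this with Playwright's sync API (a sketch; the URL fragment used to match the data endpoint is an assumption):

```python
def capture_json_responses(url, url_fragment="pricing"):
    # Navigate the page and collect JSON bodies of matching network
    # responses, instead of parsing the rendered HTML afterwards.
    # Import is local so the sketch stays importable without Playwright.
    from playwright.sync_api import sync_playwright

    captured = []

    def on_response(response):
        if url_fragment in response.url:
            try:
                captured.append(response.json())
            except Exception:
                pass  # response wasn't JSON after all

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.on("response", on_response)
        page.goto(url, wait_until="networkidle")
        browser.close()
    return captured
```

Because the data arrives as the site's own JSON, there's no nonce or selector churn to chase, only the endpoint's response shape.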

[+] marban|4 years ago|reply
Cloudflare's protection is quite a b*tch to circumvent with any headless or python library.
[+] omneity|4 years ago|reply
Slight aside: The most recent Cloudflare HCaptchas ask you to classify AI generated images. They don’t even look like a proper bike/truck/whatever (I don’t have an example handy).

I categorically refuse to do this when I'm browsing websites using it. I find this new captcha utterly unacceptable.

It’s no “protection” at this point anymore. Websites are using it as an excuse to become even more user hostile. I am worried for the future of the web.

[+] heavyset_go|4 years ago|reply
It's a pain even when you aren't a bot. For a while there, Cloudflare's fingerprinting page would trigger Firefox on Linux to crash instantly.
[+] alphabet9000|4 years ago|reply
with node, i've had success with puppeteer-extra using puppeteer-extra-plugin-stealth
[+] MrDresden|4 years ago|reply
I've been working on a scraping project in Scrapy over the last month, using Selenium as well. My Python skills are mediocre (mostly a Java/Kotlin dev).

Not only has it been a blast to try out, but also surprisingly easy to setup.

I now have around 11 domains being scraped 4 times a day through a well-defined pipeline; ETL then pipes the data to Firebase Firestore for consumption.

Next step is to write the page on top of it.

[+] heavyset_go|4 years ago|reply
Are you using Scrapy mainly for scraping, or do you do crawling, as well?
[+] novaleaf|4 years ago|reply
Self promotion: my SaaS is the lowest cost web scraping tool for high volume, and has been in business since 2016.

https://PhantomJsCloud.com

My SaaS requires some technical knowledge to use (call a web api) which I suppose is why it's not ever in these lists.

Some of my customers are *very* large businesses. If you are looking at evading bot countermeasures, my product probably isn't the best for you, but for TCO nothing beats it.

[+] amelius|4 years ago|reply
> Crawl at off-peak traffic times. If a news service has most of its users present between 9 am and 10 pm – then it might be good to crawl around 11 pm or in the wee hours of the morning.

How do you know this if it is not your website?

Also, the internet has no time zone.

[+] chucky|4 years ago|reply
For sites where there is a peak usage time, it's probably obvious what that peak usage time is. A news service (their example) presumably primarily serves a country or a region - then off-peak traffic times are likely at night.

The Internet has no time zone, but its human users all do.

[+] numeralls|4 years ago|reply
If you're scraping a popular website, Google Trends should be a pretty good proxy.
[+] gcatalfamo|4 years ago|reply
Why no mention of selenium? Is it not cool anymore? I have never heard of mechanicalsoup: is it selenium replacement?
[+] duckmysick|4 years ago|reply
I moved from selenium to playwright. It has a pleasant API and things just work out of the box. I ran into odd problems with selenium before, especially when waiting for a specific element. Selenium didn't register it, but I could see it load.

It was uncharacteristic of me, because I tend to use boring, older technologies. But this gamble paid off for me.

https://playwright.dev/
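For anyone wondering what that looks like, a minimal Playwright scraping sketch (the URL and CSS selector are placeholders; Playwright and its browsers must be installed):

```python
def scrape_quotes(url):
    # Launch headless Chromium via Playwright's sync API and pull text
    # out of the fully rendered page. Import is local so the sketch
    # stays importable without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        texts = page.locator(".quote .text").all_inner_texts()
        browser.close()
    return texts
```

The same locator-based waiting that makes it a good testing tool is what makes it pleasant for scraping JS-heavy pages.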

[+] IceWreck|4 years ago|reply
> is it selenium replacement

No, completely different use case. Selenium is browser automation. MechanicalSoup/Mechanize/RoboBrowser are not actually web browsers; they have no JavaScript support either. They're Python libraries that can simulate a web browser by doing GET requests, storing cookies across requests, filling HTTP POST forms, etc.

The downside is that they don't work with websites which rely on JavaScript to load content. But if you're scraping a website like that, then it might be easier and way way faster to analyze web requests using dev tools or mitmproxy, then automating those API calls instead of automating a browser.
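The "automate the API calls" approach mentioned above amounts to something like this (a sketch: the endpoint, parameters, headers, and response shape are all placeholders you'd copy out of dev tools or mitmproxy):

```python
def fetch_page_items(page_num):
    # Replay the JSON endpoint the page itself calls, instead of driving
    # a browser. Import is local so the sketch stays importable offline.
    import requests

    session = requests.Session()
    # Sites often check headers; these values are placeholders.
    session.headers.update({"User-Agent": "Mozilla/5.0",
                            "Accept": "application/json"})
    resp = session.get("https://example.com/api/items",
                       params={"page": page_num})
    resp.raise_for_status()
    return resp.json()["items"]
```

One plain HTTP request per page of data, with the session keeping any cookies the endpoint expects.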

[+] nicoburns|4 years ago|reply
Selenium is famously unreliable, so a lot of people have been replacing it with headless chrome where they can.
[+] juanse|4 years ago|reply
Nowadays it is more and more common for websites to have some kind of rate limiting middleware, such as rack-attack for Ruby. It would be interesting to explore the strategies to deal with it.
[+] beardyw|4 years ago|reply
I tried Python/ BeautifulSoup and Node/Puppeteer recently. It may be because my Python is poor, but puppeteer seemed more natural to me. Injecting functionality into a properly formed web page felt quite powerful and started me thinking about what you could do with it.
[+] ardalann|4 years ago|reply
[+] abzug|4 years ago|reply
On the Ruby side both Nokogiri and Mechanize should be mentioned...
[+] marvram|4 years ago|reply
Good call! ~ Will add them in the next version.