The State of Web Scraping 2022

291 points | Ian_Kerins | 4 years ago | scrapeops.io

144 comments

[+] KieranMac|4 years ago|reply
As a lawyer whose primary focus is web scraping, I find this article misleading and inaccurate in many ways. While it is true that the Van Buren case is generally positive for web scraping, the overall legal landscape is still murky. The main battleground for web scraping legal issues is shifting from the CFAA to breach of contract and various state-law issues, including misappropriation, unjust enrichment, and trespass to chattels.

In my opinion, 2021 was a bad year for the law as it relates to web scraping. The Supreme Court remanded hiQ Labs, and many high-profile lower-court cases ended badly for web scrapers. It's a darker shade of gray than it was in 2020. It can be navigated, but it's tricky.

[+] btown|4 years ago|reply
Not a lawyer, but is it at least true that web scraping alone would now be significantly less likely to be a basis for federal criminal prosecution under the CFAA?

I'm often reminded of the fact that in https://en.wikipedia.org/wiki/United_States_v._Swartz the scraped party JSTOR did not desire to press civil charges, but due to the criminal component of the CFAA, this was out of their hands - and the story ended in the worst possible way.

If the current legal landscape at least better restricts disputes over web scraping to civil litigation, it may not be a huge change for how companies look at their risks, but it could make a huge difference for individuals caught in the crossfire.

[+] digitcatphd|4 years ago|reply
Good take. IMO, ethically speaking, we should not penalize scrapers for scraping itself but based on how they use the data.

Scraping Facebook to make a clone of profiles shouldn't be held to the same scrutiny as scraping Facebook to do an internal analysis of user demographics for research purposes.

[+] RobSm|4 years ago|reply
How many contracts does Google breach by scraping billions of pages every month?
[+] faizshah|4 years ago|reply
Is there a good blog or something that tracks these cases?
[+] samcrawford|4 years ago|reply
Enjoyed reading your bio on your website. Sub 24 hour at Leadville is super impressive! (Coming from someone who has not managed 24 hours at Western States... Yet...)
[+] Seattle3503|4 years ago|reply
Is there a good blog post or summary that I could read?
[+] ok_coo|4 years ago|reply
Time for me to advocate again for people to use Common Crawl. Please don't slam people's websites; look for alternatives before scraping. There are probably other, better options: APIs, data set downloads, etc.

https://commoncrawl.org/
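
For anyone who wants to check whether Common Crawl already holds the pages they need before scraping, here is a minimal sketch that only builds a lookup URL for the Common Crawl URL index. The index server address, the crawl ID `CC-MAIN-2021-49`, and the `url`/`output`/`limit` query parameters are assumptions based on the public CDX-style index; the function name `build_index_query` is made up for illustration. Verify the details against the Common Crawl documentation before relying on it.

```python
from urllib.parse import urlencode

# Assumptions: the public Common Crawl index server and one example crawl ID.
INDEX_SERVER = "https://index.commoncrawl.org"
EXAMPLE_CRAWL_ID = "CC-MAIN-2021-49"


def build_index_query(url_pattern: str,
                      crawl_id: str = EXAMPLE_CRAWL_ID,
                      limit: int = 10) -> str:
    """Build a URL asking the index which captures match url_pattern.

    This only constructs the query string; it performs no network request.
    """
    params = {"url": url_pattern, "output": "json", "limit": limit}
    return f"{INDEX_SERVER}/{crawl_id}-index?{urlencode(params)}"


if __name__ == "__main__":
    # A trailing wildcard asks for every captured path under the domain.
    print(build_index_query("example.com/*"))
```

If the index returns records for your pattern, each one should point at an archived capture, which may save you from hitting the live site at all.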

[+] dewey|4 years ago|reply
I'd guess that for many popular scraping use cases this is not really useful, as they are usually about being quick and up to date (job postings, availability information, e-commerce, SERPs, ...), not about having a big corpus of historic data.
[+] weird-eye-issue|4 years ago|reply
Have you used this in real world scenarios? Or is it just a nice hypothetical that sounds great in theory but almost never works in practice?
[+] LunaSea|4 years ago|reply
Common Crawl is missing far too many URLs for it to be useful in a real world scenario.
[+] mycall|4 years ago|reply
I wish web.archive.org had an index built by someone like Common Crawl. There is lots of great stuff on archive.org.
[+] joe_91|4 years ago|reply
That looks like a great resource! How often is the data set "updated"?

I'd imagine most people's use cases need data which can change from day to day or week to week but I do think that this is fantastic if I was to have a project which was looking at data across a longer timeframe.

[+] jimkri|4 years ago|reply
That is too much data to parse for a simple website scrape.

I do think Common Crawl has a lot of potential for people to use instead of scraping, but I think it's for larger projects. It gave me the idea to look at the links to identify whether they belong to a business or non-business website.

[+] joe_91|4 years ago|reply
I'm scraping about 30 sites for work at the moment, but have a few that are using Cloudflare which has been a b*tch to deal with. Tried numerous libraries and different proxy providers, but reliability is patchy. Previous fixes like https://github.com/Anorov/cloudflare-scrape don't seem to work anymore after Cloudflare updates, so I've switched to using a pretty optimised headless browser with good proxies instead.
[+] nanna|4 years ago|reply
I'm finding that Cloudflare is even blocking my RSS reader from requesting feeds behind their service. It's not even just scrapers at this point.
[+] nsonha|4 years ago|reply
> optimised headless browser with good proxies instead

Are you saying you only had problems because you didn't use a headless browser before, and now a headless browser plus proxies generally suffices to avoid being seen as a scraper?

[+] temp8964|4 years ago|reply
I think it will eventually go the way of stock trading. If you have a good strategy, you don't want to share it with the world, because that will render your strategy useless.
[+] emptysea|4 years ago|reply
Is the “pretty optimized headless browser” an off the shelf thing, or something custom? Are you using playwright/puppeteer to drive it?
[+] valar_m|4 years ago|reply
Do you have any recommendations for the "good proxies" you mentioned?
[+] mellosouls|4 years ago|reply
> With the right combination of proxies, user agents and browsers, you can scrape every website. Even those that seem unscrapable.

:

> This outcome was great news for web scrapers, as it means that so long as a website has made its data public you are not in violation of the CFAA when you scrape the data even if it is prohibited in some other way (T&Cs, robots.txt, etc).

Just because you can, doesn't mean you should. It would be better I think if there was a treatment of the ethics here, rather than a seemingly "ra-ra go bots" attitude, as though the only consideration is commercial.

[+] Ian_Kerins|4 years ago|reply
100% agree: scraping should always be done respectfully.

- If they provide an API, then use it.

- Don't slam a website; ideally spread requests out over the hours of the day when their target audience is least active (night time).

- If you can get cached data from somewhere that works, then use that.

Most developers are respectful and only scrape what they really need, not only from an ethical point of view but also a cost and resources point of view. Scraping data is resource intensive and proxy costs can quickly rise to $1,000-$10,000 per month. So most only scrape the minimum they need.

The other thing here is that a lot of the most popular sites being scraped are also massive scrapers themselves. The big ecommerce sites are being scraped, but they are also scraping their competitors.
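
To make the "don't slam a website" point concrete, here is a minimal sketch of a polite crawl loop using only the Python standard library. It filters URLs against the site's robots.txt rules and pauses between requests. The user agent string, the 5-second fallback delay, and the helper names (`load_rules`, `polite_plan`, `crawl`) are all assumptions made up for this illustration, and the `fetch` callable is left to the caller.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-polite-bot"   # hypothetical user agent name
DEFAULT_DELAY_SECONDS = 5.0         # assumed fallback when no Crawl-delay is set


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_plan(urls, rules, user_agent=USER_AGENT,
                default_delay=DEFAULT_DELAY_SECONDS):
    """Return (allowed_urls, delay_seconds) for this user agent."""
    allowed = [u for u in urls if rules.can_fetch(user_agent, u)]
    crawl_delay = rules.crawl_delay(user_agent)
    delay = float(crawl_delay) if crawl_delay is not None else default_delay
    return allowed, delay


def crawl(urls, rules, fetch, sleep=time.sleep):
    """Fetch allowed URLs one at a time, sleeping between requests."""
    allowed, delay = polite_plan(urls, rules)
    results = []
    for i, url in enumerate(allowed):
        if i > 0:
            sleep(delay)  # spread requests out instead of bursting
        results.append(fetch(url))
    return results
```

Passing `sleep` as a parameter keeps the loop easy to test, and scheduling the whole job for the site's off-peak hours (for example with cron) would cover the night-time suggestion.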

[+] Terry_Roll|4 years ago|reply
You don't even need to do that. Go overt, in plain sight, in yer face, and call yourself a search engine!
[+] bryanrasmussen|4 years ago|reply
This sort of implies that the 'ethics' would end up meaning that you shouldn't scrape if it is not wanted, although I suppose there can be ethical or other non-commercial requirements that mean you should.
[+] NDizzle|4 years ago|reply
I still have a daily job running a web scraper I first wrote with Scrapy back in 2017. I think I've had to update it 3 times over the years for changes to the site and web standards.

Good old government sites - rarely change!

[+] bobblywobbles|4 years ago|reply
Not a lawyer, but many terms of service prohibit interacting with their website in an automated fashion, as well as collecting their data. In my understanding, scraping a site with these terms already puts you in the wrong.
[+] cblconfederate|4 years ago|reply
Cloudflare's blocks get in the way of many websites that are simply trying to get a "link preview" of the page, even if it is only a single request from a new IP. I wish they would offer some kind of alternative for the pages they serve instead of a captcha block.
[+] fareesh|4 years ago|reply
My toolbox of choice for web scraping is either Nokogiri or Puppeteer.

Can someone sell me on Beautiful Soup or Scrapy or any of the others? Do they provide any advantages or features that I'd be missing out on?

[+] gmanis|4 years ago|reply
What does HN think of web scraping for the purpose of price comparison?

I'm asking this because I run a small side project to show prices across retailers for a very small niche. The users are very, very happy. Even the vendors have started contacting me to be listed on the comparison.

But I am unable to make a business out of it other than a few affiliate commissions.

[+] Ian_Kerins|4 years ago|reply
If anyone has anything else they think was missed or should be included then let me know!
[+] coverj|4 years ago|reply
I have been interested in web scraping lately but have never really dived too deep. Does anyone have more in-depth resources (GitHub projects, blogs, forums, etc.) than the tutorials that basically amount to "install Beautiful Soup and get data from a tag"?
[+] newsbinator|4 years ago|reply
Like most here, I am very good at web scraping and automated form fills. I keep trying to figure out a profitable side project or business idea to make out of it and keep coming up with nothing that works.

Any good ideas?

[+] darepublic|4 years ago|reply
Separate from web scraping, there is the use of automation to perform normal, allowable user actions on the site. That should be considered distinct from large-scale data extraction, no?
[+] JJxFile|4 years ago|reply
The web scraping ecosystem is growing, with more libraries, frameworks and products available than ever before to simplify our web scraping headaches so the future is looking bright.
[+] slvrspoon|4 years ago|reply
For those in this thread with super-serious experience scraping and automating at scale who are looking for (ethical!) work, please contact me directly.
[+] blantonl|4 years ago|reply
I fail to understand why Web Scraping isn't almost universally viewed as unethical and a terrible and nasty business practice.

In almost all cases I view web scraping as people trying to build businesses on top of other people's innovation and data. I know this isn't a popular opinion, so change my mind. At the same time, I'm one of those business owners who fights with web scraping constantly, and my opinion is that those doing it to my platforms are doing so solely to steal data and build businesses on top of others' hard work.

[+] xrendan|4 years ago|reply
I think it really depends on the application of web scraping. (Speaking as someone who does what is, in my mind, ethical web scraping.)

- Scraping public information from government websites to do analysis: ethical, it's the public's data

- Scraping to help a company's customers use that company's product more effectively, for example scraping a medical office's insurance claims to help them automate their insurance remittance process: ethical

- Scraping faces to build a surveillance-tech company: disgusting

- Scraping your own website because your internal processes are so broken you can't get it any other way: ethical

- Scraping to just copy someone's data they worked hard to generate to go and resell: unethical

[+] yashasolutions|4 years ago|reply
Google is web scraper number one, as is any search engine. Making web scraping illegal means making search engines illegal.

You do not want information to be public and/or free? Put it behind a login and charge for it.

You want to prevent people from reusing the data you publish to build other (potentially competitive) products? Then use licensing, copyright, and the law.

However, banning a technological means because of what a minority could potentially do with it? Then make the internet illegal and the problem is fixed altogether.

[+] indymike|4 years ago|reply
> I fail to understand why Web Scraping isn't almost universally viewed as unethical and a terrible and nasty business practice.

Scraping is simply a way to get data. I used to run a team that was paid by large government contractors in the US to scrape their job posts from their career portals, and then deliver those posts via email, fax and snail mail to veteran's service officers near the job opening. It was required by regulation, and the only way to get the job data was to scrape. Many enterprise applicant tracking systems did not have a good way to automatically deliver that data or wanted $millions for that capability. Scraping was the best way and, in some cases, the only way.

By the way, search engines like Google scrape data and index it.

[+] Ian_Kerins|4 years ago|reply
Some web scraping can be unethical, say for example if you are scraping a site solely to mirror their content and add zero value to the original content owner.

However, there are a lot of web scraping use cases which are beneficial to the site being scraped and actually add value. Two examples:

- Google: Ahrefs & SEMRush scrape Google so they can provide SEO analytics to companies looking to grow. Google's keyword analytics aren't great, so Google has effectively outsourced providing a good analytics tool to Ahrefs & SEMRush, whose products increase the value of the Google SERPs ecosystem.

- Amazon + Other E-Commerce: Amazon wants brands and 3rd party stores to list products on their site, and the companies scraping Amazon to provide product placement tools to their users make it easier and more profitable to list products on Amazon, leading to more and more companies listing products there.

[+] kbenson|4 years ago|reply
Do you provide an API, paid or not, for the same data? An API which might even have limitations on use makes scraping a bit less defensible in my mind, but if you're offering something for free to the public and then getting upset when people take and use that free info, maybe free isn't the right business model, or maybe you should look into what those people are using that scraped data for and see if you can offer it better and cheaper.

The best way to stop someone trying to make a buck on your hard work is to go direct to their customers and do a better job. If you can't, what they're selling is something on top of your offering and you aren't serving that market, and you either should start serving it, or make a deal so the scrapers can continue to do it without impacting your service.

As someone who had to do scraping in the past, and went through having a free open API that served our needs perfectly replaced with an account-based one that required we make 100x the queries, it was really frustrating that the company refused to even respond to queries about specific business accommodations for data.

[+] charcircuit|4 years ago|reply
Here are two use cases why I scrape YouTube.

- There is no external API for getting scheduled streams or when they have gone live AFAIK. This lets me be notified of new stuff to watch.

- The API for getting a channel's members is locked down. I applied for access to it 6 months ago and haven't heard anything about it from YouTube so I just scrape it to give members perks.

[+] KieranMac|4 years ago|reply
There are pro-social and anti-social uses of web scraping. If you have ever used Kayak or any other price discovery or price comparison website, you've relied on web scraping to provide you a service.
[+] mrtksn|4 years ago|reply
When I want to do web scraping, it is because I have an idea to build on top of the content of the website I would like to scrape.

Let's say you made a recipes website and I would like to build an app that will order the ingredients for a meal.

It would be useful to extract the recipes, so that I can create experiences like users picking a meal and have the ingredients delivered.

I guess I can't show your recipes, as that could be copyright infringement, but I can link to you and sell the tomatoes.

Also, although copying someone's work is unethical and likely illegal, there is nothing unethical or illegal about using computers to analyse the data out there. I should be able to analyse recipe publications just as I can measure air pollution. Web scraping comes in because the semantic web never happened.

I think we should all be able to use other people's work to build something else on top of it. Of course, I do not advocate outright taking it and reselling it as our own.

For example, I would like to be able to create an app with Netflix content but obviously I don't expect to be able to stream their content as if it is mine. What I should be able to do is to create an app with an experience designed by me that lets you stream their movies if you pay them.

[+] julianeon|4 years ago|reply
Because there would be no Internet search (no search engines, no Google Search) and essentially no Internet bigger than a hobbyist DARPA without web scraping.
[+] Chris2048|4 years ago|reply
> people who are trying to build businesses on top of other people's innovation and data

How would scraping, say, Reddit differ from the business model of Reddit itself?

> those that are doing it to my platforms are doing so solely to steal data

What kind of data are you talking about?