In the past we built and operated Greece’s largest search engine (Trinity), and we would crawl/refresh all Greek pages fairly regularly.
If memory serves, the refresh frequency was computed for clusters of pages from the same site, and it depended on how often they were updated (news sites’ front pages were in practice different in successive updates, whereas e.g. users’ homepages rarely were), and on how resilient the sites were to aggressive indexing (if they’d fail or time out, or if downloading the page contents took longer than we expected based on site-wide aggregated metrics, we’d adjust the frequency, etc.).
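Sketched in Python, that kind of adaptive refresh policy might look like this (the function name, thresholds, and bounds are illustrative guesses, not Trinity’s actual code):

```python
def next_refresh_interval(interval_s, content_changed, fetch_failed,
                          fetch_time_s, site_avg_fetch_time_s,
                          min_s=300, max_s=7 * 86400):
    """Return the next refresh interval (seconds) for a cluster of pages:
    back off when the site struggles, refresh sooner when content changes."""
    if fetch_failed or fetch_time_s > 2 * site_avg_fetch_time_s:
        interval_s *= 2        # site failed or is slow: be gentler
    elif content_changed:
        interval_s /= 2        # page changed since last fetch: come back sooner
    else:
        interval_s *= 1.5      # unchanged (e.g. a static homepage): visit less often
    return max(min_s, min(max_s, interval_s))
```

One instance of this decision would run per site cluster after every fetch, feeding the result back into the scheduler.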
The crawlers were all draining multiple queues, but URLs from the same site would always end up on the same queue (via consistent hashing, based on the hostname’s hash), so a single crawler process was responsible for throttling requests and respecting robots.txt rules for any single site, without any need for cross-crawler state synchronisation.
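A minimal Python sketch of that routing idea (simple modulo hashing over the hostname for illustration; a real consistent-hashing ring would also keep assignments stable when queues are added or removed):

```python
import hashlib
from urllib.parse import urlsplit

def queue_for_url(url: str, num_queues: int) -> int:
    """Route a URL to a crawl queue by hashing its hostname, so every URL
    from the same site always lands on the same queue (and thus the same
    crawler process handles that site's throttling and robots.txt)."""
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_queues
```

Because the mapping depends only on the hostname, no two crawler processes ever hit the same site concurrently.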
In practice this worked quite well. Also, this was before Google with its PageRank, and before social networks (we’d probably also have considered page popularity based on PageRank-like metrics and social ‘signals’ in the frequency computation, among other variables).
In the current web, sites like Amazon are so large that you'll need many crawlers. On the plus side, it appears that almost all large sites don't have rate limits.
In my experience the best way to crawl politely is to never use an asynchronous crawler. The vast majority of small to medium sites out there have absolutely no protection from an aggressive crawler. If you make 50 to 100 requests per second, chances are you’re DDoS-ing the shit out of most sites.
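For example, a deliberately synchronous fetch loop can enforce a minimum delay per site (a hypothetical sketch, not any particular crawler’s code):

```python
import time

class PoliteThrottle:
    """Enforces a minimum delay between consecutive requests to one site."""

    def __init__(self, min_delay_s=1.0):
        self.min_delay_s = min_delay_s
        self._last = 0.0  # monotonic timestamp of the previous request

    def wait(self):
        """Block just long enough to stay under 1/min_delay_s requests/sec."""
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_delay_s:
            time.sleep(self.min_delay_s - elapsed)
        self._last = time.monotonic()
```

One instance per site, with every fetch calling `wait()` first; at the default of one second per request you’re nowhere near the 50–100 requests per second that knocks sites over.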
As for robots.txt, the problem is that most sites don’t even have one. Especially e-commerce sites. They also don’t have a sitemap.xml, in case you don’t want to hit every URL just to find the structure of the site. Being polite in many cases takes considerable effort.
Scrapy is asynchronous, but it provides many settings that you can use to avoid DDoS-ing a website, such as limiting the number of simultaneous requests for each domain or IP address.
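For instance (these are real Scrapy setting names; the values are just conservative examples, not recommendations from the Scrapy docs):

```python
# Scrapy settings that throttle a crawl per domain/IP.
POLITE_SETTINGS = {
    "CONCURRENT_REQUESTS_PER_DOMAIN": 2,     # parallel requests per domain
    "CONCURRENT_REQUESTS_PER_IP": 2,         # if non-zero, limits per IP instead
    "DOWNLOAD_DELAY": 1.0,                   # seconds between requests to a site
    "AUTOTHROTTLE_ENABLED": True,            # adapt the delay to server latency
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 1.0,  # aim for ~1 in-flight request
    "ROBOTSTXT_OBEY": True,                  # fetch and respect robots.txt
}
```

Drop these into a project’s `settings.py` (or pass them as `custom_settings` on a spider) and Scrapy does the throttling for you despite being asynchronous underneath.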
And yes, crawling politely requires a bit of effort from both ends: the crawler and the website.
Search engine crawlers use adaptive politeness: start being very polite, and ramp up parallel fetches if the site responds quickly and has a lot of pages.
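One way to sketch that ramp-up is an AIMD-style rule (my illustration, not the documented algorithm of any particular engine): add one parallel fetch while the site is fast and healthy, halve on errors or slow responses:

```python
def adjust_concurrency(current, response_time_s, ok,
                       slow_threshold_s=1.0, max_parallel=16):
    """Additive-increase / multiplicative-decrease politeness:
    grow parallelism slowly, back off sharply when the site struggles."""
    if ok and response_time_s < slow_threshold_s:
        return min(current + 1, max_parallel)   # healthy: one more parallel fetch
    return max(current // 2, 1)                 # error or slow: halve, floor at 1
```

Starting from a concurrency of 1, a responsive site with many pages gets crawled faster over time, while a struggling site is quickly backed off.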
See also Tuesday's HN discussion on the ethics of data scraping (https://news.ycombinator.com/item?id=12345952), in which Hacker News is completely split on whether data scraping is ethical even if the Terms of Service explicitly forbid it.
I wouldn't say completely split. I think most of HN considers the current state of scraping law to be complete and utter hogwash. Many of the expected consumer rights don't apply online because of the way the law considers normal communication with a server on the internet an excursion onto private property.
We need a modern law addressing these issues instead of the pre-Internet CFAA. Malicious actors should still be punished, and it may be reasonable to keep a provision for civil (not criminal) liability for large-scale accidental DoS from poorly implemented scrapers, but users should be free to choose their own browsing devices -- even if those browsing devices are highly optimized to extract only the specific pieces of data that the user cares about.
This law should also clarify that normal communication over HTTP cannot be punished unless the plaintiff can demonstrate real and serious interruption to their services, that local RAM copies that are never externally transmitted cannot be considered infringing in themselves, that hosting a site on the internet grants an implied copyright license to read and access its content with any HTTP-capable client, and that browsewrap/clickwrap contracts are unenforceable unless the user undertakes a significant relationship with the company, among other things.
Reading the previous thread again, I suppose that many of those against scraping didn't realize they've already lost: with Ghost, Phantom, and now headless Chrome, you're going to have a hard time detecting a well-built scraper.
Instead of fighting against scrapers that don't want to harm you, maybe it's about time to invest in your robots.txt and cooperate.
You could say that scraping your website is FORBIDDEN, but come on: if Airbnb can rent houses, I can scrape your site.
>Reading the previous thread again, I suppose that many of those against scraping didn't realize they've already lost: with Ghost, Phantom, and now headless Chrome, you're going to have a hard time detecting a well-built scraper.
Unfortunately, if you're scraping some data that only has one authoritative data source, they'll know you're scraping them even if they can't distinguish your individual requests from the general traffic.
This is what happened to my company. It didn't stop them from pretending that we were setting their servers on fire, even though they had no way to know whether we were or not since they couldn't distinguish our traffic from that generated by other browsers.
We were scraping only factual data, in which the company cannot hold a copyright interest. Nonetheless, under Ticketmaster v. RMG, just holding a copy of a page in RAM long enough to parse it constitutes infringement (you have to prove fair use, as Google supposedly did in Perfect 10 v. Google, to avoid this).
The difference between yourself and Google/Airbnb is that the latter have a lot of money and are trendy technology companies, and you don't and aren't (yet).
The lesson is: become really big before someone sues you, and the judiciary will be on your side.
It depends on your definition of harm. When your product is what's published on the websites and you regularly find ripoffs of said website publishing your ripped off content, maybe you'd feel differently about it.
I worked on a research project to develop a web-scale "google" for scientific data, and we found very interesting things in robots.txt files, from "don't crawl us" to "crawl 1 page every other day" or, even better, "don't crawl unless you're google".
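Directives like those can be read with Python's standard urllib.robotparser; the robots.txt content below is made up to mirror the examples above:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: Googlebot may fetch everything, everyone else
# is locked out, and a Crawl-delay asks for one page every other day.
robots_txt = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
Crawl-delay: 172800
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("Googlebot", "https://example.org/page"))      # True
print(rp.can_fetch("research-bot", "https://example.org/page"))   # False
print(rp.crawl_delay("research-bot"))  # 172800 seconds = 2 days
```

A polite crawler checks `can_fetch()` before every request and feeds `crawl_delay()` into its scheduler.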
Another thing we noticed is that Google's crawler is kind of aggressive; I guess they are in a position to do it.
This is why I think Google's position as the #1 search engine will never go away. Many sites will tell your bot to go away if you're not Google. They don't care if you're building a search engine that will compete with Google.
The current protocols promote data exchange, and since websites are primarily designed to be consumed, there is really no way to stop automated requests. Even companies like Distil[1] Networks that parse in-flight requests have trouble stopping any sufficiently motivated outfit.
I think data should be disseminated and free info exchange is great. Devs should respect website owners as much as possible, although in my experience people seem more willing to rip off large "faceless" sites than mom-and-pops -- both because that is where the valuable data is, and because it seems more justifiable, even if morally gray.
Regardless, the thing I find most interesting is that Google is most often criticized for selling user data / selling out their users' privacy. However, it is rarely mentioned that Googlebot and the army of Chrome browsers are not only permitted but encouraged to crawl all sites except a scant few that have achieved escape velocity. Sites that wish to protect their data must disallow and forcibly stop most crawlers except Google, otherwise they will be unranked. This creates an odd dichotomy where not only does Google retain massive leverage, but another search engine or aggregator has more hurdles and fewer resources to compete.
[1] They protect crunchbase and many media companies.
If you're worried about your web scraper being a pain in the ass to administrators, they probably need to rethink the way they have their website set up.
markpapadakis|9 years ago
greglindahl|9 years ago
atmosx|9 years ago
elorant|9 years ago
stummjr|9 years ago
jsargiox|9 years ago
greglindahl|9 years ago
zeroxfe|9 years ago
minimaxir|9 years ago
cookiecaper|9 years ago
emodendroket|9 years ago
unknown|9 years ago
[deleted]
tangue|9 years ago
cookiecaper|9 years ago
FussyZeus|9 years ago
betolink|9 years ago
Our paper in case someone is interested: Optimizing Apache Nutch for domain specific crawling at large scale (http://ieeexplore.ieee.org/document/7363976/?arnumber=736397...)
AznHisoka|9 years ago
vonklaus|9 years ago
libeclipse|9 years ago
novaleaf|9 years ago
It's a bit more "raw" than Scrapinghub but full featured and cheap.
Disclaimer: I'm the author!