In the past we built and operated Greece’s largest search engine (Trinity), and we would crawl/refresh all Greek pages fairly regularly.
If memory serves, the refresh frequency was computed for clusters of pages from the same site, and it depended on how often they were updated (news sites’ front pages were in practice different in successive updates, whereas e.g. users’ homepages rarely were), and on how resilient the sites were to aggressive indexing (if they’d fail or time out, or if downloading the page contents took longer than we expected based on site-wide aggregated metrics, we’d adjust the frequency, etc.).
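Sketched in Python, that kind of adaptive refresh policy might look like this (the function name, thresholds, and bounds are illustrative guesses, not Trinity’s actual code):

```python
def next_refresh_interval(interval_s, content_changed, fetch_failed,
                          fetch_time_s, site_avg_fetch_time_s,
                          min_s=300, max_s=7 * 86400):
    """Return the next refresh interval (seconds) for a cluster of pages:
    back off when the site struggles, refresh sooner when content changes."""
    if fetch_failed or fetch_time_s > 2 * site_avg_fetch_time_s:
        interval_s *= 2        # site failed or is slow: be gentler
    elif content_changed:
        interval_s /= 2        # page changed since last fetch: come back sooner
    else:
        interval_s *= 1.5      # unchanged (e.g. a static homepage): visit less often
    return max(min_s, min(max_s, interval_s))
```

One instance of this decision would run per site cluster after every fetch, feeding the result back into the scheduler.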
The crawlers were all draining multiple queues, but URLs from the same site would always end up on the same queue (via consistent hashing, based on the hostname’s hash), so a single crawler process was responsible for throttling requests and respecting robots.txt rules for any single site, without any need for cross-crawler state synchronisation.
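A minimal Python sketch of that routing idea (simple modulo hashing over the hostname for illustration; a real consistent-hashing ring would also keep assignments stable when queues are added or removed):

```python
import hashlib
from urllib.parse import urlsplit

def queue_for_url(url: str, num_queues: int) -> int:
    """Route a URL to a crawl queue by hashing its hostname, so every URL
    from the same site always lands on the same queue (and thus the same
    crawler process handles that site's throttling and robots.txt)."""
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_queues
```

Because the mapping depends only on the hostname, no two crawler processes ever hit the same site concurrently.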
In practice this worked quite well. Also, this was before Google with its PageRank, and before social networks (we’d probably also have considered page popularity based on PageRank-like metrics and social ‘signals’ in the frequency computation, among other variables).
In the current web, sites like Amazon are so large that you'll need many crawlers. On the plus side, it appears that almost all large sites don't have rate limits.
In my experience the best way to crawl politely is to never use an asynchronous crawler. The vast majority of small to medium sites out there have absolutely no protection from an aggressive crawler. If you make 50 to 100 requests per second, chances are you’re DDoS-ing the shit out of most sites.
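For example, a deliberately synchronous fetch loop can enforce a minimum delay per site (a hypothetical sketch, not any particular crawler’s code):

```python
import time

class PoliteThrottle:
    """Enforces a minimum delay between consecutive requests to one site."""

    def __init__(self, min_delay_s=1.0):
        self.min_delay_s = min_delay_s
        self._last = 0.0  # monotonic timestamp of the previous request

    def wait(self):
        """Block just long enough to stay under 1/min_delay_s requests/sec."""
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_delay_s:
            time.sleep(self.min_delay_s - elapsed)
        self._last = time.monotonic()
```

One instance per site, with every fetch calling `wait()` first; at the default of one second per request you’re nowhere near the 50–100 requests per second that knocks sites over.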
As for robots.txt, the problem is that most sites don’t even have one. Especially e-commerce sites. They also don’t have a sitemap.xml, in case you don’t want to hit every URL just to find the structure of the site. Being polite in many cases takes considerable effort.
Scrapy is asynchronous, but it provides many settings that you can use to avoid DDoS-ing a website, such as limiting the number of simultaneous requests for each domain or IP address.
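For instance (these are real Scrapy setting names; the values are just conservative examples, not recommendations from the Scrapy docs):

```python
# Scrapy settings that throttle a crawl per domain/IP.
POLITE_SETTINGS = {
    "CONCURRENT_REQUESTS_PER_DOMAIN": 2,     # parallel requests per domain
    "CONCURRENT_REQUESTS_PER_IP": 2,         # if non-zero, limits per IP instead
    "DOWNLOAD_DELAY": 1.0,                   # seconds between requests to a site
    "AUTOTHROTTLE_ENABLED": True,            # adapt the delay to server latency
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 1.0,  # aim for ~1 in-flight request
    "ROBOTSTXT_OBEY": True,                  # fetch and respect robots.txt
}
```

Drop these into a project’s `settings.py` (or pass them as `custom_settings` on a spider) and Scrapy does the throttling for you despite being asynchronous underneath.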
And yes, crawling politely requires a bit of effort from both ends: the crawler and the website.
Search engine crawlers use adaptive politeness: start being very polite, and ramp up parallel fetches if the site responds quickly and has a lot of pages.
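One way to sketch that ramp-up is an AIMD-style rule (my illustration, not the documented algorithm of any particular engine): add one parallel fetch while the site is fast and healthy, halve on errors or slow responses:

```python
def adjust_concurrency(current, response_time_s, ok,
                       slow_threshold_s=1.0, max_parallel=16):
    """Additive-increase / multiplicative-decrease politeness:
    grow parallelism slowly, back off sharply when the site struggles."""
    if ok and response_time_s < slow_threshold_s:
        return min(current + 1, max_parallel)   # healthy: one more parallel fetch
    return max(current // 2, 1)                 # error or slow: halve, floor at 1
```

Starting from a concurrency of 1, a responsive site with many pages gets crawled faster over time, while a struggling site is quickly backed off.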
See also Tuesday's HN discussion on the ethics of data scraping (https://news.ycombinator.com/item?id=12345952), in which Hacker News is completely split on whether data scraping is ethical even if the Terms of Service explicitly forbid it.
I wouldn't say completely split. I think most of HN considers the current state of scraping law to be complete and utter hogwash. Many of the expected consumer rights don't apply online because of the way the law considers normal communication with a server on the internet an excursion onto private property.
We need a modern law addressing these issues instead of the pre-Internet CFAA. Malicious actors should still be punished, and it may be reasonable to keep a provision for civil (not criminal) liability for large-scale accidental DoS from poorly implemented scrapers, but users should be free to choose their own browsing devices -- even if those browsing devices are highly optimized to extract only the specific pieces of data that the user cares about.
This law should also clarify that normal communication over HTTP cannot be punished unless the plaintiff can demonstrate real and serious interruption to their services, that local RAM copies that are never externally transmitted cannot be considered infringing in themselves, that hosting a site on the internet grants an implied copyright license to read and access its content with any HTTP-capable client, and that browsewrap/clickwrap contracts are unenforceable unless the user undertakes a significant relationship with the company, among other things.
Reading the previous thread again, I suppose that many of those against scraping didn't realize they've already lost: with Ghost, Phantom, and now headless Chrome, you're going to have a hard time detecting a well-built scraper.
Instead of fighting against scrapers that don't want to harm you, maybe it's about time to invest in your robots.txt and cooperate.
You could say that scraping your website is FORBIDDEN, but come on: if Airbnb can rent houses, I can scrape your site.
>Reading the previous thread again, I suppose that many of those against scraping didn't realize they've already lost: with Ghost, Phantom, and now headless Chrome, you're going to have a hard time detecting a well-built scraper.
Unfortunately, if you're scraping some data that only has one authoritative data source, they'll know you're scraping them even if they can't distinguish your individual requests from the general traffic.
This is what happened to my company. It didn't stop them from pretending that we were setting their servers on fire, even though they had no way to know whether we were or not since they couldn't distinguish our traffic from that generated by other browsers.
We were scraping only factual data, in which the company cannot hold a copyright interest. Nonetheless, under Ticketmaster v. RMG, just holding a copy of a page in RAM long enough to parse it constitutes infringement (you have to prove fair use, as Google supposedly did in Perfect 10 v. Google, to avoid this).
The difference between yourself and Google/Airbnb is that the latter have a lot of money and are trendy technology companies, and you don't and aren't (yet).
The lesson is: become really big before someone sues you, and the judiciary will be on your side.
It depends on your definition of harm. When your product is what's published on the websites and you regularly find ripoffs of said website publishing your ripped off content, maybe you'd feel differently about it.
I worked on a research project to develop a web-scale "google" for scientific data, and we found very interesting things in robots.txt files, from "don't crawl us" to "crawl 1 page every other day" or, even better, "don't crawl unless you're google".
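Directives like those can be read with Python's standard urllib.robotparser; the robots.txt content below is made up to mirror the examples above:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: Googlebot may fetch everything, everyone else
# is locked out, and a Crawl-delay asks for one page every other day.
robots_txt = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
Crawl-delay: 172800
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("Googlebot", "https://example.org/page"))      # True
print(rp.can_fetch("research-bot", "https://example.org/page"))   # False
print(rp.crawl_delay("research-bot"))  # 172800 seconds = 2 days
```

A polite crawler checks `can_fetch()` before every request and feeds `crawl_delay()` into its scheduler.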
Another thing we noticed is that Google's crawler is kind of aggressive; I guess they are in a position to do it.
This is why I think Google's position as the #1 search engine will never go away. Many sites will tell your bot to go away if you're not Google. They don't care if you're building a search engine that will compete with Google.
The current protocols promote data exchange, and since websites are primarily designed to be consumed, there is really no way to stop automated requests. Even companies like Distil[1] Networks that parse in-flight requests have trouble stopping any sufficiently motivated outfit.
I think data should be disseminated and free info exchange is great. Devs should respect website owners as much as possible, although in my experience people seem more willing to rip off large "faceless" sites than mom-and-pops -- both because that is where the valuable data is, and because it seems more justifiable, even if morally gray.
Regardless, the thing I find most interesting is that Google is most often criticized for selling user data / selling out their users' privacy. However, it is rarely mentioned that Googlebot and the army of Chrome browsers are not only permitted but encouraged to crawl all sites except a scant few that have achieved escape velocity. Sites that wish to protect their data must disallow and forcibly stop most crawlers except Google, otherwise they will be unranked. This creates an odd dichotomy where not only does Google retain massive leverage, but another search engine or aggregator has more hurdles and fewer resources to compete.
[1] They protect crunchbase and many media companies.
If you're worried about your web scraper being a pain in the ass to administrators, they probably need to rethink the way they have their website set up.
markpapadakis|9 years ago
greglindahl|9 years ago
atmosx|9 years ago
elorant|9 years ago
stummjr|9 years ago
jsargiox|9 years ago
greglindahl|9 years ago
zeroxfe|9 years ago
minimaxir|9 years ago
cookiecaper|9 years ago
emodendroket|9 years ago
unknown|9 years ago
[deleted]
tangue|9 years ago
cookiecaper|9 years ago
FussyZeus|9 years ago
betolink|9 years ago
Our paper in case someone is interested: Optimizing Apache Nutch for domain specific crawling at large scale (http://ieeexplore.ieee.org/document/7363976/?arnumber=736397...)
AznHisoka|9 years ago
vonklaus|9 years ago
libeclipse|9 years ago
novaleaf|9 years ago
It's a bit more "raw" than Scrapinghub but full featured and cheap.
Disclaimer: I'm the author!