top | item 10224921

adamseabrook | 10 years ago

meanpath.com can do around 200 million pages per day using 13 fairly average dedicated servers. We only crawl the front page (mile wide, inch deep), so the limiting factor is actually DNS: looking at the network traffic, bandwidth is split evenly between DNS and HTTP. Google Public DNS will quickly rate-limit you, so you need to run your own resolvers (we use Unbound).
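For a local Unbound resolver on a crawl box, a minimal sketch of the relevant unbound.conf knobs (the values here are illustrative, not meanpath's actual settings):

```
server:
    # Only answer the local crawler, not the wider network.
    interface: 127.0.0.1
    # More threads and a wider outgoing port range help with
    # the huge volume of one-shot lookups a crawler generates.
    num-threads: 4
    outgoing-range: 8192
    # Generous caches so repeat lookups for popular NS/CDN names are free.
    msg-cache-size: 256m
    rrset-cache-size: 512m
```

The main point is to keep DNS traffic off any shared public resolver so you are only rate-limited by authoritative servers, not by an intermediary.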

Unlike Blekko, we are just capturing the source and dumping it into a DB without doing any analysis. As soon as you start trying to parse anything in the crawl data, your hardware requirements go through the roof. GNU parallel with wget or curl is enough to crawl millions of pages per day. I often use http://puf.sourceforge.net/ when I need to do a quick crawl.

"puf -nR -Tc 5 -Tl 5 -Td 20 -t 1 -lc 200 -dc 5 -i listofthingstodownload" will easily do 10-20 million pages per day if you are spreading your requests across a lot of hosts.
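For the GNU-parallel-with-curl route mentioned above, a minimal sketch (shown with the portable `xargs -P` as a stand-in for `parallel`; the file path, worker count, and timeouts are illustrative):

```shell
# Build a tiny demo URL list (stand-in for a real multi-million-line crawl list).
printf '%s\n' http://example.com http://example.org > /tmp/urls.txt

# Print the curl invocations that would run; delete the leading "echo"
# to actually fetch. -P sets the number of parallel workers; the tight
# connect/total timeouts keep slow hosts from stalling a batch.
xargs -P 50 -I{} \
    echo curl -sS --connect-timeout 5 --max-time 20 -o /dev/null {} \
    < /tmp/urls.txt
```

As with the puf flags above, the throughput win comes from spreading many concurrent, short-timeout requests across many distinct hosts rather than hammering any one of them.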

greglindahl | 10 years ago

We used djbdns on every crawl machine and did not find DNS to be limiting at all. You should also make sure there is no connection tracking, no firewall or middlebox doing anything connection-based, and no NAT: ideally nothing but a raw pipe between your machines and the Internet.
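One quick, Linux-specific way to spot connection tracking on a crawl box is to look for the netfilter conntrack sysctls; a rough diagnostic sketch (the path check and messages are illustrative, not from the original post):

```shell
# If the conntrack sysctl directory exists, the kernel is tracking
# connections and its table size can throttle a high-flow crawler.
if [ -d /proc/sys/net/netfilter ]; then
    echo "conntrack sysctls present: compare nf_conntrack_max to your expected concurrent flows"
else
    echo "no netfilter conntrack sysctls found: kernel path looks clean"
fi
```

A crawler opens and closes flows far faster than a typical server, so a full conntrack table silently drops new connections long before bandwidth runs out.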