adamseabrook | 10 years ago
Unlike Blekko, we are just capturing the source and dumping it into a DB without doing any analysis. As soon as you start trying to parse anything in the crawl data, your hardware requirements go through the roof. GNU parallel with wget or curl is enough to crawl millions of pages per day. I often use http://puf.sourceforge.net/ when I need to do a quick crawl.
"puf -nR -Tc 5 -Tl 5 -Td 20 -t 1 -lc 200 -dc 5 -i listofthingstodownload" will easily do 10-20 million pages per day if you are spreading your requests across a lot of hosts.
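For the wget route, the fan-out pattern is the same idea: feed a URL list to many concurrent fetchers. A minimal sketch, assuming a hypothetical urls.txt with one URL per line (shown with `echo` in front of wget as a dry run that prints the commands; drop the `echo` to actually fetch, and raise -P toward your connection limit):

```shell
# urls.txt: one URL per line (hypothetical example inputs)
printf 'http://example.com/a\nhttp://example.com/b\n' > urls.txt

# -P 200: up to 200 concurrent processes; -n 1: one URL per wget invocation.
# --timeout/--tries keep slow or dead hosts from stalling the crawl.
xargs -P 200 -n 1 echo wget --timeout=20 --tries=1 -q < urls.txt
```

GNU parallel works the same way (`parallel -j 200 wget ... {} < urls.txt`); xargs -P is just more universally installed. Either way, as with puf, throughput depends on spreading requests across many hosts so no single server is hammered.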
greglindahl | 10 years ago