I can't provide details on any innovations we've done with sites like google, but in general if you want to crawl google you'll want to get "many, many" IP addresses. I've heard of people using services like 2captcha.com but the best way is to obfuscate who you are.
If you can hit Google 60 times per minute per IP before getting blocked and you need to crawl them 1000 times per minute, you need 17 IPs per hour. Randomize headers to look like real people coming from schools, office buildings, etc... Lots of work but possible.
I do it using rotating proxies, stripping cookies between requests, randomly varying the delay between requests, randomly selecting a valid user-agent string, etc. It's a pain in the butt. And to scrape more than I do, faster than I do, would be pretty freaking expensive in terms of time and money.
Note that Google is pretty aggressive about captcha-ing "suspicious" activity and/or throttling responses to suspicious requests. You can easily trigger a captcha with your own manual searching. Just search for something, go to page 10, and repeat maybe 5-20 times and you'll see a captcha challenge.
If Google gets more serious about blocking me then I'll use ML to overcoming their ML (which should be doable because they're always worried about keeping Search consumer-friendly).
wrath|8 years ago
If you can hit Google 60 times per minute per IP before getting blocked and you need to crawl them 1000 times per minute, you need 17 IPs per hour. Randomize headers to look like real people coming from schools, office buildings, etc... Lots of work but possible.
laughfactory|8 years ago
Note that Google is pretty aggressive about captcha-ing "suspicious" activity and/or throttling responses to suspicious requests. You can easily trigger a captcha with your own manual searching. Just search for something, go to page 10, and repeat maybe 5-20 times and you'll see a captcha challenge.
If Google gets more serious about blocking me then I'll use ML to overcoming their ML (which should be doable because they're always worried about keeping Search consumer-friendly).
futhey|8 years ago
bdcravens|8 years ago
sergiotapia|8 years ago