You'd have to scrape slowly to mimic a real, slow user. At that point it might be cheaper to get Mechanical Turk to do it. That should solve IP rate limiting, captchas, and just about everything except the endless arms race: the site asks why so many people are going directly to these same-formatted internal URLs without clicking through from random other places, then changes the internal URLs and breaks it all over again.
toomuchtodo|8 years ago
Recap [1] does this to extract PACER court documents, which are public domain but hard to access because of draconian public policy.
[1] https://free.law/recap/
figgis|8 years ago
Sure, but that's easily mitigated by running multiple scrapers as different users. You don't need to get all the data from a single scrape.
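
A rough sketch of that idea in Python, for illustration only: split the URL list across several accounts so no single one requests everything, and pick a randomized pause between requests so each one looks more like a slow user. The names `partition_urls` and `human_delay`, the account labels, and the 5 to 30 second range are all made up here, not taken from any real scraper, and the actual fetching is left out.

```python
import random
from typing import Dict, List, Sequence


def partition_urls(urls: Sequence[str], accounts: Sequence[str]) -> Dict[str, List[str]]:
    """Round-robin the URLs across accounts so each account fetches only a share."""
    if not accounts:
        raise ValueError("need at least one account")
    shares: Dict[str, List[str]] = {name: [] for name in accounts}
    for i, url in enumerate(urls):
        shares[accounts[i % len(accounts)]].append(url)
    return shares


def human_delay(rng: random.Random, low: float = 5.0, high: float = 30.0) -> float:
    """Pick a pause in seconds between requests (the range is an assumption)."""
    return rng.uniform(low, high)


if __name__ == "__main__":
    # Hypothetical inputs; a real run would sleep for human_delay() and then fetch.
    work = partition_urls([f"/doc/{n}" for n in range(5)], ["user_a", "user_b"])
    rng = random.Random()
    for account, share in work.items():
        for url in share:
            print(account, url, f"wait {human_delay(rng):.1f}s")
```

Round-robin keeps each account's share roughly equal, which is the simplest split; it does nothing about the referrer pattern mentioned upthread.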