top | item 16180839

averagewall | 8 years ago

You'd have to scrape slowly to mimic a real slow user. Maybe at that point it'd be cheaper to get Mechanical Turk workers to do it. That should solve IP rate limiting, captchas, and just about everything except the endless arms race. The site can still ask: why are so many people going directly to these same-formatted internal URLs without clicking through from random other places? Then it can change the internal URLs and break it all over again.
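The "scrape slowly" idea can be sketched as a pacing loop with randomized delays between requests; the delay numbers and the `fetch` callable here are illustrative assumptions, not anything from the thread:

```python
import random
import time

def human_delay(base=8.0, jitter=4.0):
    """Return a randomized pause (seconds) meant to look like a slow human reader."""
    return base + random.uniform(0, jitter)

def scrape_slowly(urls, fetch, base=8.0, jitter=4.0, sleep=time.sleep):
    """Fetch each URL with a human-like pause in between.

    `fetch` is whatever page-download function you already have;
    `sleep` is injectable so the pacing can be tested without waiting.
    """
    pages = []
    for url in urls:
        pages.append(fetch(url))
        sleep(human_delay(base, jitter))
    return pages
```

The injectable `sleep` keeps the pacing logic separate from actual waiting, which also makes it easy to swap in per-site rate budgets later.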

toomuchtodo | 8 years ago

You'd use a browser extension, scoped to requests of sites you're interested in, and stream your data back to your infrastructure for processing. You're limited only by your install base and your ingest infrastructure.

Recap [1] does this to extract PACER court documents that are public domain, but access is restricted due to draconian public policy.

[1] https://free.law/recap/
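The extension-to-infrastructure pattern described above can be sketched as a small client-side buffer that batches captured pages and hands them to a transport; the record fields and batch size are assumptions for illustration, not RECAP's actual schema:

```python
import json

class IngestBuffer:
    """Accumulate captured page records and flush them in batches.

    The `send` callable stands in for whatever transport you use
    (an HTTP POST to your ingest endpoint, a message queue, etc.).
    """
    def __init__(self, send, batch_size=10):
        self.send = send
        self.batch_size = batch_size
        self.pending = []

    def capture(self, url, body):
        """Record one captured page; flush automatically when the batch fills."""
        self.pending.append({"url": url, "body": body})
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Serialize and ship any pending records, then clear the buffer."""
        if self.pending:
            self.send(json.dumps(self.pending))
            self.pending = []
```

Batching keeps the ingest side simple: each upload is one JSON array, and the extension only needs network access when a batch fills.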

figgis | 8 years ago

>You'd have to scrape slowly to mimic a real slow user.

Sure, but that's easily worked around by running multiple scrapers as different users. You don't need to get all the data from a single scrape.
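Splitting the work across several scraper identities, as suggested here, amounts to partitioning the target URLs so no single identity fetches fast enough to stand out; a minimal round-robin sketch:

```python
def partition_urls(urls, n_workers):
    """Round-robin the target URLs across n scraper identities.

    Each shard can then run its own slow-paced scraper behind its own
    IP or session, so per-identity request rates stay low.
    """
    shards = [[] for _ in range(n_workers)]
    for i, url in enumerate(urls):
        shards[i % n_workers].append(url)
    return shards
```

With N identities each pacing itself like a slow human, aggregate throughput scales with N while each identity's traffic still looks unremarkable on its own.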