robbs | 3 years ago

IMO, this is the hardest part of maintaining a web scraper. We had ~100 scripts to scrape ~1000 clients' sites and it was, at minimum, 50 hours a week to keep up with changes.

The second hardest part was that ~30% of our clients used the same hosting provider, which would start to fail at 10-20 req/s. We had to throttle those sites by resolved IP, cluster-wide, so that every worker shared one request budget per host (a sketch of one way to do this is below).
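The commenter doesn't describe their implementation, but a shared counter in something like Redis is one common way to enforce a per-IP budget across a cluster. A minimal sketch, assuming Redis and Python; the names (MAX_REQ_PER_SEC, acquire_slot, fetch_politely) are illustrative, not theirs:

    # Cluster-wide per-IP throttle: all workers increment one counter per
    # (resolved IP, current second), so co-hosted sites share one budget.
    import socket
    import time
    from urllib.parse import urlparse

    import redis

    r = redis.Redis(host="localhost", port=6379)
    MAX_REQ_PER_SEC = 10  # the ~10-20 req/s failure point mentioned above

    def acquire_slot(url: str) -> bool:
        """Return True if this worker may fetch the URL right now."""
        host = urlparse(url).hostname
        ip = socket.gethostbyname(host)  # co-hosted sites resolve to one key
        key = f"throttle:{ip}:{int(time.time())}"
        count = r.incr(key)   # atomic across all workers
        r.expire(key, 2)      # per-second window keys clean themselves up
        return count <= MAX_REQ_PER_SEC

    def fetch_politely(url: str) -> None:
        # Back off until the cluster-wide budget for this IP has room.
        while not acquire_slot(url):
            time.sleep(0.1)
        # ... perform the actual HTTP request here ...

Keying on the resolved IP rather than the hostname is what makes this work for the shared-hosting case: a thousand distinct domains on one box still collapse into a single rate limit.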

portInit | 3 years ago

This makes sense, and I'm curious about it. Was there much consistency across those 1k client sites, or were they all rather different? Mind if I reach out?