(no title)
dilatedmind | 4 years ago
The system had a couple dozen worker processes doing the scraping, and one coordinator, which maintained a queue of pages that needed to be scraped. There was some logic to balance requests across sites, so we weren't making more than one request per second to any site in particular. The coordinator just exposed a REST API endpoint, which the workers would hit to get their next job and to return whatever data they'd collected.
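The coordinator's per-site balancing could look something like the sketch below. This is a hypothetical reconstruction, not the original code: the `Coordinator` class, its method names, and the one-queue-per-site design are all assumptions, based only on the description of a job queue that hands out at most one URL per site per second.

```python
import time
import collections
from urllib.parse import urlparse


class Coordinator:
    """Hypothetical sketch of the coordinator's per-site rate limiting.

    Keeps one FIFO queue per site and only hands out a URL for a site
    once at least `min_interval` seconds have passed since that site
    was last served.
    """

    def __init__(self, min_interval=1.0, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock  # injectable for testing
        self.queues = collections.defaultdict(collections.deque)  # site -> URLs
        self.last_served = {}  # site -> timestamp of last job handed out

    def enqueue(self, url):
        site = urlparse(url).netloc
        self.queues[site].append(url)

    def next_job(self):
        """Return the next URL whose site is outside its cooldown, or None."""
        now = self.clock()
        for site, queue in self.queues.items():
            if not queue:
                continue
            last = self.last_served.get(site, float("-inf"))
            if now - last >= self.min_interval:
                self.last_served[site] = now
                return queue.popleft()
        return None
```

In the real system `next_job` would sit behind the REST endpoint the workers poll; a worker getting `None` back would presumably just sleep briefly and retry.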
Each worker process was run on a separate AWS instance; I believe it was a t2 with unlimited CPU enabled. These are only a few dollars a day, and it was necessary to have as many IP addresses as possible (at least 5% of the sites we were scraping had some preventative measures in place, but they all seemed to be IP based).
anaxag0ras | 4 years ago
I wonder if this kind of process is cheaper on Lambda.