top | item 29679418


dilatedmind | 4 years ago

I worked on a project which required some medium-scale web scraping (under 100 million pages), and went with Node primarily because of Puppeteer.

The system had a couple dozen worker processes doing the scraping, and one coordinator which maintained a queue of pages that needed to be scraped. There was some logic to balance requests between sites, so we weren't making more than a request/s to any one in particular. The coordinator just had a REST API endpoint, which the workers would hit to get their next job and to return whatever data.
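The per-site balancing in the coordinator can be sketched roughly like this, a minimal version assuming one request per second per site (the class and method names here are hypothetical, not from the actual project):

```javascript
// Sketch of a coordinator-side queue that hands out at most one job
// per site per interval. Workers would call next() via the REST endpoint.
class ScrapeQueue {
  constructor(minIntervalMs = 1000) {
    this.minIntervalMs = minIntervalMs;
    this.queues = new Map();         // site -> array of pending URLs
    this.lastDispatched = new Map(); // site -> timestamp of last job handed out
  }

  add(site, url) {
    if (!this.queues.has(site)) this.queues.set(site, []);
    this.queues.get(site).push(url);
  }

  // Return a job from any site that hasn't been dispatched within the
  // last interval, or null if every site with work is still cooling down.
  next(now = Date.now()) {
    for (const [site, urls] of this.queues) {
      if (urls.length === 0) continue;
      const last = this.lastDispatched.get(site) ?? 0;
      if (now - last >= this.minIntervalMs) {
        this.lastDispatched.set(site, now);
        return { site, url: urls.shift() };
      }
    }
    return null;
  }
}
```

A worker that gets `null` back would just sleep briefly and poll again, which keeps all the rate-limit state in one place instead of coordinating it across a couple dozen machines.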

Each worker process was run on a separate AWS instance; I believe it was a t2 with unlimited CPU enabled. These are only a few dollars a day, and it was necessary to have as many IP addresses as possible (at least 5% of the sites we were scraping had some preventative measures in place, but they all seemed to be IP based).



anaxag0ras | 4 years ago

> Each worker process was run on a separate AWS instance; I believe it was a t2 with unlimited CPU enabled

I wonder if these kinds of processes are cheaper on Lambda.