item 25375162

dilatedmind | 5 years ago

Interesting. How did the spend break down between Cloud Run and Firebase?

Did you have any limit on how many req/s you made to an individual site? It seems like that would be difficult to implement with this architecture.

How did you deal with following links in circles / avoiding scraping the same page multiple times?

I had built something similar at a previous job, recursively scraping e-commerce sites. The first thing I noticed was that some of the sites we were scraping couldn't handle more than a couple of requests a second (in particular when we hit uncached pages on sites running PHP). Other sites were quick to IP-ban.

I kept things simple: a few dozen micro instances on AWS (I think they were about $3 a day) running Puppeteer, and a single server acting as a controller, keeping a per-site queue and letting us set per-site request limits where necessary. All the state of which links had already been seen was just kept in memory. Of course everything was also persisted to a DB, so if the controller process needed to be restarted, it could restore the queue/seen state and resume.
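The controller described above could be sketched roughly like this (a hypothetical Python sketch, not the commenter's actual code; the class and method names are made up). It keeps one FIFO queue per site, an in-memory seen set for dedup, and an optional minimum interval between requests to any one site; in the real setup this state would also be mirrored to a DB so a restart can resume:

```python
import time
from collections import deque
from urllib.parse import urlparse


class CrawlController:
    """Sketch of a crawl controller: per-site FIFO queues,
    per-site rate limits, and an in-memory seen set."""

    def __init__(self, default_interval=0.0):
        self.queues = {}            # site -> deque of URLs waiting to be fetched
        self.seen = set()           # URLs already enqueued or fetched (dedup)
        self.intervals = {}         # site -> min seconds between requests
        self.last_fetch = {}        # site -> timestamp of last dispatched request
        self.default_interval = default_interval

    def set_rate_limit(self, site, min_interval):
        """Throttle a site that can't handle more than a few req/s."""
        self.intervals[site] = min_interval

    def enqueue(self, url):
        """Queue a URL unless we've already seen it; returns True if queued."""
        if url in self.seen:
            return False            # avoids scraping the same page twice
        self.seen.add(url)
        site = urlparse(url).netloc
        self.queues.setdefault(site, deque()).append(url)
        return True

    def next_url(self, now=None):
        """Hand a worker the next URL whose site's rate limit allows a
        fetch right now, or None if every site is throttled or empty."""
        now = time.monotonic() if now is None else now
        for site, queue in self.queues.items():
            if not queue:
                continue
            interval = self.intervals.get(site, self.default_interval)
            if now - self.last_fetch.get(site, float("-inf")) >= interval:
                self.last_fetch[site] = now
                return queue.popleft()
        return None


controller = CrawlController()
controller.set_rate_limit("slow.example", 1.0)   # at most 1 req/s to this site
controller.enqueue("http://slow.example/a")
controller.enqueue("http://slow.example/a")      # duplicate, ignored
controller.enqueue("http://slow.example/b")
controller.enqueue("http://fast.example/x")
```

Workers would poll `next_url()` and sleep briefly when it returns None; persisting `queues` and `seen` to the DB on each change is what lets the controller restore its state after a restart.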
