top | item 29678046

hackingforfun | 4 years ago

In terms of NodeJS vs Python, specifically for web scraping, would you choose NodeJS? If so, why?

I'm more familiar with NodeJS, but I'm working with a team that is leaning towards Python for web scraping, which is why I'm asking. They said spinning up multiple processes in Python is easier, so it will work better at scale.

I know you can use the Cluster module to spawn child processes in NodeJS, but in my experience it's a bit of a pain to use. It's not always required anyway, at least when using NodeJS only as a web server (as long as you have multiple NodeJS instances, in case one goes down). Web scraping is a bit different, though.

Curious if you have any thoughts on this.


remram | 4 years ago

NodeJS (V8) is faster than CPython, and NodeJS was built for this precise use case. I wouldn't use it over Python personally, though: I don't use NodeJS, and I would sooner reach for Rust if performance mattered.

Concurrency is an important limitation as you've noticed, but it's already a problem for CPython. You would be able to squeeze out more req/s from NodeJS than CPython, up to a point where you would need to bring in something extra to scale to all the cores of one machine (multiprocessing in Python, something like Cluster in NodeJS) which you wouldn't need in Go/Rust/Java.
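
The Python side of that ("multiprocessing" to scale past the GIL to all cores) looks roughly like this. A minimal sketch: `parse_page` is a made-up stand-in for real fetching/parsing, since the Pool mechanics are the point here, not the scraping itself.

```python
# Sketch: fanning scraping work out across cores with multiprocessing.
from multiprocessing import Pool

def parse_page(url):
    # In a real scraper this would fetch `url` and parse the response;
    # here it just returns a placeholder (url, length) record.
    return (url, len(url))

def scrape_all(urls, workers=4):
    # Pool.map distributes URLs across `workers` separate processes,
    # sidestepping the GIL for CPU-bound parsing work.
    with Pool(processes=workers) as pool:
        return pool.map(parse_page, urls)

if __name__ == "__main__":
    print(scrape_all(["https://example.com/a", "https://example.com/b"]))
```

This is the "something extra" the parent mentions: in Go/Rust/Java you'd get the same core utilization from ordinary threads, with no process-pool plumbing.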

Then of course, scaling further, you would need a system to run jobs across machines, and your choice of Go over Python wouldn't necessarily matter so much. The difference in performance wouldn't limit what you can do; it would just change what you pay for compute. If your compute costs more but your devs can implement features faster, performance is usually unimportant.

dilatedmind | 4 years ago

I worked on a project which required some medium-scale web scraping (fewer than 100 million pages), and went with node primarily because of puppeteer.

The system had a couple dozen worker processes doing the scraping, and one coordinator that maintained a queue of pages to be scraped. There was some logic to balance requests between sites, so we weren't making more than one request per second to any particular site. The coordinator just exposed a REST API endpoint, which the workers would hit to get their next job and to return whatever data they'd collected.
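
The per-site balancing described here can be sketched roughly as follows (in Python, for illustration; the names `JobQueue` and `next_job` are made up, not from the actual project). The clock is injectable so the cooldown logic can be exercised without real sleeps.

```python
# Sketch of a coordinator queue that hands out the next URL whose host
# hasn't been hit within the last second.
import time
from collections import deque
from urllib.parse import urlparse

class JobQueue:
    def __init__(self, min_interval=1.0, clock=time.monotonic):
        self.pending = deque()
        self.last_hit = {}              # host -> time of last dispatch
        self.min_interval = min_interval
        self.clock = clock              # injectable for testing

    def add(self, url):
        self.pending.append(url)

    def next_job(self):
        # Scan for the first URL whose host is outside its cooldown;
        # skipped URLs rotate to the back of the queue.
        now = self.clock()
        for _ in range(len(self.pending)):
            url = self.pending.popleft()
            host = urlparse(url).netloc
            if now - self.last_hit.get(host, float("-inf")) >= self.min_interval:
                self.last_hit[host] = now
                return url
            self.pending.append(url)
        return None  # every queued host is still cooling down
```

A worker would then just poll the coordinator's endpoint in a loop, getting `None` back (or a retry-after) when all hosts are cooling down.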

Each worker process was run on a separate AWS instance; I believe it was a t2 with unlimited CPU enabled. These are only a few dollars a day, and it was necessary to have as many IP addresses as possible (at least 5% of the sites we were scraping had some preventative measures in place, but they all seemed to be IP-based).

anaxag0ras | 4 years ago

> Each worker process was run on a separate AWS instance; I believe it was a t2 with unlimited CPU enabled

I wonder if these kinds of processes would be cheaper on Lambda.

huetius | 4 years ago

I’ve never done web scraping professionally, but I can relay that any time I’ve tried to use Python for anything involving concurrency, except in the most trivial cases, it has been pure pain.