top | item 17497397

livando | 7 years ago

"Multi-threading is a must, when scraping at scale."

I disagree on this point. Starting with a single-threaded model allowed my team to scale quickly and with little additional overhead. What we lost in performance we gained in simplicity and developer productivity. That being said, tuning and porting portions of the app to a multi-threaded system is slated to take place within the next year.

Start single-threaded and simple; move to multi-threaded scrapers when the juice is worth the squeeze.

pdimitar | 7 years ago

Or use a language where fully utilizing all CPU cores is transparent, like Elixir? There's zero complexity, you basically add 4-5 lines of code and that's it. Honestly, not exaggerating.

I've written several very amateur scrapers over the last few years, and I am never going back to languages with a global interpreter lock, ever.

iooi | 7 years ago

I'm assuming you're talking about Python, which is also "4-5 lines" to use multithreading or multiprocessing. Can you explain what's wrong with GIL languages?

Now that I think about it, it's even less than 4 lines:

    from multiprocessing.pool import Pool  # or ThreadPool, from the same module

    pool = Pool()
    pool.map(scrape, urls)
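For anyone who wants to run that snippet, here is a self-contained sketch. The `scrape` function and the URL list are placeholders (a real scraper would fetch and parse the page; the `sleep` stands in for network latency), but it also illustrates the answer to the GIL question: blocking I/O releases the lock, so threads overlap the waits.

```python
import time
from multiprocessing.pool import ThreadPool  # swap in Pool for CPU-bound work

def scrape(url):
    # Placeholder: a real scraper would fetch the page and parse the HTML.
    time.sleep(0.2)  # simulated network wait; the GIL is released here
    return len(url)

urls = ["https://example.com/%d" % i for i in range(10)]

start = time.monotonic()
with ThreadPool(10) as pool:
    results = pool.map(scrape, urls)
elapsed = time.monotonic() - start

# With 10 threads the waits overlap: roughly 0.2s total rather than ~2s
# serially, GIL notwithstanding, because blocking I/O releases the lock.
print(results, round(elapsed, 2))
```

`ThreadPool` is usually the better fit for scraping since the work is I/O-bound; `Pool` (separate processes) only pays off when parsing dominates and actually needs multiple cores.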

detaro | 7 years ago

Any further information on this? Last I looked (which was a while ago), the infrastructure like HTML parsers seemed surprisingly tricky in Elixir.