mynameisash | 3 months ago
I'm surprised by how often people jump to Spark because "it's (highly) parallelizable!" and "you can throw more nodes at it easy-peasy!" And yet, there are so many cases where you can just do things with better tools.
Like the time a junior engineer asked for help processing hundreds of ~5GB files of JSON data, and the script turned out to be doing crazy amounts of string concatenation in Python (don't ask). It was taking something like 18 hours to run, IIRC, and writing a simple console tool to do the heavy lifting and letting Python's multiprocessing fan it out dropped the time to about 35 minutes.
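The shape of that fix is pretty generic: one worker process per file, stream each file instead of slurping it, and let a process pool handle the fan-out. A minimal sketch of the pattern (filenames, the per-record work, and the newline-delimited JSON layout are all assumptions, not details from the anecdote):

```python
import json
import multiprocessing as mp

def process_file(path):
    """Stream one newline-delimited JSON file and return (path, record count).

    The counting is a stand-in for whatever per-record work is
    actually needed; the point is to avoid loading 5GB at once.
    """
    count = 0
    with open(path) as f:
        for line in f:
            json.loads(line)  # parse one record; do the real work here
            count += 1
    return path, count

def process_all(paths, workers=None):
    """Fan the files out across worker processes, one file per task."""
    with mp.Pool(processes=workers) as pool:
        # imap_unordered yields results as files finish, so one slow
        # file doesn't hold up reporting on the others
        return dict(pool.imap_unordered(process_file, paths))
```

With I/O-bound or C-accelerated work per file, this sidesteps the GIL entirely because each file is handled in its own process, and there's no cluster to stand up.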
Right tool for the right job, people.
nijave | 3 months ago
Wrangling multiprocessing is still annoying tho