mynameisash | 3 months ago
I'm surprised by how often people jump to Spark because "it's (highly) parallelizable!" and "you can throw more nodes at it easy-peasy!" And yet, there are so many cases where you can just do things with better tools.
Like the time a junior engineer asked for help processing hundreds of ~5GB files of JSON data, and the script turned out to be doing crazy amounts of string concatenation in Python (don't ask). It was taking something like 18 hours to run, IIRC, and writing a simple console tool to do the heavy lifting and letting Python's multiprocessing fan it out dropped the time to about 35 minutes.
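The shape of that fix is pretty generic: one worker process per file, stream each file instead of slurping it, and let a process pool handle the fan-out. A minimal sketch of the pattern (filenames, the per-record work, and the newline-delimited JSON layout are all assumptions, not details from the anecdote):

```python
import json
import multiprocessing as mp

def process_file(path):
    """Stream one newline-delimited JSON file and return (path, record count).

    The counting is a stand-in for whatever per-record work is
    actually needed; the point is to avoid loading 5GB at once.
    """
    count = 0
    with open(path) as f:
        for line in f:
            json.loads(line)  # parse one record; do the real work here
            count += 1
    return path, count

def process_all(paths, workers=None):
    """Fan the files out across worker processes, one file per task."""
    with mp.Pool(processes=workers) as pool:
        # imap_unordered yields results as files finish, so one slow
        # file doesn't hold up reporting on the others
        return dict(pool.imap_unordered(process_file, paths))
```

With I/O-bound or C-accelerated work per file, this sidesteps the GIL entirely because each file is handled in its own process, and there's no cluster to stand up.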
Right tool for the right job, people.
nijave | 3 months ago
Wrangling multiprocessing is still annoying tho