top | item 35346858

kvnkho | 2 years ago

Hi whinvik, we agree that development in Spark is hard, and that is part of the motivation of Fugue. Spark code couples the distributed orchestration and business logic together.

If you keep your code in native Python or Pandas, the business logic is much easier to develop, debug, and maintain because your tracebacks stay in native Python. Fugue then takes it to Spark when you are ready to scale.
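A minimal sketch of the workflow described above, using Fugue's `transform` function (the column names, schema, and tax logic here are made up for illustration):

```python
import pandas as pd

# Plain-pandas business logic: no Spark imports, easy to unit test,
# and any traceback is a native Python traceback.
def add_price_with_tax(df: pd.DataFrame, rate: float = 0.1) -> pd.DataFrame:
    return df.assign(price_with_tax=df["price"] * (1 + rate))

prices = pd.DataFrame({"item": ["a", "b"], "price": [10.0, 20.0]})

# Develop and debug locally in pure pandas first.
local_result = add_price_with_tax(prices)

# When ready to scale, Fugue can run the same unchanged function on Spark.
# Guarded broadly since fugue/pyspark may not be installed in this environment.
try:
    from fugue import transform

    result = transform(
        prices,
        add_price_with_tax,
        schema="*,price_with_tax:double",
        params={"rate": 0.1},
        engine="spark",  # drop this to run locally on pandas
    )
except Exception:
    result = local_result
```

The point of the pattern is that `add_price_with_tax` never imports Spark: the distributed orchestration lives entirely in the `transform` call.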

whinvik | 2 years ago

I appreciate your response but that is not what I was getting at. I understand that with this you only have to write Pandas and then not worry about scaling.

First, I think PySpark syntax is much better than the insanity that is Pandas, but if you really like Pandas you can always use a Pandas UDF, which Spark supports.
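For reference, the Pandas UDF route mentioned above looks roughly like this (the pricing function and column names are invented for illustration; the guarded section assumes pyspark and pyarrow are installed):

```python
import pandas as pd

# Plain pandas logic on a Series: testable without any Spark machinery.
# 1.5 (a hypothetical 50% markup) is exact in floating point.
def with_tax(price: pd.Series) -> pd.Series:
    return price * 1.5

# Verify locally first.
sample = pd.Series([10.0, 20.0])
local = with_tax(sample)

# Spark can run the same function vectorized over Arrow batches as a
# pandas UDF. Guarded broadly since a Spark runtime may not be available.
try:
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    sdf = spark.createDataFrame([(10.0,), (20.0,)], ["price"])
    with_tax_udf = pandas_udf(with_tax, returnType="double")
    sdf.select(with_tax_udf("price").alias("price_with_tax")).show()
    spark.stop()
except Exception:
    pass
```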

But let's say that writing only in Pandas is the preferred way. Now comes the magic part. How do I know that it is using the best join? Will it optimize for spills? Will there be OOMs? These are the things we need to worry about, and they often force us to dig deep into Spark's internals.

Now if there's another level of magic on top of that, namely Pandas-to-Spark transpiling as I imagine you do here, then I have even less of an idea how to tune it.

Again, I appreciate that you are solving a specific problem in a nice way, but I feel like we are actually making the problem even more complicated.