mrocklin | 10 years ago
> First and foremost, it would make more sense to compare against the DataFrame API of Spark, which is very Pandas like.
It would make more sense to me to compare dask.dataframe to Spark's DataFrame. This document is comparing dask to Spark, and dask.dataframe is a relatively small part of dask.
>> "Dask gives up high-level understanding to allow users to express more complex parallel algorithms."
> I don't think this is true. Their example of complex algorithms (SVD) is not that complicated, and there are even implementations of that in Spark's MLlib directly. Spark's DAG/RDD API is essentially the low level user-facing task API.
The point here is that it's quite natural for dask users to create custom graphs (here is another example: matthewrocklin.com/blog/work/2015/07/23/Imperative/). Doing this in Spark requires digging much more deeply into its guts. This sort of work is neither idiomatic nor much intended in Spark.
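To make the "custom graphs" point concrete, here is a minimal sketch of the dict-of-tasks format that dask graphs use: keys name results, and values are either literals or `(function, arg, ...)` tuples whose string arguments refer to other keys. The toy `get` scheduler below is my own illustration, not dask's scheduler; in real code you would hand a graph like this to `dask.get` or use `dask.delayed`.

```python
from operator import add, mul

# A dask-style task graph: plain dict, keys are result names,
# tuple values are tasks of the form (callable, *args).
graph = {
    "x": 1,
    "y": 2,
    "z": (add, "x", "y"),   # z = x + y
    "w": (mul, "z", 10),    # w = z * 10
}

def get(dsk, key):
    """Toy single-threaded scheduler for the dict-of-tuples format.

    Recursively resolves any argument that names another key.
    (Illustrative only; dask's real schedulers are far more capable.)
    """
    task = dsk[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *args = task
        resolved = [get(dsk, a) if isinstance(a, str) and a in dsk else a
                    for a in args]
        return func(*resolved)
    return task

print(get(graph, "w"))  # 30
```

Because the graph is just a dict of plain Python calls, users can encode arbitrary algorithms (like the SVD example) directly, without going through a collection API.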
> Loading "terabyte" of JSON into Postgres seems pretty painful
I've found loading a terabyte of CSV into Postgres, or a terabyte of JSON into Mongo, to be quite pleasant actually. I'd be curious to know what problems you ran into.