
Dask – A flexible library for parallel computing in Python

232 points | gjvc | 4 years ago | dask.org

85 comments

[+] MrPowers|4 years ago|reply
I recently switched from Spark to Dask because I like it so much. Disclosure: I work for Coiled, the company founded by Matt Rocklin, the creator of Dask.

Here's what I've found thus far with Dask:

* It's way easier to build custom distributed compute systems with Dask than with the alternatives. The Dask futures and delayed APIs give you access to the "engine", so you can build your own custom race car.

* Lots of data scientists are a lot more productive with Dask compared to Java / Scala technologies they're not comfortable with (I have a popular Spark blog / Spark books & love Spark, but some ppl just can't get productive with Spark)

* Dask cluster visualizations are so nice. So easy to understand how clusters are being computed in real time.

* We've been working closely with NVIDIA folks to provide real cutting edge GPU support.

I'm a believer in multi-tech data pipelines in the future. A data engineering team that loves Scala may write some ETL code in Spark and then pass off the baton to a data science team that loves Python and uses Dask to train machine learning models.

Seems like most other players in the data space want to take over an organization's entire data platform. I like how Dask likes to play nicely as part of the overall PyData ecosystem. I've always liked the Unix philosophy of building little tools that can be easily composed to solve a variety of different problems and feel at home in the PyData ecosystem.
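The futures/delayed "engine" access mentioned above can be sketched roughly like this. A minimal, hypothetical pipeline: the `load`/`clean`/`total` functions and their data are made up for illustration.

```python
import dask

# dask.delayed wraps ordinary Python functions into a lazy task graph,
# which is the kind of low-level "engine" access described above.

@dask.delayed
def load(part):
    # stand-in for reading one partition of data
    return list(range(part * 10, part * 10 + 10))

@dask.delayed
def clean(rows):
    # stand-in for a per-partition transformation
    return [r for r in rows if r % 2 == 0]

@dask.delayed
def total(parts):
    return sum(sum(p) for p in parts)

# Nothing runs until .compute(); the same graph can then execute on
# threads, processes, or a distributed cluster.
result = total([clean(load(i)) for i in range(4)])
print(result.compute())  # 380
```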

[+] faizshah|4 years ago|reply
Dask is really good for scientists. When I worked with bioinformaticians, it was way easier to get them to use Dask for out-of-memory processing than Spark. In particular, dask.bag offers a great deal of flexibility for ETL use cases.

The problem with Dask for me, though (as a user of both Dask and Prefect), is that I was never able to get the throughput of PySpark out of Dask. With Dask I can reach 100 MB/s on my laptop, while PySpark can reach 260 MB/s for the same workload (cleaning and restructuring). That was around two years ago, though; I'm curious what the community is experiencing now.

Also, for most use cases your data isn't really big data, and Dask is much simpler and fast enough, which is why I reach for Dask first.
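The dask.bag flexibility mentioned above looks roughly like this. A hedged sketch: the records and field names are made up.

```python
import dask.bag as db

# Semi-structured ETL with dask.bag: filter out malformed records and
# restructure the rest, in parallel across partitions.
records = db.from_sequence(
    [{"id": 1, "val": "3"}, {"id": 2, "val": "x"}, {"id": 3, "val": "7"}],
    npartitions=2,
)

cleaned = (
    records
    .filter(lambda r: r["val"].isdigit())                  # drop bad rows
    .map(lambda r: {"id": r["id"], "val": int(r["val"])})  # restructure
)
print(cleaned.compute())  # [{'id': 1, 'val': 3}, {'id': 3, 'val': 7}]
```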

[+] deepsun|4 years ago|reply
I found Dask to be like a half-done Spark. Dask kinda works with regular dataframes, but there are so many inconsistencies that it almost always breaks on a working Pandas algorithm. It's just easier to install PySpark and work with it. IMHO.
[+] semi-extrinsic|4 years ago|reply
I tried to use Dask to do a format conversion of a 5 TB binary dataset coming from an astrophysical simulation. It should have been a trivial job: it was a single huge 3D array spread across a lot of files that each sliced the array in one direction, and I needed to apply a conversion at each point, do a couple of transposes and reshapes, and then output a single file.

After two weeks of tearing my hair out at the appalling speed, I scrapped that solution and spent one day throwing together a pipeline of classical shell tools (dd, sed, xxd, etc.), orchestrated with GNU Parallel, and it did the whole job in three hours.

[+] JPKab|4 years ago|reply
I'm not making any claims about dask being super reliable or easy to use, but let's not pretend that a spark cluster is a trivial thing to set up and deal with.

I've had to administer spark clusters in the past, and there's a good reason why databricks is a thing.

Another important detail is that Spark is dramatically overkill for the data sets Dask is meant to work on. Dask is really geared toward data sets that can't fit in memory, not necessarily the data sets that can't fit on disk that Spark was really meant for.

[+] ritoune|4 years ago|reply
Yes. I'm not a fan of Spark: dealing with the JVM, new syntax for everything, parallelism optimized in a weird way. But it always works.

Dask, on the other hand, works some of the time. The rest of the time it'll keep running a calculation forever, or simply fail silently over and over, or some other unpleasant outcome.

[+] faizshah|4 years ago|reply
I had a similar experience. I usually avoid the Dask dataframe where possible and instead use bag and dask.delayed. But it's hard to get scientists to give up the dataframe mindset. That's one of the reasons I wanted to try Modin; I've heard the dataframe in Modin is a bit easier to use.
[+] rjzzleep|4 years ago|reply
I was mainly looking at Dask as a better Pandas DataFrame, but it seems like it's actually not that much better, just more distributed. It's like Dask is trying to do too many things, and it doesn't do most of them well enough.

While I haven't tried it, vaex [1] seems much better, since it sets out to address a single pain point, the inefficiencies of the pandas DataFrame, and aims to do that one thing well.

[1] https://vaex.io/docs/index.html

[+] jjoonathan|4 years ago|reply
So much this. Dask does not remove the need to babysit workers and partitions, and that's the real difference between pandas and spark.
[+] turbocon|4 years ago|reply
It's been over a year now, but I had an identical experience.
[+] marcinzm|4 years ago|reply
Our experience with Dask in distributed mode has been really painful, to be honest. There are ways to shoot yourself in the foot that lead to random data corruption, like changing a dataframe index with map_partitions without resetting the index afterwards. Not failures, but successful runs with corrupted data.

Sometimes your job gets bottlenecked on a single worker for no apparent reason, so you cancel it, rewrite it in a slightly different way, and then it works fine. We had bugs in two of the last four versions that prevented our pipeline from running properly. If you have a dataframe of numpy arrays, unmanaged memory leaks, so you need to convert them to Python lists first; again, no warning from Dask about this. Getting things to run efficiently requires a lot of thought about partitions, the operations you'll be doing, what indexes you have, and so on.
[+] riedel|4 years ago|reply
We haven't seen any corruption yet, but there are definitely some strange things that aren't well documented. One thing that hit us was that Parquet could only be read with Dask if you used the same backend it was written with. If you start with thousands of files, partitioning is IMHO tricky, because using Dask naively will create task graphs that can be multiple gigabytes in size. We also had a lot of deadlock situations. This is where the apparent intuitiveness shoots you in the foot, because your average data scientist has no clue what to do.

Fortunately, many deadlocks have been fixed in one of the latest releases. Other than that, I have been really happy and thankful that Dask exists. One other major use case, however, is that my colleagues use it as a simple joblib backend for our HPC cluster. Even if it sounds stupid: Dask is the easiest way to get data scientists to use distributed computing resources. (Disclaimer: I got active in the dask-jobqueue project to get stuff fixed to run on HTCondor.)

[+] deshpand|4 years ago|reply
We use dask heavily, along with the rest of the PyData ecosystem. I guess we are in the "sweet spot": the data doesn't fit in memory to begin with, but once we perform any filtering and aggregations, we switch over to pandas. That's exactly what Dask recommends, too. Our datasets don't exceed 100 GB right now.

Also note, dask clearly acknowledges challenges dealing with data in the terabyte range https://coiled.io/blog/dask-as-a-spark-replacement/

Most of our use cases right now involve using multiple cores of a big instance rather than resorting to cluster computing.

With Spark, there is an additional, steep learning curve and the complexity of dealing with cluster computing. And Spark ML is not well known. With dask/pandas it's easy enough to feed scikit-learn and/or bring in dask-ml (just a pip install), and you can scale well-known sklearn modules effectively.

I think in the end, it's about keeping things simple. As others said, if you are already invested big in Spark/Scala/Hadoop, that may make sense for you. For non-CS folks, this will be a challenge.

As for vaex, it's very interesting. One issue is that it seems to want HDF5 and doesn't want to work with Parquet. And its API is not fully compatible with pandas.

Ray/Modin: played with it a bit and maybe it's a bit too new for enterprise uses and may be more geared for ML workloads. That's my take anyway and it may have progressed substantially, already.

[+] linkdd|4 years ago|reply
Dask and especially the dashboard is a very nice piece of technology.

But "naïve programming" (the art of writing a dumb algorithm with no regards to performance or optimizations) can bite you hard: https://docs.dask.org/en/latest/best-practices.html#avoid-ve...

I implemented an algorithm that was slower than the non-parallelized version this way :D
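The linked best-practices trap can be made concrete. In this sketch (made-up workload) both versions compute the same number, but the first creates one task per element, which is exactly the overhead pattern that can make Dask slower than plain Python:

```python
import dask

@dask.delayed
def inc(x):
    return x + 1

# Anti-pattern: 1000 tiny tasks; scheduling overhead dominates the work.
too_fine = dask.delayed(sum)([inc(i) for i in range(1000)])

@dask.delayed
def inc_and_sum(batch):
    # Better: one task per sizable chunk of work.
    return sum(x + 1 for x in batch)

batches = [list(range(i, i + 250)) for i in range(0, 1000, 250)]
coarse = dask.delayed(sum)([inc_and_sum(b) for b in batches])

print(too_fine.compute(), coarse.compute())  # 500500 500500
```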

[+] otherme123|4 years ago|reply
An anecdote that can't be extrapolated: I did a course on Dask a few years ago (Dask was fresh then). They taught us how Dask would improve our calculations, to the point of almost promising, no questions asked, that "if you chop your dataframe into 8 partitions, your times will be divided by 8."

Then they gave us a sample dataframe to analyze, first with Pandas and then with Dask, and compare. The times not only didn't improve or even equal Pandas: they got worse by more than 10x. The teachers looked at the code with a mix of disbelief and horror, and only then did they start to talk about tradeoffs, overheads, data fitting in memory, etc.

[+] lettergram|4 years ago|reply
Having used Ray, Dask, and hand-written custom threads, my personal view is that while there are advantages, I wouldn't want to use any of these unless absolutely necessary.

My personal approach for most of these tasks is to try to break the problem down to be as asynchronous as possible. Then you can create threads.

The nice thing about Dask is really the way you can use it effectively on a pandas dataframe, without any large overhead.

Having said that, we opted to write our own parallelization for this library:

https://github.com/capitalone/DataProfiler

as opposed to using the Dask dataframe. Effectively, the dataframe processing wasn't the bottleneck, and it was easier to maintain the threading ourselves given the particular approaches taken.

That said, if I were working with large pandas dataframes, I'd likely use Dask. For large datasets that couldn't be stored in memory, I'd use ray.io.

[+] physicsguy|4 years ago|reply
To me, Dask is the solution to the wrong question. Rearchitecting the data loading + processing is better than trying to force data frames into a distributed world and will perform a lot better too.
[+] lbrindze|4 years ago|reply
The integration with xarray [0] and the Pangeo [1] community at large is pretty good. Not good enough for sub-second responses on large datasets, but very good for analytic workloads, especially when you are dealing with larger-than-memory datasets.

If you want fast(ish) performance and can wait a few minutes or longer, it's a great tool. Climate scientists love it because it lets them focus on their problem domain instead of on developing parallel software.

If you are using it to magically speed up your API backend, you will probably be disappointed, but you'd be just as disappointed trying to use a jackhammer to hammer in a nail.

[0] https://xarray.pydata.org/en/stable/ [1] https://pangeo.io/

[+] a_square_peg|4 years ago|reply
I rely heavily on the Dask/Pangeo stack to serve time-series weather data via a REST API, primarily based on ERA5. You're correct: it won't give you sub-second responses, but it turns out this is more than sufficient for data analysis work.
[+] itamarst|4 years ago|reply
If you read through the comments you'll find some contradictory explanations of what Dask does, so as some background:

At its bottom layer, Dask takes a graph of tasks and dispatches them to a scheduler. Schedulers can range from a few threads, to a few processes (using multiprocessing), to a few processes on the local machine (with the Distributed scheduler), to tens of thousands of machines.

That latter example is not made up, I've seen geoscience demos doing that in 3 lines of code on one of their giant clusters.

The task graph can be built from generic Python functions, using e.g. the bag API, which gives you a map-reduce-like API: https://docs.dask.org/en/latest/bag.html

Separately, and optionally, Dask also has emulation layers for NumPy, Pandas, and (I believe as a separate project) scikit-learn. Here you write code that looks like normal Pandas code, say, but instead of executing immediately it creates an execution graph; you then submit the graph to your scheduler, and it runs it for you. These emulation APIs are partial by their nature, but also optional.

One interesting use case for this emulation API is processing larger-than-memory datasets. In order to support multiple processes, and even more so multiple machines, Dask reimplements the NumPy and Pandas APIs using batching underneath. So you can run on a single machine and take advantage of that batching to process data that wouldn't fit in memory with the normal Pandas APIs, while also (optionally) getting access to multiple CPUs: https://pythonspeed.com/articles/faster-pandas-dask/

(If you specifically want the Pandas API, there are other alternatives mentioned here and there in the comments, e.g. Modin.)

My main personal experience was using the lower-level API to run image processing code in parallel, on a single machine with multiple worker processes. It worked great.

[+] jp0d|4 years ago|reply
Apologies for being pessimistic, but I've never had a reason to use Dask. I like Pandas only for very small sample datasets, nothing more than a few MBs. I deal with large amounts of alphanumeric data that can in no way fit on a single computer. Even with multiprocessing abilities, it's never really that useful, and pandas does the job. I have to use large clusters and Spark for any real datasets before I even think about using pandas. Spark has its quirks, but I'd argue it has a lot more features when it comes to distributed computing.
[+] cowsandmilk|4 years ago|reply
I don't think that's being pessimistic; I think it reflects Dask's documentation. Some items I looked up because I remembered them from the past:

> If your data fits comfortably in RAM and you are not performance bound, then using NumPy might be the right choice. Dask adds another layer of complexity which may get in the way.

from https://docs.dask.org/en/latest/array-best-practices.html#us...

> In many workloads it is common to use Dask to read in a large amount of data, reduce it down, and then iterate on a much smaller amount of data. For this latter stage on smaller data it may make sense to stop using Dask, and start using normal Python again.

https://docs.dask.org/en/latest/best-practices.html#stop-usi...

So, dask even says not to use it for small data sets. And in their comparison to spark, they say spark has a lot more features.

[+] nerdponx|4 years ago|reply
Data in the 1-15 GB range is more common than you'd think. Spark is overkill for it and comparatively a nightmare to configure. You can fit 15 GB in memory on a single machine nowadays, but you probably want things to move a little quicker.
[+] JPKab|4 years ago|reply
Spark is massive overkill for the vast majority of needs. Can't fit the data on a computer, like your use case? Yep, good for that.

Anything else, and it's not worth the overhead. Dask doesn't require a JVM cluster with executors and name nodes and hdfs and everything else that keeps databricks in business.

[+] rosej|4 years ago|reply
We have been using dask to support our computational pathology workflows [1], where the images are so big that they cannot be loaded in memory, let alone analyzed (standard pathology whole slide images are ~1GB; some microscopy techniques generate images >1TB). We divide each image into a bunch of smaller tiles and process each tile independently. The dask.distributed scheduler lets us scale up by distributing the tile processing across a cluster.

Benefits of dask.distributed: easy to get up and running, and has support for spinning up clusters on lots of different computing platforms (local machines, HPC cluster, k8s, etc.)

One difficulty is optimizing performance - there are so many configuration details (job size, number of workers, worker resources, etc. etc.) that it's been hard to know what is best.

[1] https://github.com/Dana-Farber-AIOS/pathml
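A stripped-down sketch of that tile-at-a-time pattern; the tiles and the `process_tile` body are placeholders for the real image-analysis code, and a LocalCluster stands in for the HPC/k8s clusters mentioned above:

```python
from dask.distributed import Client, LocalCluster

def process_tile(tile):
    # placeholder for per-tile segmentation / feature extraction
    return sum(tile) / len(tile)

if __name__ == "__main__":
    # Same code scales from a laptop LocalCluster to a real cluster by
    # swapping the cluster object.
    cluster = LocalCluster(n_workers=2, threads_per_worker=1)
    client = Client(cluster)

    tiles = [list(range(i, i + 4)) for i in range(0, 16, 4)]
    futures = client.map(process_tile, tiles)  # one task per tile
    print(client.gather(futures))              # [1.5, 5.5, 9.5, 13.5]

    client.close()
    cluster.close()
```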

[+] gsteinb88|4 years ago|reply
Just FYI, Amazon and most other providers offer instances with many TB of RAM; the largest listed on EC2 right now is 24 TB.

Depending on your workload, you may get far better throughput by not worrying about distributing the work and data.

[+] brutus1213|4 years ago|reply
I started looking at Dask recently... its scheduler/worker mechanism reminded me of the old Docker Swarm. I liked that!

But when I moved to a Kubernetes deployment (set up via the community-provided Helm charts), I was completely mystified and saw the slowdowns reported by others. At one point my physical worker node was using 1 core, and at other times it was using all cores, when according to the default config they provide it should have been 2 vCPUs. So, all in all, I was completely mystified about what was going on under the hood, which speaks to the lack of debuggability others have mentioned.

[+] Boxxed|4 years ago|reply
I would not recommend Dask. We use it just for simple job scheduling (that is, none of its fancy data structures) and run into issues just getting the work done efficiently. This issue, for instance, keeps the cluster from actually being utilized fully: https://github.com/dask/distributed/issues/4501. I feel like I'm on crazy pills, because it seems pretty serious yet it's gotten no attention.
[+] VHRanger|4 years ago|reply
FWIW, I'm moving towards [vaex](https://vaex.io/docs/) for out of core python stuff

Dask is well built but fundamentally flawed: it tacks pandas structures together with multiprocessing rather than building on a fundamentally efficient underlying structure.

[+] JPKab|4 years ago|reply
Vaex is amazing. I haven't used it in production but was blown away when I did some prototyping with it.
[+] orasis|4 years ago|reply
I love Dask for parallel computing because instead of threads or agents, you’re primarily using distributed data structures like distributed DataFrames, Series, and Bags.

It’s made our code very small, clean, and bug free.

[+] counters|4 years ago|reply
A lot of the commentary here focuses on comparisons with tabular/dataframe-based analyses, but one of the killer use cases for Dask is more generalized out-of-core array computing. It greatly simplifies many of the dimensional reductions and aggregations one might perform on n-dimensional data cubes, without the boilerplate and cognitive overhead of blocking out your calculation and keeping track of all the intermediates. Add the fact that you can accomplish the reduction out-of-core, and you've got an incredibly useful tool.

Is it the perfect tool for all applications? Absolutely not. But for many of my workloads as a geoscientist, it allows me to trivially graduate a toy/prototype code to something at "scale" (maybe not for production, but certainly for research) with very little effort. That frees me up to work on other problems.

[+] karjudev|4 years ago|reply
Used Dask for various research experiments and scripts, and it's very good when you have to handle larger-than-memory datasets. When it came to algorithm parallelization, though, some quirks emerged, and debugging drove me completely nuts. I love Dask, but I think the problem of parallelizing Python code is not solvable by Dask alone.
[+] ehsantn|4 years ago|reply
Dask's architecture is unable to handle large-scale data processing tasks (e.g. ETL, dataframe operations) because it's just a distributed task scheduler. It works as long as tasks have very little communication between them. But data processing tasks like table joins need heavy communication (e.g. shuffles), which requires a true parallel architecture like MPI to be done efficiently. Almost all of Dask's problems mentioned here go back to this issue.

Bodo is a new compute engine that brings true parallel computing with MPI to data processing. Bodo is over 100x faster than Dask for large-scale data processing. Forget about speed: are you willing to pay AWS/Azure/GCP 100 times more than necessary (while also increasing your carbon footprint)?

Bodo uses a new inferential JIT compiler technology that takes some getting used to, but it handles the actual Pandas APIs (not "Pandas-like" ones).

Disclaimer: I work for Bodo.

[+] jmakov|4 years ago|reply
Recently discovered ray.io. Beats multiprocessing and can integrate with dask.
[+] gjvc|4 years ago|reply
YMMV. Dask with Ray is 10% slower than Dask with multiprocessing on my workload.
[+] spicyramen|4 years ago|reply
What's your experience with Ray?
[+] kzrdude|4 years ago|reply
Dask is useful on a single laptop as well: it can turn an xarray computation that's too large for memory into a chunked one that just crunches through, with the same code.
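A minimal version of that same-code-but-chunked idea; the array here is a small synthetic stand-in for a too-large-for-memory dataset:

```python
import numpy as np
import xarray as xr

# The same xarray code, but after .chunk() the DataArray is backed by a
# lazy dask array and the mean is computed chunk by chunk instead of
# loading everything at once.
da = xr.DataArray(
    np.arange(1_000_000, dtype="float64").reshape(1000, 1000),
    dims=("x", "y"),
)
chunked = da.chunk({"x": 100})  # dask-backed, 10 chunks along x
print(float(chunked.mean()))    # 499999.5, computed chunkwise
```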
[+] gjvc|4 years ago|reply
modin + ray would have been easier and not required a daskification of the code, because Modin's stated aim is complete compatibility with the pandas API. Still, I've managed to limit it to one module in the project, and I'm planning to try modin + ray vs. dask to see which is faster for this dataset.