This is super useful. It takes so much boilerplate code to run async calls and gather the results. I have been using tqdm.gather() so I am glad to see this library supports it.
Thanks! I originally built this to scratch an itch I had, so I’m really glad you find it useful too. If you have any ideas for improvements or missing features, feel free to suggest them — or even open a PR!
These are two different paradigms. aiopandas is not trying to offload pandas work somewhere else to prevent it from blocking synchronous code, it's trying to let you apply asynchronous functions to pandas operations concurrently while running on the event loop inside of other async code.
Indeed... the longer I write Python, the more I just try to solve stuff with a simple ThreadPoolExecutor.
This is a very clean API and I really like the way you implemented it directly in Pandas. I worked on something similar 2 years back but the API was not as clean as this one. Thanks a lot for making this.
Thank you for the input! To be honest, I don’t use Dask often, and as a regular Pandas user, I don’t feel the most qualified to comment—but here we go.
gardnr|11 months ago
eneuman|11 months ago
refactor_master|11 months ago
Why not just something like this?
dkh|11 months ago
That said, this is mostly just going to be helpful if you're running pandas operations that call an external API on each iteration or something, and the actual pandas part of the work is still going to be CPU-bound and block. I am also not a huge fan of the monkey-patching approach. But it's clever and will definitely be useful to folks doing a very specific kind of work.
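To make the "external API call on each iteration" case concrete, here is a minimal sketch using only the standard library. It is not aiopandas code: `fetch_label` and `apply_concurrently` are hypothetical names, the sleep stands in for a network request, and a plain list stands in for a column you might get from `df["col"].tolist()`.

```python
import asyncio


async def fetch_label(value: str) -> str:
    """Hypothetical stand-in for an external API call (not part of aiopandas)."""
    await asyncio.sleep(0.01)  # simulates network latency
    return value.upper()


async def apply_concurrently(values, func, max_concurrency=10):
    """Await func(v) for every v, at most max_concurrency at a time, keeping input order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(v):
        # The semaphore caps how many calls are in flight at once.
        async with semaphore:
            return await func(v)

    # gather returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*(bounded(v) for v in values))


column = ["a", "b", "c"]  # stand-in for df["col"].tolist()
results = asyncio.run(apply_concurrently(column, fetch_label))
```

The waits overlap, so total time is closer to one call than to the sum of all calls. Any pandas work before or after the gather still runs on the event loop thread and blocks it.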
isoprophlex|11 months ago
I think doing this is not the best choice for CPU-bound work, which is likely what you're running into with pandas, but nevertheless... I like how you can almost always slap a threadpool onto something and speed things up, with minimal cognitive overhead.
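For comparison, a rough sketch of the thread pool approach with the standard library. `slow_lookup` is a hypothetical blocking function (imagine a synchronous HTTP request), and the list again stands in for a pandas column.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_lookup(value: int) -> int:
    """Hypothetical blocking call, e.g. a synchronous HTTP request."""
    time.sleep(0.01)  # simulates waiting on I/O
    return value * 2


values = [1, 2, 3, 4]  # stand-in for df["col"].tolist()

with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map yields results in input order, even if calls finish out of order.
    doubled = list(pool.map(slow_lookup, values))
```

This helps when the function spends its time waiting on I/O. For CPU-bound pure-Python work the threads mostly take turns because of the GIL, which matches the caveat above.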
napsternxg|11 months ago
eneuman|11 months ago
If you have any ideas for improvements or missing features, or run into any issues, don't hesitate to share!
westurner|11 months ago
Pandas does not currently install tqdm by default.
pandas-dev/pandas//pyproject.toml [project.optional-dependencies] https://github.com/pandas-dev/pandas/blob/8943c97c597677ae98...
Dask solves for various adjacent problems; IDK if pandas, dask, or dask-cudf would be faster with async?
Dask docs > Scheduling > Dask Distributed (local) https://docs.dask.org/en/stable/scheduling.html#dask-distrib... :
> Asynchronous Futures API
Dask docs > Deploy Dask Clusters; local multiprocessing pool, k8s (docker desktop, podman-desktop,), public and private clouds, dask-jobqueue (SLURM,), dask-mpi: https://docs.dask.org/en/stable/deploying.html#deploy-dask-c...
Dask docs > Dask DataFrame: https://docs.dask.org/en/stable/dataframe.html :
> Dask DataFrames are a collection of many pandas DataFrames.
> The API is the same. The execution is the same.
> [concurrent.futures and/or @dask.delayed]
tqdm.dask: https://tqdm.github.io/docs/dask/#tqdmdask .. tests/tests_pandas.py: https://github.com/tqdm/tqdm/blob/master/tests/tests_pandas.... , tests/tests_dask.py: https://github.com/tqdm/tqdm/blob/master/tests/tests_dask.py
tqdm with dask.distributed: https://github.com/tqdm/tqdm/issues/1230#issuecomment-222379... , not yet a PR: https://github.com/tqdm/tqdm/issues/278#issuecomment-5070062...
dask.diagnostics.progress: https://docs.dask.org/en/stable/diagnostics-local.html#progr...
dask.distributed.progress: https://docs.dask.org/en/stable/diagnostics-distributed.html...
dask-labextension runs in JupyterLab and has a parallel plot visualization of the dask task graph and progress through it: https://github.com/dask/dask-labextension
dask-jobqueue docs > Interactive Use > Viewing the Dask Dashboard: https://jobqueue.dask.org/en/latest/clusters-interactive.htm...
https://examples.dask.org/ > "Embarrassingly parallel Workloads" tutorial re: "three different ways of doing this with Dask: dask.delayed, concurrent.Futures, dask.bag": https://examples.dask.org/applications/embarrassingly-parall...
eneuman|11 months ago
Can this be merged into Pandas?
I’d be honored if something I built got incorporated into Pandas! That said, keeping aiopandas as a standalone package has the advantage of working with older Pandas versions, which is useful for workflows where upgrading isn’t feasible. I also can’t speak to the downstream implications of adding this directly into Pandas.
Pandas does not install tqdm by default.
That makes sense, and aiopandas doesn’t require tqdm either. You can pass any class with __init__, update, and close methods as the tqdm argument, and it will work the same. Keeping dependencies minimal helps avoid unnecessary breakage.
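A minimal sketch of what such a tqdm-like class could look like. The constructor arguments are an assumption (tqdm-style keywords such as `total`), since the exact arguments aiopandas passes are not stated here. `CountingProgress` and `process_all` are hypothetical names, and `process_all` is only an illustrative driver, not aiopandas code.

```python
class CountingProgress:
    """Minimal tqdm-like object: a constructor plus update() and close()."""

    def __init__(self, total=None, **kwargs):
        # Assumption: the caller passes tqdm-style keyword arguments such as total.
        self.total = total
        self.count = 0
        self.closed = False

    def update(self, n=1):
        # Record that n more items have finished.
        self.count += n

    def close(self):
        self.closed = True


def process_all(items, progress_cls):
    """Hypothetical driver showing how a caller could use such a class."""
    progress = progress_cls(total=len(items))
    results = []
    try:
        for item in items:
            results.append(item * 2)
            progress.update(1)
    finally:
        # Always close the progress object, even if processing fails.
        progress.close()
    return results, progress


results, progress = process_all([1, 2, 3], CountingProgress)
```

Accepting `**kwargs` in the constructor keeps the class tolerant of extra tqdm-style options it does not use.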
What about Dask?
I’m not a regular Dask user, so I can’t comment much on its internals. Dask already supports async coroutines (Dask Async API), but for simple async API calls or LLM requests, aiopandas is meant to be a lightweight extension of Pandas rather than a full-scale parallelization framework. If you’re already using Dask, it probably covers most of what you need, but if you’re just looking to add async support to Pandas without additional complexity, aiopandas might be a more lightweight option.