cuDF is the most impressive DataFrame implementation I've seen, and one I've been recommending for years. The API is exceptionally close to pandas (just a couple of different function arguments here and there), much more so than PyArrow or Modin. Two years ago, throughput and energy efficiency were often 10x that of PyArrow running on a state-of-the-art CPU with comparable TDP [1].
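To illustrate how close the APIs are, here is a minimal sketch (hypothetical column names; the try/except fallback is mine, so the snippet also runs on machines without a GPU):

```python
# The same groupby code runs under cuDF or pandas; only the import differs.
# Hedged sketch: falls back to pandas when cuDF is not installed.
try:
    import cudf as pd  # GPU-backed, near drop-in for pandas
except ImportError:
    import pandas as pd

df = pd.DataFrame({"key": ["a", "b", "a", "b"], "val": [1, 2, 3, 4]})
out = df.groupby("key").val.sum()
print(out.to_dict())  # {'a': 4, 'b': 6}
```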
Does this accelerate on an M1? I know this says it is for cuda and that obviously means Nvidia GPU, but lots of ML projects have a port to Apple silicon. I would love to try this on my Mac and see what kind of acceleration my pandas tools get for free.
I wish we could commit to not conflating NVIDIA with GPU. It wouldn't hurt a soul to call it "cuDF - NVIDIA DataFrame Library." To answer your question, it will probably run on the CPU.
How does this compare to duckdb/polars? I wonder if a GPU-based compute engine is a good idea. GPU memory is expensive and limited, and the bandwidth between the GPU and main memory isn't great either.
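To put rough numbers on the transfer concern, a back-of-envelope calculation (the ~32 GB/s figure is my assumption for PCIe 4.0 x16; real hosts vary):

```python
# Back-of-envelope cost of shipping a DataFrame over the host-GPU bus.
# Assumed bandwidth figure; actual sustained throughput is typically lower.
PCIE4_X16_GBPS = 32.0  # GB/s, nominal PCIe 4.0 x16

def transfer_seconds(dataset_gb: float, bandwidth_gbps: float = PCIE4_X16_GBPS) -> float:
    """Seconds to move `dataset_gb` gigabytes over the bus one way."""
    return dataset_gb / bandwidth_gbps

# A 10 GB table costs ~0.31 s per one-way copy: cheap if you then run many
# GPU operations on it, expensive if you bounce data per operation.
print(transfer_seconds(10.0))  # 0.3125
```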
The same group (Nvidia/Rapids) is working on a similar project but with Polars API compatibility instead of Pandas. It seems to be quite far from completion, though.
This and Rapids.ai is the single reason that NVIDIA is the leader in AI.
They made GPU processing at scale accessible to everyone, I have been a long term user of Rapids and found that even as a data engineer I can do things on an old consumer GPU that would otherwise require a 20+ node cluster to do in the same time.
This is actually a good callout. While speeding up transform pipelines is hugely important, lots of other fundamental Python tools for viz, model examination, etc. are built on a different foundation and don't benefit from pandas improvements.
I get it: some of these are legacy, others are hand-optimized Python since default pandas is so slow. But I'm hoping that, over time, we'll improve the runtime of the other stages of analysis too.
ashvardanian | 1 year ago
[1]: https://www.unum.cloud/blog/2022-09-20-pandas
fbdab103 | 1 year ago
Unless cudf has implemented some clever dask+cudf kind of situation which can intelligently push data in/out of GPU as required?
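The idea behind such out-of-core schemes can be sketched in plain Python (a toy stand-in, no GPU involved): partition the data so that only one chunk has to be resident in device memory at a time, reduce each chunk, then combine the partial results.

```python
# Toy out-of-core aggregation: only one partition is "resident" at a time,
# mimicking how dask-style engines stream chunks through limited GPU memory.
from itertools import islice

def chunked(iterable, size):
    """Yield successive lists of up to `size` items."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

def out_of_core_sum(values, chunk_size=4):
    total = 0
    for chunk in chunked(values, chunk_size):  # "copy partition to device"
        total += sum(chunk)                    # reduce on "device"
        # the partition goes out of scope here, freeing "device memory"
    return total

print(out_of_core_sum(range(10)))  # 45
```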
xrd | 1 year ago
zamalek | 1 year ago
kiratp | 1 year ago
https://developer.apple.com/metal/jax/
And MLX
https://github.com/ml-explore/mlx
killingtime74 | 1 year ago
mgt19937 | 1 year ago
pbib | 1 year ago
See discussion: https://news.ycombinator.com/item?id=39930846
xs83 | 1 year ago
__mharrison__ | 1 year ago
The ability to run code 100-1000x faster with this is just icing on the cake.
(I've run through this with most of my Pandas training material and it just works with no code changes.)
mafro | 1 year ago
I like pandas, and python.
skenderbeu | 1 year ago
We have like 12 different types of it in the wild. I think it's time we came up with one or two GPU hardware standards, similar to what we have for CPUs.
CarRamrod | 1 year ago
mwexler | 1 year ago
hack_ml | 1 year ago
HoloViews, hvPlot, Datashader, Plotly, Bokeh, Seaborn, Panel, PyDeck, cuxfilter, node-RAPIDS