item 40556687

cuDF – GPU DataFrame Library

107 points | tosh | 1 year ago | github.com

29 comments

[+] ashvardanian|1 year ago|reply
cuDF is the most impressive DataFrame implementation I've seen and have been recommending for years. The API is exceptionally close to Pandas (just a couple of different function arguments here and there), much more so than PyArrow or Modin. Throughput and energy efficiency were often 10x that of PyArrow running on a comparable TDP SotA CPU 2 years ago [1].

[1]: https://www.unum.cloud/blog/2022-09-20-pandas
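To illustrate the API-closeness point, here is a minimal sketch in plain pandas (toy data, made up for illustration); with RAPIDS installed, swapping the import for `cudf` is typically the only change needed for code like this:

```python
import pandas as pd  # with RAPIDS installed: `import cudf as pd` is often the only edit

# A toy groupby/aggregation chain; the same calls run unmodified on cuDF.
df = pd.DataFrame({"key": ["a", "b", "a", "b"], "val": [1, 2, 3, 4]})
totals = df.groupby("key")["val"].sum()
print(totals.to_dict())  # {'a': 4, 'b': 6}
```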

[+] fbdab103|1 year ago|reply
Limited by VRAM is a huge constraint for me. Even if it is slower, being able to load 100GB+ into RAM without any batching headaches is worth a lot.

Unless cudf has implemented some clever dask+cudf kind of situation which can intelligently push data in/out of GPU as required?
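For context on what "batching headaches" mean in practice, here is a minimal CPU-side sketch of manual chunked processing (pure Python, hypothetical `chunk_size`); this is the bookkeeping that an out-of-core layer like dask is meant to handle for you:

```python
def chunked_sum(values, chunk_size=1000):
    """Aggregate a sequence too large to hold at once by
    streaming it through fixed-size chunks (manual batching)."""
    total = 0
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]  # only this slice is "resident"
        total += sum(chunk)
    return total

print(chunked_sum(list(range(10_000)), chunk_size=1024))  # 49995000
```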

[+] xrd|1 year ago|reply
Does this accelerate on an M1? I know this says it is for cuda and that obviously means Nvidia GPU, but lots of ML projects have a port to Apple silicon. I would love to try this on my Mac and see what kind of acceleration my pandas tools get for free.
[+] zamalek|1 year ago|reply
I wish we could commit to not conflating NVIDIA with GPU. It wouldn't hurt a soul to call it "cuDF - NVIDIA DataFrame Library." To answer your question, it will probably run on the CPU.
[+] killingtime74|1 year ago|reply
From the readme, no. Says NVIDIA drivers required
[+] mgt19937|1 year ago|reply
How does this compare to duckdb/polars? I wonder if a GPU-based compute engine is a good idea. GPU memory is expensive and limited, and the bandwidth between the GPU and main memory isn't very high either.
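A rough back-of-the-envelope on the bandwidth point (illustrative ballpark figures, not measurements: ~32 GB/s for a PCIe 4.0 x16 link vs. on-GPU HBM in the hundreds of GB/s and up):

```python
# Rough transfer-time estimate for moving a DataFrame to the GPU.
# All numbers below are ballpark assumptions for illustration.
data_gb = 10                    # size of the DataFrame, GB
pcie4_x16_gbps = 32             # ~theoretical PCIe 4.0 x16 bandwidth, GB/s
hbm_gbps = 1000                 # ~order-of-magnitude HBM bandwidth, GB/s

transfer_s = data_gb / pcie4_x16_gbps  # one host -> device copy
scan_s = data_gb / hbm_gbps            # one full scan once resident
print(f"PCIe copy: ~{transfer_s}s, on-GPU scan: ~{scan_s}s")
```

The host-to-device copy dominates a single on-GPU pass by a wide margin, which is why the economics favor workloads that reuse data on the GPU many times after one transfer.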
[+] pbib|1 year ago|reply
The same group (Nvidia/Rapids) is working on a similar project but with Polars API compatibility instead of Pandas. It seems to be quite far from completion, though.

See discussion: https://news.ycombinator.com/item?id=39930846

[+] xs83|1 year ago|reply
This and Rapids.ai are the single biggest reason that NVIDIA is the leader in AI.

They made GPU processing at scale accessible to everyone. I have been a long-term user of Rapids and found that, even as a data engineer, I can do things on an old consumer GPU that would otherwise require a 20+ node cluster to do in the same time.

[+] __mharrison__|1 year ago|reply
Even though other tools might be "better" than Pandas, its ubiquity is why I suggest it.

The ability to run code 100-1000x faster with this is just icing on the cake.

(I've run through this with most of my Pandas training material and it just works with no code changes.)

[+] mafro|1 year ago|reply
Out of interest, what other tools might be "better" than Pandas?

I like pandas, and python.

[+] skenderbeu|1 year ago|reply
Is this GPU thing some kind of magic?

We have like 12 different types of it in the wild. I think it's time we came up with one or two GPU hardware standards, similar to what we have for CPUs.

[+] CarRamrod|1 year ago|reply
Is there something comparable to this for Matplotlib?
[+] mwexler|1 year ago|reply
This is actually a good callout. While pipeline speedups for transforms are hugely important, lots of other fundamental Python tools for viz, model examination, etc. are built on different foundations and don't benefit from pandas improvements.

I get it: some of these are legacy, and others are hand-optimized Python because default pandas is so slow. But I'm hoping that, over time, we'll improve the runtime of the other stages of analysis too.