The idea that this is a drop-in replacement for NumPy (e.g., `import cupy as np`) is quite nice, though I've gotten similar benefit out of using `pytorch` for this purpose. It's a very popular and well-supported library with a syntax that's similar to NumPy.
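For illustration, the drop-in pattern usually looks something like this (a minimal sketch; the `xp` alias is just a convention, and the fallback assumes NumPy is installed):

```python
# Bind one module alias to either CuPy (GPU) or NumPy (CPU) and write
# the rest of the code against that alias.
try:
    import cupy as xp
    xp.cuda.runtime.getDeviceCount()  # raises if no CUDA device is present
except Exception:
    import numpy as xp  # CPU fallback

a = xp.arange(6).reshape(2, 3)
b = (a * 2).sum(axis=0)  # same call works on both backends
```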
However, the AMD-GPU compatibility for CuPy is quite an attractive feature.
With the Array API standard (https://data-apis.org/array-api/), it's possible to write array API code that consumes arrays from any of those libraries and delegates computation to them without having to explicitly import any of them in your source code.
The only limitation for now is that PyTorch's (and to a lesser extent CuPy's) array API compliance is still incomplete, and in practice one needs to go through this compatibility layer (hopefully temporarily): https://data-apis.org/array-api-compat/
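A sketch of what such backend-agnostic code can look like (the `get_namespace` helper here is a simplified stand-in for `array_api_compat.get_namespace`, which real code should prefer):

```python
import numpy as np

def get_namespace(x):
    # Arrays from libraries implementing the array API standard expose
    # __array_namespace__(); fall back to NumPy for plain ndarrays on
    # older NumPy versions.
    if hasattr(x, "__array_namespace__"):
        return x.__array_namespace__()
    return np

def standardize(x):
    # Works on NumPy, CuPy, PyTorch, ... arrays without importing any
    # GPU library here.
    xp = get_namespace(x)
    return (x - xp.mean(x)) / xp.std(x)
```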
One could also use `import jax.numpy as jnp`. All those libraries have more or less complete implementations of NumPy and SciPy functionality (I believe CuPy has the most functions, especially when it comes to SciPy).
Also: you can just mix and match all those functions and tensors thanks to the `__cuda_array_interface__`.
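For intuition, `__cuda_array_interface__` is the GPU analogue of NumPy's CPU-side `__array_interface__` protocol, sketched below: any object exposing the interface dict can be wrapped without copying (the `WrapsBuffer` class is a made-up example):

```python
import numpy as np

class WrapsBuffer:
    # Any object exposing __array_interface__ (or, on GPU,
    # __cuda_array_interface__) can be consumed by compatible libraries
    # without copying the underlying memory.
    def __init__(self, arr):
        self._arr = arr  # keep the memory owner alive
        self.__array_interface__ = arr.__array_interface__

buf = WrapsBuffer(np.arange(4.0))
view = np.asarray(buf)  # zero-copy view created via the protocol
```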
As nice as it is to have a drop-in replacement, most of the cost of GPU computing is moving memory around. I wouldn't be surprised if this catches unsuspecting programmers in a few performance traps.
It only supports AMD cards supported by ROCm, which is quite a limited set.
I know you can enable ROCm for other hardware as well, but it's not supported and quite hit-or-miss. I've had limited success running stuff against ROCm on unsupported cards, mainly having issues with memory management, IIRC.
I'm supposed to finish my undergraduate degree with an internship at the Italian national research center, where I'll have to use PyTorch to turn ML models from paper into code. I've tried looking at the tutorial, but I feel like there's a lot going on to grasp. Until now I've only used NumPy (and pandas in combination with NumPy). I'm quite excited, but a bit on edge because I can't know whether I'll be up to the task or not.
I've recently had to implement a few kernels to lower the memory footprint and runtime of some PyTorch functions: it's been really nice, because Numba kernels have type-hint support (as opposed to raw CuPy kernels).
When building something that I want to run on both CPU and GPU, depending on what's available, I've found it much easier to use PyTorch than some combination of NumPy and CuPy. I don't have to fiddle around with globally replacing numpy.* with cupy.*, and PyTorch has very nearly all the functions that those libraries have.
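A minimal sketch of that pattern in PyTorch (assuming only that `torch` is installed; the device string is chosen once and everything else follows the tensors):

```python
import torch

# Pick the device once; subsequent ops run wherever the tensors live,
# so the same code path covers CPU and GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.linspace(0.0, 1.0, 5, device=device)
y = torch.sin(x) ** 2 + torch.cos(x) ** 2  # trig identity, for illustration
```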
Interesting. Any links to examples or docs on how to use PyTorch as a general linear algebra library for this purpose? Like a “SciPy to PyTorch” transition guide if I want to do the same?
CuPy was first, but at this point you're better off using JAX.
Has a much larger community, a big push from Google Research, and unlike PFN's Chainer (of which CuPy is the computational base), is not semi-abandoned.
Kind of sad to see the CuPy/Chainer ecosystem die: not only did they pioneer the PyTorch programming model, but they also stuck to the NumPy API like JAX does (though the AD is layered on top in Chainer, IIRC).
"This is a research project, not an official Google product. Expect bugs and sharp edges. Please help by trying it out, reporting bugs, and letting us know what you think!"
That disclaimer is in its README. This is quite scary, especially coming from Google, which is known to abandon projects out of the blue.
I tried Jax last year and was not impressed. Maybe it was just me, but everything I tried to do (especially with the autograd stuff) involved huge compilation times and I simply could not get the promised performance. Maybe I’ll try again and read some new documentation, since everyone is excited about it.
I taught my numpy class to a client who wanted to use GPUs. Installation (at that time) was a chore but afterwards it was really smooth using this library. Big gains with minimal to no code changes.
As an aside: I was trying to install CuPy the other day and was having issues.
Open projects on GitHub often (at least superficially) require specific versions of the CUDA Toolkit (and all the specialty NVIDIA packages, e.g. cuDNN), TensorFlow, etc., and changing the default versions of these for each little project, or step in a processing chain, is ridiculous.
pyenv et al. have really made local, project-specific versions of Python packages much easier to manage. But I haven't seen a similar kind of solution for the CUDA Toolkit and associated packages, and the solutions I've encountered seem terribly hacky. I'm sure this is a common issue, though, so what do people do?
As a maintainer of CuPy and also as a user of several GPU-powered Python libraries, I empathize with the frustrations and difficulties here. Indeed, one thing CuPy values is to make the installation step as easy and universal as possible. We strive to keep the binary package footprint small (currently less than 100 MiB), keep dependencies to a minimum, support wide variety of platforms including Windows and aarch64, and do not require a specific CUDA Toolkit version.
If anyone reading this message has encountered a roadblock while installing CuPy, please reach out. I'd be glad to help you.
One way to do it is to explicitly add the link to, say, the PyTorch+CUDA wheel from the PyTorch repos in your requirements.txt instead of using the normal PyPI package. Which also sucks, because you then have to do some other tweaks to make your requirements.txt portable across different platforms...
(And you can't just add another index for pip to look at if you want to use python build, so it has to be explicitly linked to the right wheel, which absolutely sucks, especially since you cannot get the CUDA version from PyPI.)
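For concreteness, a direct-link pin in requirements.txt looks roughly like this (the wheel filename below is hypothetical and must match your exact platform, Python version, and CUDA version):

```
# requirements.txt (illustrative only; wheel filename is hypothetical)
torch @ https://download.pytorch.org/whl/cu121/torch-2.1.2%2Bcu121-cp311-cp311-linux_x86_64.whl
```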
- JAX Windows support is lacking
- CuPy is much closer to CUDA than JAX, so you can get better performance
- CuPy is generally more mature than JAX (fewer bugs)
- CuPy is more flexible thanks to cp.RawKernel
- (For those familiar with NumPy) CuPy is closer to NumPy than jax.numpy
But CuPy does not support automatic gradient computation, so if you do deep learning, use JAX instead. Or PyTorch, if you do not trust Google to maintain a project for a prolonged period of time https://killedbygoogle.com/
Real answer: CuPy has a name that is very similar to SciPy. I don’t know GPU, that’s why I’m using this sort of library, haha. The branding for CuPy makes it obvious. Is Jax the same thing, but implemented better somehow?
I’ve been using CuPy a bit and found it to be excellent.
It’s very easy to replace some slow NumPy/SciPy calls with appropriate CuPy calls, with sometimes literally a 1000x performance boost from like 10min work. It’s also easy to write “hybrid code” where you can switch between NumPy and CuPy depending on what’s available.
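That hybrid pattern can be sketched like this (with only NumPy installed it falls back transparently; `to_host` is a hypothetical helper name):

```python
# Use CuPy when a CUDA device is available, otherwise NumPy, and keep a
# single code path for the actual computation.
try:
    import cupy as xp
    xp.cuda.runtime.getDeviceCount()  # raises if no CUDA device is present

    def to_host(a):
        return xp.asnumpy(a)  # copy GPU result back to host memory
except Exception:
    import numpy as xp

    def to_host(a):
        return a  # already on the host

x = xp.linspace(0.0, 1.0, 101)
total = float(to_host((x ** 2).sum()))
```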
As good a place as any to ask, I guess. Do any of these GPU libraries have a BiCGStab (or similar) that handles multiple right-hand sides? CuPy seems to have GMRES, which would be fine, but as far as I can tell it only does one right-hand side.
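For reference, the usual workaround when a solver only accepts one right-hand side is to loop over the columns of B. A CPU sketch with SciPy's `bicgstab` (assumed available here; the same looping pattern would apply to CuPy's GMRES on the GPU):

```python
import numpy as np
from scipy.sparse.linalg import bicgstab

# Krylov solvers like bicgstab take a single RHS, so solve AX = B one
# column at a time. (Small dense diagonally dominant matrix, just for
# illustration; in practice A would be a large sparse matrix.)
A = np.diag([4.0, 3.0, 2.0]) + 0.1
B = np.eye(3)
X = np.column_stack(
    [bicgstab(A, B[:, j], atol=1e-12)[0] for j in range(B.shape[1])]
)
```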
Is anyone aware of a pandas-like library that is based on something like CuPy instead of NumPy? It would be great to have the ease of use of pandas with the parallelism unlocked by GPUs.
We are fans! We mostly use cudf/cuml/cugraph (GPU dataframes etc) in the pygraphistry ecosystem, and when things get a bit tricky, cupy is one of the main escape hatches
gjstein | 1 year ago
ogrisel | 1 year ago
https://data-apis.org/array-api/
https://data-apis.org/array-api-compat/
KeplerBoy | 1 year ago
Narhem | 1 year ago
BiteCode_dev | 1 year ago
paperplatter | 1 year ago
WCSTombs | 1 year ago
Last I checked (a couple months ago) it wasn't quite there, but I totally agree in principle. I've not gotten it to work on my Radeons yet.
sspiff | 1 year ago
hedgehog | 1 year ago
amarcheschi | 1 year ago
curvilinear_m | 1 year ago
killingtime74 | 1 year ago
meisel | 1 year ago
setopt | 1 year ago
johndough | 1 year ago
And I recently learned that CuPy has a JIT compiler now if you prefer Python syntax over C++. https://docs.cupy.dev/en/stable/user_guide/kernel.html#jit-k...
einpoklum | 1 year ago
In Python? Perhaps. Generally? No, it isn't. Try: https://github.com/eyalroz/cuda-api-wrappers/
Full power of the CUDA APIs including all runtime compilation options etc.
(Yes, I wrote that...)
aterrel-nvidia | 1 year ago
Would love to get any feedback from the community.
towerwoerjrrwer | 1 year ago
WanderPanda | 1 year ago
low_tech_love | 1 year ago
kmaehashi | 1 year ago
p1esk | 1 year ago
__mharrison__ | 1 year ago
SubiculumCode | 1 year ago
kmaehashi | 1 year ago
ttyprintk | 1 year ago
NB I’m using Miniforge.
mardifoufs | 1 year ago
welder | 1 year ago
coeneedell | 1 year ago
m_d_ | 1 year ago
whimsicalism | 1 year ago
sdenton4 | 1 year ago
johndough | 1 year ago
- JAX Windows support is lacking
bee_rider | 1 year ago
palmy | 1 year ago
Cool to see that it's still kicking!
setopt | 1 year ago
glial | 1 year ago
markkitti | 1 year ago
einpoklum | 1 year ago
adancalderon | 1 year ago
whimsicalism | 1 year ago
JAX, PyTorch, vanilla TF, Triton: they just don't cut it.
bee_rider | 1 year ago
hamilyon2 | 1 year ago
kunalgupta022 | 1 year ago
kmaehashi | 1 year ago
Scene_Cast2 | 1 year ago
lokimedes | 1 year ago
lmeyerov | 1 year ago