top | item 41194996

wackycat | 1 year ago

I have limited experience with CUDA but will this help solve the CUDA/CUDNN dependency version nightmare that comes with running various ML libraries like tensorflow or onnx?

bstockton | 1 year ago

In my experience, over 10 years of building models with libraries that use CUDA under the hood, this problem has nearly gone away in the past few years. Setting up CUDA on new machines, and even getting multi-GPU/multi-node configurations working with NCCL and PyTorch DDP, for example, is pretty slick. Have you experienced this recently?
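
One reason recent setups feel slick is that modern pip wheels of PyTorch bundle their own CUDA/cuDNN libraries. A quick way to check what a given install actually uses (a sketch assuming PyTorch; the function name is mine, the `torch` attributes are real):

```python
def cuda_stack_versions():
    """Report the CUDA/cuDNN versions bundled with the installed torch,
    or None values if torch isn't available at all."""
    try:
        import torch
    except ImportError:
        return {"torch": None, "cuda": None, "cudnn": None}
    return {
        "torch": torch.__version__,
        # e.g. "12.1"; None on CPU-only builds
        "cuda": torch.version.cuda,
        "cudnn": torch.backends.cudnn.version()
        if torch.backends.cudnn.is_available() else None,
    }

print(cuda_stack_versions())
```

If `cuda` comes back as None on a machine with a GPU, you likely installed a CPU-only wheel rather than hitting a system CUDA mismatch.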

jokethrowaway | 1 year ago

yes, especially if you're trying to run various projects you don't control

some will need specific versions of cuda

right now I've masked cuda from upgrades on my system and I'm stuck on an old version to support some projects

I also had plenty of problems deploying gpu-operator on k8s: that helm chart is so buggy (or maybe just not great at handling some corner cases? no clue) that I ended up swapping kubernetes distributions a few times (no chance to make it work on microk8s; on k3s it almost works) and eventually just installed the drivers + runtime locally and exposed them through the containerd config
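
For reference, the local-install route described above usually ends with a containerd runtime entry along these lines (a sketch based on NVIDIA's documented Container Toolkit setup; the binary path can differ per distro):

```toml
# /etc/containerd/config.toml (fragment) — registers the NVIDIA runtime
# directly, sidestepping gpu-operator entirely.
version = 2

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"
```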

trueismywork | 1 year ago

That's torch's bad software distribution problem. No one can solve it apart from the torch distributors.

amelius | 1 year ago

By the way, can anyone explain why libcudnn takes on the order of gigabytes on my hard drive?

lldb | 1 year ago

Primarily because it has specialized functions for various matrix sizes which are selected at runtime.
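
The idea of shipping many shape-specialized implementations and picking one at run time can be sketched like this (a toy analogy, not cuDNN's actual code; all names are invented, and each registry entry stands in for megabytes of compiled kernel code):

```python
def matmul_generic(a, b):
    """Fallback matrix multiply that works for any compatible sizes."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def matmul_4x4(a, b):
    """Stand-in for a hand-tuned 4x4 kernel; same math here, but in a real
    library each such specialization is separately compiled code on disk."""
    return matmul_generic(a, b)

# Registry of (predicate, kernel) pairs: every extra specialization grows
# the binary, which is roughly why libcudnn is so large.
KERNELS = [
    (lambda a, b: len(a) == 4 and len(b) == 4 and len(b[0]) == 4, matmul_4x4),
]

def matmul(a, b):
    """Dispatch to the first matching specialized kernel, else the fallback."""
    for matches, kernel in KERNELS:
        if matches(a, b):
            return kernel(a, b)
    return matmul_generic(a, b)
```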