The vast majority of work in ML isn't people working with CUDA directly - people use open source frameworks like PyTorch and TensorFlow to define a network and train it, and all the frameworks support CUDA as a backend.
michaelt|2 years ago
Other backends are also available, such as CPU-only training. And you can export networks in reasonably-standard formats.
nvidia's moat is much more mature framework support than AMD's; widespread popularity due to that good framework support, ensuring everyone develops on nvidia and thus maintaining their support lead; much faster performance than CPU-only training; and a price that, though high, is a lot less than an ML developer's salary.
If you need 24GB of vram and nvidia offers that for $1600 while AMD offers it for $1300, how many compatibility problems do you want to deal with to save a single day's wages?
But nvidia's moat is far from guaranteed. Huge users like OpenAI and Facebook might find improving AMD support pays for itself.
RcouF1uZ4gsC|2 years ago
> Huge users like OpenAI and Facebook might find improving AMD support pays for itself.
At that scale they may actually develop their own hardware a la Google TPU.
If you want to just focus on the AI problem and not on infrastructure, just use NVidia. If you want control and efficiency, design your own. AMD kind of falls in a weird middle ground with respect to the massive companies.
zozbot234|2 years ago
CUDA code can be forward-ported to AMD's HIP, which can be used with the ROCm stack. For a more standards-focused alternative there's also SYCL, which has implementations targeting a variety of hardware backends (including HIP) and may also target Vulkan Compute in the future.
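To make the "forward-ported" point concrete, here is a rough sketch (illustrative only, not compiled here) of a trivial kernel on the HIP side. Every runtime call is a mechanical rename of its CUDA counterpart, which is what ROCm's hipify tools automate:

```cpp
// Illustrative HIP port of a trivial CUDA SAXPY kernel. Device code is
// unchanged; host calls are 1:1 renames (cudaMalloc -> hipMalloc, etc.).
#include <hip/hip_runtime.h>

__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // identical to the CUDA version
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x = nullptr, *y = nullptr;
    hipMalloc(&x, n * sizeof(float));                // was cudaMalloc
    hipMalloc(&y, n * sizeof(float));                // was cudaMalloc
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);  // launch syntax unchanged under hipcc
    hipDeviceSynchronize();                          // was cudaDeviceSynchronize
    hipFree(x);                                      // was cudaFree
    hipFree(y);
    return 0;
}
```

(Buffers are left uninitialized — this only sketches the API mapping, not a full program.)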
PartiallyTyped|2 years ago

StableHLO[1] and IREE[2] are interesting projects that might help AMD here. From [1]:
> Our goal is to simplify and accelerate ML development by creating more interoperability between various ML frameworks (such as TensorFlow, JAX and PyTorch) and ML compilers (such as XLA and IREE).
From there, their goal would most likely be to work with the XLA/OpenXLA teams on XLA[3] and IREE[2] to make ROCm a better backend.
meragrin_|2 years ago
Maybe in some cases, but that doesn't even really matter since hardware support is poor.
[1] https://github.com/openxla/stablehlo
[2] https://github.com/openxla/iree
[3] https://www.tensorflow.org/xla