Is that true, and is it sustainable? My understanding has been that only the relatively low-level libraries/kernels are written in CUDA, and all the magical algorithms are in Python using various ML libraries. It's like how Intel's BLAS isn't much of a moat -- there are several open-source implementations and you can mix and match. How is CUDA so sticky when most ML devs aren't writing CUDA but something several layers of abstraction above it? Why can't Intel, AMD, Google, or whoever come along and write an adapter from that lowest level to TF, PyTorch, or whatever the framework of the day is?
1024core|2 years ago
A long, long time ago (i.e., the last time AMD was competing with Intel, before the current round), we used Intel's icc in our lab to optimize for Intel CPUs and squeeze as much performance as possible out of them. Then AMD came out with their Athlon(?), and it was an Intel-beater at the time. But AMD never released a compiler for it; I'd bet they had one internally, but we had to rely on plain old GCC.
These hardware companies don't seem to get that kick-ass software can really add wings to their hardware sales. If I were a hardware vendor, I would, if nothing else, open up my hardware's software stack so the community could run with it and create better software, which would result in more hardware sales!
cavisne|2 years ago
This is trickier than it sounds; case in point, AMD's latest chip can't compete on training because they couldn't get the backward pass of Flash Attention 2 working, due to their hardware architecture. [1]
Attempts to abstract at a higher layer have failed so far because that lower layer is really valuable; again, Flash Attention is a good example.
[1] https://www.semianalysis.com/p/amd-mi300-performance-faster-...
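To see why that lower layer is where the value lives, here is a NumPy sketch of the core trick behind FlashAttention-style kernels: instead of materializing the full n x n attention matrix, process keys/values in blocks while carrying an online softmax (running row max and denominator). This is a simplified illustration of the idea, not the actual CUDA kernel, and `block` is just a tile size chosen for the example.

```python
import numpy as np

def naive_attention(Q, K, V):
    # Materializes the full n x n score matrix -- exactly what
    # FlashAttention-style kernels avoid.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, block=32):
    # Online-softmax tiling: visit K/V one block at a time, keeping only
    # per-row running statistics, so peak memory is O(n * block).
    n, d = Q.shape
    m = np.full(n, -np.inf)           # running row max of scores
    l = np.zeros(n)                   # running softmax denominator
    acc = np.zeros((n, V.shape[-1]))  # unnormalized output accumulator
    for j in range(0, K.shape[0], block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        Sj = Q @ Kj.T / np.sqrt(d)
        m_new = np.maximum(m, Sj.max(axis=-1))
        scale = np.exp(m - m_new)     # rescale old stats to the new max
        P = np.exp(Sj - m_new[:, None])
        l = l * scale + P.sum(axis=-1)
        acc = acc * scale[:, None] + P @ Vj
        m = m_new
    return acc / l[:, None]
```

The forward pass shown here is the easy half; the article above argues that wiring the matching backward pass into AMD's hardware is where things broke down, which is why owning this layer is such a moat.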