top | item 38887799

sailplease | 2 years ago

"Based on these public on-demand quoted prices from AWS and IDC, we found that the Intel® Gaudi® 2 has the best training performance-per-dollar, with an average advantage of 4.8x vs the NVIDIA A100-80GB, 4.2x vs. the NVIDIA A100-40GB, and 5.19x vs. the NVIDIA H100"

ShamelessC | 2 years ago

Seems there's some friction in porting software, as you have to use their build of PyTorch. They claim you just have to change the device you specify in `.to(device)` calls, but if someone could verify that it would be appreciated. My experience with porting software to Google's TPUs or AMD GPUs has not been great.
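For context, the claimed porting story boils down to swapping the device string. A minimal sketch of what that looks like, assuming Habana's PyTorch bridge (`habana_frameworks.torch`) is installed and exposes the `"hpu"` device as their docs describe; the fallback logic here is illustrative, not part of their API:

```python
def pick_device() -> str:
    """Pick a torch device string, preferring Gaudi's "hpu" if present.

    Assumption: importing habana_frameworks.torch.core registers the
    "hpu" device with PyTorch, so existing code only needs
    model.to(pick_device()) instead of a hard-coded .to("cuda").
    """
    try:
        import habana_frameworks.torch.core  # noqa: F401  (Gaudi bridge)
        return "hpu"
    except ImportError:
        pass
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass
    return "cpu"  # safe fallback when neither stack is available
```

In principle a codebase that routes every `.to(...)` through one helper like this is the easy case; the friction people report comes from everything that bypasses it (custom CUDA kernels, `torch.cuda.*` calls, and so on).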

kkielhofner | 2 years ago

I fully support Nvidia competition. Their monopoly is bad news for a variety of reasons, obviously.

However, as you note many of these implementations (Intel, AMD, Google TPU, etc) are more or less at the “get PyTorch to kind of work” stage.

I don’t know of many/any real world applications that are “vanilla” PyTorch at this point.

Stuff like Flash Attention (2), HF accelerate/optimum, distributed training implementations, Deepspeed, custom CUDA kernels all over the place, TensorRT, PyTorch 2 compile, SDPA, serving frameworks, etc. The software stacks, and the resulting functionality, usability, and performance CUDA "owns", are truly endless.

Any real project or implementation I’ve touched in the last year is so intertwined and dependent on CUDA it’s mind blowing and the CUDA lead is only increasing.

Take AMD/ROCm as one example: when you finally kind of get things to sort of work, even though the hardware is potentially competitive on paper, the software ecosystem is so far behind that you're happy to pay the "Nvidia tax". Not only is CUDA significantly smoother overall, but the endless stacks of CUDA-optimized software make any allegedly comparable implementations run at a fraction of the speed while also burning dev time left and right.

Love or hate Nvidia, the 15-year investment in CUDA and its dominance are very apparent to anyone who's actually working with this stuff and just trying to get something done.

Again, as you note it’s interesting to watch observers/casual users claim these implementations are competitive because in my experience you get even one level deeper and it’s a complete nightmare. I try ROCm every couple of months and end up laughing and/or shaking my head at just how far behind it is (after six years).

I’m really rooting for them but the reality is these CUDA “competitors” have a very very long way to go.

ilaksh | 2 years ago

I looked in their Intel Developer Cloud and saw the $10.42/hr 8x instance, but no individual 1x Gaudi 2 that I could see. The $1.30/hr could be okay for some inference use cases if it were available, although for what I had in mind, llama.cpp is not going to work anyway.