Interesting! This was already the case with TPUs easily beating A100s. We sell Stable Diffusion finetuning on TPUs (dreamlook.ai), and people are amazed at how fast and cheaply we can offer it - but there's no big secret, we just use hardware that's strictly faster and cheaper per unit of work.
I expect a new wave of "your task, but on superior hardware" services to crop up with these chips!
The v5es and v5ps are pretty amazing at running SD; we're providing the SD3 code now so it can be optimised on them.
The v5es are particularly interesting given the millions of chips that will land and the large pod sizes; they're especially well suited for million-token context windows.
This is nice for fostering some competition in hardware for model training, but the availability of these machines seems very limited - I don't think any major cloud provider allows per-hour rental of Gaudi2 VMs, and Intel's own site directs you to buy an 8x GPU provisioned server from Supermicro for more than 40k USD. Availability and the software stack are still heavily in Nvidia's favor right now, but maybe by the end of the year that will start changing.
Some analysis of how and/or why it is able to be 3x faster despite no hardware metric being 3x better would make this actually useful and insightful instead of advertising.
Hasn't H100 been shipping in volume for about a year already? Is Gaudi2 even available at comparable scale yet? I wouldn't count Nvidia out until they start slipping on similar timescales, i.e. if B100 doesn't have a clear lead over competing parts that become available at roughly the same time.
I think as we go to enterprise workloads the total cost of ownership becomes important.
NVIDIA is still the best for research given its ecosystem, but once models are standardised, as with transformers/LLaMA and likely multimodal diffusion transformers, it becomes about scale, availability, and cost per FLOP.
It took less than a day to port our code over, and we do custom CUDA across modalities.
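For reference, a PyTorch-to-Gaudi port is usually mostly mechanical. A minimal sketch based on Habana's public PyTorch integration (toy model and data, not the commenter's actual code):

    import torch
    import habana_frameworks.torch.core as htcore  # Habana bridge; registers the "hpu" device

    device = torch.device("hpu")

    # Toy stand-in for a real model/dataloader, just to show where the port touches code.
    model = torch.nn.Linear(512, 10).to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(32, 512, device=device)
        y = torch.randint(0, 10, (32,), device=device)
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()
        htcore.mark_step()   # in lazy mode, flush the accumulated graph after backward
        optimizer.step()
        htcore.mark_step()   # and again after the optimizer step
        optimizer.zero_grad()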
Gaudi2 was actually announced two years ago and is 7nm, like the A100 80GB it was meant to be competitive with. Gaudi3 later this year is probably going to be the inflection point as that ramps. The cost is around 1/3: https://www.intel.com/content/www/us/en/newsroom/news/vision...
The fact that AMD's GPGPU platform is buggy for consumers has more to do with incompetence and product cannibalisation than the difficulty of building properly working drivers. Machine learning uses profoundly simple operations. Building a pytorch backend isn't difficult if the drivers are working properly.
I'm wondering how AI scientists work these days. Do they really hack CUDA kernels, or do they plug models together with high-level toolkits like PyTorch?
Assuming it's the latter, and considering PyTorch takes care of providing optimized backends for various hardware, how big of a moat is CUDA then, really?
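For context on the question, typical high-level training code looks something like the sketch below; whether it runs on CUDA, Gaudi (hpu), or TPU (torch_xla) is decided by the backend PyTorch dispatches to, which is where the vendor's software work lives. A minimal, hypothetical sketch:

    import torch

    # Pick whatever accelerator backend is present; an "hpu" or "xla" device
    # would slot in the same way once its PyTorch backend is installed.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.GELU()).to(device)
    x = torch.randn(8, 256, device=device)
    y = model(x)  # identical user code; the kernels underneath are the vendor's problem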
One question I have that nobody, including an Intel AXG employee, has been able to answer satisfactorily for me is why both Gaudi and Ponte Vecchio exist. Wouldn't Intel have better chances of success if they focused on one product line?
Gaudi was brought into Intel via an acquisition. Ponte Vecchio was an internal program. How they both came into being can be explained by a combination of management silos and perhaps pre-existing government obligations tied to Ponte Vecchio.
From my understanding, Gaudi specializes in a specific use case (deep learning/AI) while Ponte Vecchio is more generic HPC. Also, DL/AI accelerators may not work well with newer models so the generic HPC hardware may be the only option for certain models until the DL/AI accelerators have a chance to catch up.
Gaudi3 is supposedly due this year with a 4x bump in BF16 training over Gaudi2. Gaudi is an interesting product. Intel seems to have something pretty decent, but it hasn't seen much of a volume release yet. Maybe that comes with Gaudi3? Not sure exactly what their strategy with it is.
We do know that in 2025 it's supposed to be part of Intel's Falcon Shores HPC XPU. This essentially takes a whole bunch of HPC compute and sticks it all on the same silicon to maximize throughput and minimize latency. Thanks to their tile-based chip strategy they can have many different versions of the chip with different HPC focuses by swapping out different tiles. AI certainly seems to be a major one, but it will be interesting to see what products they come up with.
It was interesting that Aurora used GPU Max, and I'm definitely looking forward to Falcon Shores.
I think Gaudi2 was badly timed and they had to build out the stack. Gaudi3 is where I think we will see mass adoption, given availability, much better price/performance, and a more mature stack.
There is still some weird stuff when using them, but they are surprisingly solid.
I would potentially be interested in a Gaudi-based workstation. Supermicro servers seem good, but they do not have DisplayPort outputs, and jury-rigging them on is not something I'd do.
Frankly, this may be good to level out the market a bit. While it's been fun to see Nvidia rise up through this insanity, it would only be healthy to have others catch up here and there eventually.
Yeah, they train well and very stably, even in int8. MaxText now has LLaMA and Mistral support too, PyTorch/XLA gets 50% MFU with SPMD, and you have some nice stacks like Levanter.
Haven't been too impressed with inference versus TensorRT-LLM, for example, though.
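For readers unfamiliar with the term, MFU (model FLOPs utilisation) is just achieved model FLOPs divided by the hardware's peak FLOPs, using the usual ~6 FLOPs per parameter per token approximation for forward plus backward. A back-of-envelope sketch; all the numbers below are made up, not measurements:

    def mfu(params: float, tokens_per_sec: float, peak_flops_per_sec: float) -> float:
        # ~6 FLOPs per parameter per token covers the forward and backward passes.
        achieved_flops_per_sec = 6.0 * params * tokens_per_sec
        return achieved_flops_per_sec / peak_flops_per_sec

    # e.g. a 7B-parameter model at 50k tokens/s against 4 PFLOP/s of aggregate peak BF16
    print(f"{mfu(7e9, 50_000, 4e15):.1%}")  # -> 52.5% with these placeholder inputs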
Gaudi is a famous name for a reason... the flowing lines and, frankly, the nonsense and silliness in the art and architecture of Gaudi stand for generations as a contrast to the relentless severity of formal classical art (and especially as a contrast to Intel electronic parts).
"For Stable Diffusion 3, we measured the training throughput for the 2B Multimodal Diffusion Transformer (MMDiT) architecture model. Gaudi 2 trained images 1.5x faster than the H100-80GB, and 3x faster than A100-80GB GPU’s when scaled up to 32 nodes. "
It has been amazing watching the groupthink at work on that stock when we just saw the same group do it on TSLA to disastrous effect. A similar "no moat" situation where they simply can't imagine competitors ever existing.
We found CUDA-to-SYCL conversion surprisingly good: https://www.intel.com/content/www/us/en/developer/articles/t...
Isn't that (the 40k+ USD for the 8x Gaudi2 Supermicro server) the price of a single H100?
Genesis Cloud started integration and testing of Gaudi2 quite a while ago. I fully agree with the take of the article.
I can't promise per-hour rental, but for longer terms they are available! (Should you be interested, you can find contact details on the website.)
Now, getting working ones (properly working drivers) is a different story.
To those commenting about "no moat": remember that CUDA is a huge part of it. It's actually HW+SW, and both took a decade to mature, together.
My opinion, based on what I saw those wizards do, is that reproducing the feature set and efficiency of cuDNN/cuBLAS is deeply nontrivial.