One nice thing about this (and the new offerings from AMD) is that they will be using the "Open Accelerator Module" (OAM) interface, which standardizes the connector used to mount them on baseboards, similar to Nvidia's SXM connections that use MegArray connectors to their baseboards.
With Nvidia, the SXM connection pinouts have always been held proprietary and confidential. For example, P100's and V100's have standard PCI-e lanes connected to one of the two sides of their MegArray connectors, and if you know that pinout you could literally build PCI-e cards with SXM2/3 connectors to repurpose those now obsolete chips (this has been done by one person).
There are thousands, maybe tens of thousands, of P100's you could pick up for literally <$50 apiece these days, which technically gives you more TFLOPS/$ than anything on the market, but they are useless because their interface was never made open, it hasn't been openly reverse engineered, and the OEM baseboards (Dell, Supermicro mainly) are still hideously expensive outside China.
I'm one of those people who finds 'retro-super-computing' a cool hobby and thus the interfaces like OAM being open means that these devices may actually have a life for hobbyists in 8~10 years instead of being sent directly to the bins due to secret interfaces and obfuscated backplane specifications.
Pascal series are cheap because they are CUDA compute capability 6.0 and lack Tensor Cores. Volta (7.0) was the first to have Tensor Cores and in many cases is the bare minimum for modern/current stacks.
See FlashAttention, Triton, etc. as core enabling libraries. Not to mention all of the custom CUDA kernels all over the place. Take all of this and then stack layers on top of them...
Unfortunately there is famously "GPU poor vs GPU rich". Pascal puts you at "GPU destitute" (regardless of assembled VRAM), and outside of implementations like llama.cpp that go to incredible and impressive lengths to support these old archs, you will very quickly run into show-stopping issues that make you wish you had just handed over the money for >= 7.0.
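If you want to check where a given card falls, a minimal sketch using PyTorch's CUDA introspection (assuming a CUDA build of PyTorch is installed):

    import torch

    # Many current libraries (FlashAttention, Triton kernels, ...) assume CUDA
    # compute capability >= 7.0 (Volta); Pascal cards report 6.x and fail those checks.
    if torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability(0)
        verdict = "fine for modern stacks" if (major, minor) >= (7, 0) else "expect show-stoppers"
        print(f"compute capability {major}.{minor}: {verdict}")
    else:
        print("no CUDA device visible")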
I support any use of old hardware but this kind of reminds me of my "ancient" X5690 that has impressive performance (relatively speaking) but always bites me because it doesn't have AVX.
I really like this side of AMD. There's a strategic call somewhere high up to bias towards collaboration with other companies. Sharing the fabric specifications with Broadcom was an amazing thing to see. It's not out of the question that we'll see single chips with chiplets made by different companies attached together.
The price is low because they’re useless (except for replacing dead cards in a DGX); if you had a $40 PCIe AIC-to-SXM adapter, the price would go up a lot.
> I'm one of those people who finds 'retro-super-computing' a cool hobby and thus the interfaces like OAM being open means that these devices may actually have a life for hobbyists in 8~10 years instead of being sent directly to the bins due to secret interfaces and obfuscated backplane specifications.
Very cool hobby. It’s also unfortunate how stringent e-waste rules lead to so much perfectly fine hardware being scrapped. And how the remainder is typically pulled apart to the board / module level for spares. Makes it very unlikely to stumble over more or less complete-ish systems.
As “humble” as NVIDIA’s CEO appears to be, NVIDIA the company (which he’s been running this whole time) made decision after decision with the simple intention of killing off its competition (ATI/AMD). GameWorks is my favorite example: essentially, if you wanted a video game to look as good as possible, you needed an NVIDIA GPU. Those same games played on AMD GPUs just didn’t look as good.
Now that video gaming is secondary (tertiary?) to Nvidia’s revenue stream, they could give a shit which brand gamers prefer. It’s small time now. All that matters is who companies are buying their GPUs from for AI stuff. Break down that CUDA wall and it’s open season. I wonder how they plan to stave that off. It’s only a matter of time before people get tired of writing C++ code to interface with CUDA.
A bit surprised that they're using HBM2e, which is what Nvidia A100 (80GB) used back in 2020. But Intel is using 8 stacks here, so Gaudi 3 achieves comparable total bandwidth (3.7TB/s) to H100 (3.4TB/s) which uses 5 stacks of HBM3. Hopefully the older HBM has better supply - HBM3 is hard to get right now!
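Rough per-stack arithmetic (assuming standard 1024-bit HBM stacks; figures approximate) shows how eight slower stacks keep pace with five faster ones:

    # Rough per-stack bandwidth math (assumes 1024-bit HBM stacks; figures approximate).
    parts = [("Gaudi 3, 8x HBM2e", 3.7e12, 8),    # ~3.7 TB/s aggregate
             ("H100,   5x HBM3 ", 3.35e12, 5)]    # ~3.35 TB/s aggregate
    for name, total_bps, stacks in parts:
        per_stack = total_bps / stacks                 # bytes/s per stack
        per_pin_gbps = per_stack * 8 / 1024 / 1e9      # data rate per pin on a 1024-bit bus
        print(f"{name}: {per_stack / 1e9:.0f} GB/s per stack, ~{per_pin_gbps:.1f} Gb/s per pin")

That works out to roughly 460 GB/s per HBM2e stack (near its ~3.6 Gb/s per-pin ceiling) versus ~670 GB/s per HBM3 stack on the H100, so the extra three stacks are exactly what closes the gap.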
The Gaudi 3 multi-chip package also looks interesting. I see 2 central compute dies, 8 HBM die stacks, and then 6 small dies interleaved between the HBM stacks - curious to know whether those are also functional, or just structural elements for mechanical support.
> A bit surprised that they're using HBM2e, which is what Nvidia A100 (80GB) used back in 2020.
This is one of the secret recipes of Intel. They can use older tech and push it a little further to catch/surpass current gen tech until current gen becomes easier/cheaper to produce/acquire/integrate.
They have done it before: their first quad-core processors were built by merging two dual-core dies (the Q6xxx series), and they created absurdly clocked single-core processors aimed at very niche market segments.
We had not seen it until now because they were asleep at the wheel, and then knocked unconscious by AMD.
This is a bit snarky — but will Intel actually keep this product line alive for more than a few years? Having been bitten by building products around some of their non-x86 offerings where they killed good IP off and then failed to support it… I’m skeptical.
I truly do hope it is successful so we can have some alternative accelerators.
The real question is: how long does it actually have to hang around? With the way this market is going, it probably only has to be supported in earnest for a few years, by which point it'll be so far obsolete that everyone who matters will have moved on.
> What’s Next: Intel Gaudi 3 accelerators' momentum will be foundational for Falcon Shores, Intel’s next-generation graphics processing unit (GPU) for AI and high-performance computing (HPC). Falcon Shores will integrate the Intel Gaudi and Intel® Xe intellectual property (IP) with a single GPU programming interface built on the Intel® oneAPI specification.
I think it's a valid question. Intel has a habit of whispering away anything that doesn't immediately ship millions of units or that they're contractually obligated to support.
I'm not very involved in the broader topic, but isn't the shortage of hardware for AI-related workloads intense enough so as to grant them the benefit of the doubt?
I haven’t read the article but my first question would be “what problem is this accelerator solving?” and if the answer is simply “you can AI without Nvidia”, that’s not good enough, because that’s the pot calling the kettle black. None of these companies is “altruistic” but between the three of them I expect AMD to be the nicest to its customers. Nvidia will squeeze the most money out of theirs, and Intel will leave theirs out to dry when corporate leadership decides it’s a failure.
> Twenty-four 200 gigabit (Gb) Ethernet ports are integrated into every Intel Gaudi 3 accelerator
WHAT‽ It's basically got the equivalent of a 24-port, 200-gigabit switch built into it. How does that make sense? Can you imagine stringing 24 Cat 8 cables between servers in a single rack? Wait: How do you even decide where those cables go? Do you buy 24 Gaudi 3 accelerators and run cables directly between every single one of them so they can all talk 200-gigabit Ethernet to each other?
Also: If you've got that many Cat 8 cables coming out the back of the thing, how do you even access it? You'll have to unplug half of them (better keep track of which was connected to what port!) just to be able to grab the shell of the device in the rack. 24 ports is usually enough to take up the majority of horizontal space in the rack, so maybe this thing requires a minimum of 2-4U just to use it? That would make more sense but not help in the density department.
I'm imagining a lot of orders for "a gradient" of colors of cables so the data center folks wiring the things can keep track of which cable is supposed to go where.
> The Gaudi 3 accelerators inside of the nodes are connected using the same OSFP links to the outside world as happened with the Gaudi 2 designs, but in this case the doubling of the speed means that Intel has had to add retimers between the Ethernet ports on the Gaudi 3 cards and the six 800 Gb/sec OSFP ports that come out of the back of the system board. Of the 24 ports on each Gaudi 3, 21 of them are used to make a high-bandwidth all-to-all network linking those Gaudi 3 devices tightly to each other. Like this:
> As you scale, you build a sub-cluster with sixteen of these eight-way Gaudi 3 nodes, with three leaf switches – generally based on the 51.2 Tb/sec “Tomahawk 5” StrataXGS switch ASICs from Broadcom, according to Medina – that have half of their 64 ports running at 800 Gb/sec pointing down to the servers and half of their ports pointing up to the spine network. You need three leaf switches to do the trick:
> To get to 4,096 Gaudi 3 accelerators across 512 server nodes, you build 32 sub-clusters and you cross-link the 96 leaf switches with three banks of sixteen spine switches, which will give you three different paths to link any Gaudi 3 to any other Gaudi 3 through two layers of network. Like this:
The cabling works out neatly in the rack configurations they envision. The idea here is to use standard Ethernet instead of proprietary InfiniBand (which Nvidia got by acquiring Mellanox). Because each accelerator can reach other accelerators via multiple paths that will (ideally) not be over-utilized, you will be able to perform large operations across them efficiently without your software having to be especially clever about how it manages communication.
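A quick back-of-the-envelope check of the port counts described above (numbers from the quoted article; assuming the spine switches are also 64-port 800G boxes) shows how it lines up:

    # Port bookkeeping for the topology described above (figures from the article;
    # assumes the spine switches are also 64-port 800G devices).
    ports_per_gaudi, port_gbps, gaudis_per_node = 24, 200, 8
    internal = (gaudis_per_node - 1) * 3              # 3 links to each of 7 peers = 21
    external = ports_per_gaudi - internal             # 3 ports per accelerator leave the node
    node_uplink_gbps = gaudis_per_node * external * port_gbps
    osfps_per_node = node_uplink_gbps // 800
    print(f"per node: {node_uplink_gbps} Gb/s out = {osfps_per_node} x 800G OSFP")    # 6

    nodes_per_sub, leafs_per_sub, leaf_ports = 16, 3, 64
    print("leaf downlinks needed / available:",
          nodes_per_sub * osfps_per_node, "/", leafs_per_sub * leaf_ports // 2)       # 96 / 96

    subclusters, spines = 32, 3 * 16
    accelerators = subclusters * nodes_per_sub * gaudis_per_node                      # 4096
    leaf_uplinks = subclusters * leafs_per_sub * (leaf_ports // 2)                    # 3072
    print(accelerators, "accelerators;", leaf_uplinks, "leaf uplinks vs",
          spines * 64, "spine ports")                                                 # 3072 vs 3072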
I've heard InfiniBand is incredibly annoying to procure, among other pain points, so lots of folks are very happy to get RoCE (Ethernet) working instead, even if it is a bit cumbersome.
For Gaudi2, it looks like 21/24 ports are internal to the server. I highly doubt those have actual individual cables. Most likely they're just carried on PCBs like any other signal.
100GbE and faster runs over twinax DAC or fiber rather than twisted pair anyway, so Cat 8 is irrelevant here. The other 3 ports are probably QSFP or something.
Probably not. A 40GB Nvidia A100 is arguably reasonable for a workstation at $6,000. Depending on your definition, an 80GB A100 for $16,000 is still reasonable. I don't see this being cheaper than an 80GB A100. Probably a good bit more expensive, seeing as it has more RAM, compares itself favorably to the H100, and has enough compelling features that it probably doesn't have to (strongly) compete on price.
128GB in one chip seems important with the rise of sparse architectures like MoE. Hopefully these are competitive with Nvidia's offerings, though in the end they will be competing for the same fab space as Nvidia if I'm not mistaken.
I wonder if, with the advent of LLMs being able to spit out perfect corpo-speak, everyone will recenter on succinct, short "here's the gist" writing, as the long version becomes associated with cheap automated output.
Has anyone here bought an AI accelerator to run their AI SaaS service from their home to customers, instead of trying to make a profit on top of OpenAI or Replicate?
Seems like an okay $8,000 - $30,000 investment, and bare metal server maintenance isn’t that complicated these days.
I wonder if someone knowledgeable could comment on OneAPI vs Cuda. I feel like if Intel is going to be a serious competitor to Nvidia, both software and hardware are going to be equally important.
I'm not familiar with the particulars of OneAPI, but it's just a matter of rewriting CUDA kernels into OneAPI. This is pretty trivial for the vast majority of small (<5 LoC) kernels. Unlike AMD, it looks like they're serious about dogfooding their own chips, and they have a much better reputation for their driver quality.
If your metric is memory bandwidth or memory size, then this announcement gives you some concrete information. But - suppose my metric for performance is matrix-multiply-add (or just matrix-multiply) bandwidth. What MMA primitives does Gaudi offer (i.e. type combinations and matrix dimension combinations), and how many of such ops per second, in practice? The linked page says "64,000 in parallel", but that does not actually tell me much.
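Absent published per-shape numbers, the practical fallback is to measure it yourself. A rough, device-agnostic sketch in PyTorch (on Gaudi you would target the "hpu" device via the Habana bridge, and an asynchronous device needs a synchronize call for honest timing):

    import time
    import torch

    def achieved_tflops(m, n, k, dtype=torch.bfloat16, device="cpu", iters=10):
        # An (m x k) @ (k x n) multiply costs ~2*m*n*k FLOPs.
        a = torch.randn(m, k, dtype=dtype, device=device)
        b = torch.randn(k, n, dtype=dtype, device=device)
        for _ in range(3):                      # warm-up
            _ = a @ b
        t0 = time.perf_counter()
        for _ in range(iters):
            _ = a @ b                           # NB: async devices need a sync here
        dt = (time.perf_counter() - t0) / iters
        return 2 * m * n * k / dt / 1e12

    print(f"{achieved_tflops(4096, 4096, 4096):.1f} TFLOPS")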
Gaudi 3 has PCIe 4.0 (vs. PCIe 5.0 on the H100, which gives the H100 2x the host bandwidth). Probably not a deal-breaker but it's strange for Intel (of all vendors) to lag behind in PCIe.
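For reference, the x16 link arithmetic (standard per-lane rates, ignoring protocol overhead):

    # Per-direction PCIe x16 throughput; the per-lane rate doubles each generation.
    lanes = 16
    for gen, gt_per_s in [("PCIe 4.0", 16), ("PCIe 5.0", 32)]:
        # With 128b/130b encoding, GT/s is nearly Gb/s per lane.
        gbytes_per_s = gt_per_s * lanes / 8
        print(f"{gen} x16: ~{gbytes_per_s:.0f} GB/s per direction")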
I liked my 5700XT. That seems to be $200 now. Ran arbitrary code on it just fine. Lots of machine learning seems to be obsessed with amount of memory though and increasing that is likely to increase the price. Also HN doesn't like ROCm much, so there's that.
What else is on the BOM? Volume? At that price you likely want to use whatever resources are on the SoC that runs the thing and work around that. Feel free to e-mail me.
>Intel Gaudi software integrates the PyTorch framework and provides optimized Hugging Face community-based models – the most-common AI framework for GenAI developers today. This allows GenAI developers to operate at a high abstraction level for ease of use and productivity and ease of model porting across hardware types.
What is the programming interface here? This is not CUDA, right? So how is this being done?
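For what it's worth, the advertised path is the PyTorch integration quoted above rather than a CUDA-style kernel language. A minimal sketch of what that looks like, based on how the Habana/SynapseAI PyTorch bridge has been documented (module and device names are from those docs and may have changed):

    import torch
    import torch.nn as nn
    # The PyTorch bridge ships with Intel's Gaudi software stack (SynapseAI).
    import habana_frameworks.torch.core as htcore

    device = torch.device("hpu")              # Gaudi devices are exposed as "hpu"

    model = nn.Linear(1024, 1024).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    x = torch.randn(8, 1024, device=device)
    loss = model(x).sum()
    loss.backward()
    opt.step()
    htcore.mark_step()                        # flush the lazily accumulated graph to the device

Custom kernels are possible through lower-level Gaudi tooling, but the pitch is that most users stay at the PyTorch / Hugging Face level.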
I feel a little misled by the speedup numbers. They are comparing lower-batch-size H100/H200 numbers to higher-batch-size Gaudi 3 numbers for throughput (which is heavily improved by increasing batch size). I feel like there are some inference scenarios where this is better, but it's really hard to tell from the numbers in the paper.
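A toy memory-bound decode model (illustrative numbers only) shows why throughput comparisons at different batch sizes are apples to oranges:

    # Toy roofline for memory-bound LLM decode: each step streams the weights once,
    # roughly independent of batch size, so tokens/s scales ~linearly with batch
    # until compute or KV-cache traffic dominates. Numbers are illustrative.
    weight_bytes = 70e9 * 2          # e.g. a 70B-parameter model held in BF16
    hbm_bandwidth = 3.7e12           # Gaudi 3's quoted ~3.7 TB/s
    step_time = weight_bytes / hbm_bandwidth
    for batch in (1, 8, 64):
        print(f"batch {batch:3d}: ~{batch / step_time:,.0f} tokens/s upper bound")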
> Twenty-four 200 gigabit (Gb) Ethernet ports are integrated into every Intel Gaudi 3 accelerator
How much does a single 200Gbit active (or inactive) fiber cable cost? Probably thousands of dollars, making even the cabling for each card Very Expensive. Never mind the network switches themselves.
Simultaneously impressive and disappointing.
Honestly, I thought the same thing upon reading the name. I'm aware of the reference to Antoni Gaudí, but having the name sound so close to gaudy seems a bit unfortunate. Surely they must've had better options? Then again I don't know how these sorts of names get decided anymore.
brookst|1 year ago
But I'm not sure how big and how expensive a 24-channel Cat 8 snake would be (!).
latchkey|1 year ago
I hope to work on this for AMD MI300x soon. My company just got added to the MLCommons organization.
1024core|1 year ago
I didn't know "terabytes (TB)" was a unit of memory bandwidth...
InvestorType|1 year ago
"The Intel Gaudi 3 accelerator, architected for efficient large-scale AI compute, is manufactured on a 5 nanometer (nm) process"
meragrin_|1 year ago
https://uxlfoundation.org/
cavisne|1 year ago
https://docs.nvidia.com/cuda/parallel-thread-execution/index...
colechristensen|1 year ago
Think prototype consumer product with total cost preferably < $500, definitely less than $1000.
Hugsun|1 year ago
I can't speak to the ease of configuration but know that some people have used these successfully.
InitEnabler|1 year ago
https://www.intel.com/content/dam/www/central-libraries/us/e...
KeplerBoy|1 year ago
Not the best of times for stuff that doesn't fit matrix processing units.
lillecarl|1 year ago
If you're going fiber instead of twinax, it's another order of magnitude and a bit for transceivers, but cables are pretty cheap still.
You seem to be loading negative energy into this release from the get-go.