One nice thing about this (and the new offerings from AMD) is that they will be using the "Open Accelerator Module" (OAM) interface, which standardizes the connector used to mount them on baseboards, similar to Nvidia's SXM connections that use MegArray connectors to their baseboards.
With Nvidia, the SXM connection pinouts have always been held proprietary and confidential. For example, P100's and V100's have standard PCI-e lanes connected to one of the two sides of their MegArray connectors, and if you know that pinout you could literally build PCI-e cards with SXM2/3 connectors to repurpose those now obsolete chips (this has been done by one person).
There are thousands, maybe tens of thousands, of P100's you could pick up for literally <$50 apiece these days, which technically gives you more TFLOPS/$ than anything on the market, but they are useless because their interface was never made open, it hasn't been openly reverse engineered, and the OEM baseboards (Dell, Supermicro mainly) are still hideously expensive outside China.
I'm one of those people who finds 'retro-super-computing' a cool hobby and thus the interfaces like OAM being open means that these devices may actually have a life for hobbyists in 8~10 years instead of being sent directly to the bins due to secret interfaces and obfuscated backplane specifications.
Pascal series are cheap because they are CUDA compute capability 6.0 and lack Tensor Cores. Volta (7.0) was the first to have Tensor Cores and in many cases is the bare minimum for modern/current stacks.
See FlashAttention, Triton, etc. as core enabling libraries. Not to mention all of the custom CUDA kernels all over the place. Take all of this and then stack layers on top of them...
Unfortunately there is famously "GPU poor vs GPU rich". Pascal puts you at "GPU destitute" (regardless of assembled VRAM), and outside of implementations like llama.cpp that go to incredible and impressive lengths to support these old archs, you will very quickly run into show-stopping issues that make you wish you had just handed over the money for >= 7.0.
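If you want to check where a given card falls, a minimal sketch using PyTorch's CUDA introspection (assuming a CUDA build of PyTorch is installed):

    import torch

    # Many current libraries (FlashAttention, Triton kernels, ...) assume CUDA
    # compute capability >= 7.0 (Volta); Pascal cards report 6.x and fail those checks.
    if torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability(0)
        verdict = "fine for modern stacks" if (major, minor) >= (7, 0) else "expect show-stoppers"
        print(f"compute capability {major}.{minor}: {verdict}")
    else:
        print("no CUDA device visible")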
I support any use of old hardware but this kind of reminds me of my "ancient" X5690 that has impressive performance (relatively speaking) but always bites me because it doesn't have AVX.
I really like this side of AMD. There's a strategic call somewhere high up to bias towards collaboration with other companies. Sharing the fabric specifications with Broadcom was an amazing thing to see. It's not out of the question that we'll see single chips with chiplets made by different companies attached together.
The price is low because they’re useless (except for replacing dead cards in a DGX); if you had a $40 PCIe AIC-to-SXM adapter, the price would go up a lot.
> I'm one of those people who finds 'retro-super-computing' a cool hobby and thus the interfaces like OAM being open means that these devices may actually have a life for hobbyists in 8~10 years instead of being sent directly to the bins due to secret interfaces and obfuscated backplane specifications.
Very cool hobby. It’s also unfortunate how stringent e-waste rules lead to so much perfectly fine hardware being scrapped. And how the remainder is typically pulled apart to the board / module level for spares. Makes it very unlikely to stumble over more or less complete-ish systems.
As “humble” as NVIDIA’s CEO appears to be, NVIDIA the company (which he’s been running this whole time) made decision after decision with the simple intention of killing off its competition (ATI/AMD). GameWorks is my favorite example: essentially, if you wanted a video game to look as good as possible, you needed an NVIDIA GPU. Those same games played on AMD GPUs just didn’t look as good.
Now that video gaming is secondary (tertiary?) to Nvidia’s revenue stream, they could give a shit which brand gamers prefer. It’s small time now. All that matters is who companies are buying their GPUs from for AI stuff. Break down that CUDA wall and it’s open season. I wonder how they plan to stave that off. It’s only a matter of time before people get tired of writing C++ code to interface with CUDA.
A bit surprised that they're using HBM2e, which is what Nvidia A100 (80GB) used back in 2020. But Intel is using 8 stacks here, so Gaudi 3 achieves comparable total bandwidth (3.7TB/s) to H100 (3.4TB/s) which uses 5 stacks of HBM3. Hopefully the older HBM has better supply - HBM3 is hard to get right now!
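Rough per-stack arithmetic (assuming standard 1024-bit HBM stacks; figures approximate) shows how eight slower stacks keep pace with five faster ones:

    # Rough per-stack bandwidth math (assumes 1024-bit HBM stacks; figures approximate).
    parts = [("Gaudi 3, 8x HBM2e", 3.7e12, 8),    # ~3.7 TB/s aggregate
             ("H100,   5x HBM3 ", 3.35e12, 5)]    # ~3.35 TB/s aggregate
    for name, total_bps, stacks in parts:
        per_stack = total_bps / stacks                 # bytes/s per stack
        per_pin_gbps = per_stack * 8 / 1024 / 1e9      # data rate per pin on a 1024-bit bus
        print(f"{name}: {per_stack / 1e9:.0f} GB/s per stack, ~{per_pin_gbps:.1f} Gb/s per pin")

That works out to roughly 460 GB/s per HBM2e stack (near its ~3.6 Gb/s per-pin ceiling) versus ~670 GB/s per HBM3 stack on the H100, so the extra three stacks are exactly what closes the gap.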
The Gaudi 3 multi-chip package also looks interesting. I see 2 central compute dies, 8 HBM die stacks, and then 6 small dies interleaved between the HBM stacks - curious to know whether those are also functional, or just structural elements for mechanical support.
> A bit surprised that they're using HBM2e, which is what Nvidia A100 (80GB) used back in 2020.
This is one of the secret recipes of Intel. They can use older tech and push it a little further to catch/surpass current gen tech until current gen becomes easier/cheaper to produce/acquire/integrate.
They have done it before: their first quad-core processors were built by merging two dual-core dies (the Q6xxx series), and they created absurdly clocked single-core processors aimed at very niche market segments.
We had not seen it until now because they were asleep at the wheel, and then knocked unconscious by AMD.
This is a bit snarky — but will Intel actually keep this product line alive for more than a few years? Having been bitten by building products around some of their non-x86 offerings where they killed good IP off and then failed to support it… I’m skeptical.
I truly do hope it is successful so we can have some alternative accelerators.
The real question is: how long does it actually have to hang around? With the way this market is going, it probably only has to be supported in earnest for a few years, by which point it'll be so far obsolete that everyone who matters will have moved on.
> What’s Next: Intel Gaudi 3 accelerators' momentum will be foundational for Falcon Shores, Intel’s next-generation graphics processing unit (GPU) for AI and high-performance computing (HPC). Falcon Shores will integrate the Intel Gaudi and Intel® Xe intellectual property (IP) with a single GPU programming interface built on the Intel® oneAPI specification.
I think it's a valid question. Intel has a habit of whispering away anything that doesn't immediately ship millions of units or that they're contractually obligated to support.
I'm not very involved in the broader topic, but isn't the shortage of hardware for AI-related workloads intense enough so as to grant them the benefit of the doubt?
I haven’t read the article but my first question would be “what problem is this accelerator solving?” and if the answer is simply “you can AI without Nvidia”, that’s not good enough, because that’s the pot calling the kettle black. None of these companies is “altruistic” but between the three of them I expect AMD to be the nicest to its customers. Nvidia will squeeze the most money out of theirs, and Intel will leave theirs out to dry when corporate leadership decides it’s a failure.
> Twenty-four 200 gigabit (Gb) Ethernet ports are integrated into every Intel Gaudi 3 accelerator
WHAT‽ It's basically got the equivalent of a 24-port, 200-gigabit switch built into it. How does that make sense? Can you imagine stringing 24 Cat 8 cables between servers in a single rack? Wait: How do you even decide where those cables go? Do you buy 24 Gaudi 3 accelerators and run cables directly between every single one of them so they can all talk 200-gigabit Ethernet to each other?
Also: If you've got that many Cat 8 cables coming out the back of the thing, how do you even access it? You'll have to unplug half of them (better keep track of which was connected to what port!) just to be able to grab the shell of the device in the rack. 24 ports is usually enough to take up the majority of horizontal space in the rack, so maybe this thing requires a minimum of 2-4U just to use it? That would make more sense but not help in the density department.
I'm imagining a lot of orders for "a gradient" of colors of cables so the data center folks wiring the things can keep track of which cable is supposed to go where.
> The Gaudi 3 accelerators inside of the nodes are connected using the same OSFP links to the outside world as happened with the Gaudi 2 designs, but in this case the doubling of the speed means that Intel has had to add retimers between the Ethernet ports on the Gaudi 3 cards and the six 800 Gb/sec OSFP ports that come out of the back of the system board. Of the 24 ports on each Gaudi 3, 21 of them are used to make a high-bandwidth all-to-all network linking those Gaudi 3 devices tightly to each other. Like this:
> As you scale, you build a sub-cluster with sixteen of these eight-way Gaudi 3 nodes, with three leaf switches – generally based on the 51.2 Tb/sec “Tomahawk 5” StrataXGS switch ASICs from Broadcom, according to Medina – that have half of their 64 ports running at 800 Gb/sec pointing down to the servers and half of their ports pointing up to the spine network. You need three leaf switches to do the trick:
> To get to 4,096 Gaudi 3 accelerators across 512 server nodes, you build 32 sub-clusters and you cross-link the 96 leaf switches with three banks of sixteen spine switches, which will give you three different paths to link any Gaudi 3 to any other Gaudi 3 through two layers of network. Like this:
The cabling works out neatly in the rack configurations they envision. The idea here is to use standard Ethernet instead of proprietary InfiniBand (which Nvidia got by acquiring Mellanox). Because each accelerator can reach other accelerators via multiple paths that will (ideally) not be over-utilized, you will be able to perform large operations across them efficiently without your software having to be especially clever about how it manages communication.
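A quick back-of-the-envelope check of the port counts described above (numbers from the quoted article; assuming the spine switches are also 64-port 800G boxes) shows how it lines up:

    # Port bookkeeping for the topology described above (figures from the article;
    # assumes the spine switches are also 64-port 800G devices).
    ports_per_gaudi, port_gbps, gaudis_per_node = 24, 200, 8
    internal = (gaudis_per_node - 1) * 3              # 3 links to each of 7 peers = 21
    external = ports_per_gaudi - internal             # 3 ports per accelerator leave the node
    node_uplink_gbps = gaudis_per_node * external * port_gbps
    osfps_per_node = node_uplink_gbps // 800
    print(f"per node: {node_uplink_gbps} Gb/s out = {osfps_per_node} x 800G OSFP")    # 6

    nodes_per_sub, leafs_per_sub, leaf_ports = 16, 3, 64
    print("leaf downlinks needed / available:",
          nodes_per_sub * osfps_per_node, "/", leafs_per_sub * leaf_ports // 2)       # 96 / 96

    subclusters, spines = 32, 3 * 16
    accelerators = subclusters * nodes_per_sub * gaudis_per_node                      # 4096
    leaf_uplinks = subclusters * leafs_per_sub * (leaf_ports // 2)                    # 3072
    print(accelerators, "accelerators;", leaf_uplinks, "leaf uplinks vs",
          spines * 64, "spine ports")                                                 # 3072 vs 3072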
I've heard InfiniBand is incredibly annoying to procure, among other pain points, so lots of folks are very happy to get RoCE (Ethernet) working instead, even if it is a bit cumbersome.
For Gaudi2, it looks like 21/24 ports are internal to the server. I highly doubt those have actual individual cables. Most likely they're just carried on PCBs like any other signal.
100GbE and faster runs over twinax DAC or fiber rather than twisted pair anyway, so Cat 8 is irrelevant here. The other 3 ports are probably QSFP or something.
Probably not. A 40GB Nvidia A100 is arguably reasonable for a workstation at $6,000. Depending on your definition, an 80GB A100 for $16,000 is still reasonable. I don't see this being cheaper than an 80GB A100. Probably a good bit more expensive, seeing as it has more RAM, compares itself favorably to the H100, and has enough compelling features that it probably doesn't have to (strongly) compete on price.
128GB in one chip seems important with the rise of sparse architectures like MoE. Hopefully these are competitive with Nvidia's offerings, though in the end they will be competing for the same fab space as Nvidia if I'm not mistaken.
I wonder if, with the advent of LLMs being able to spit out perfect corpo-speak, everyone will recenter on succinct, short "here's the gist" writing, as the long version becomes associated with cheap automated output.
Has anyone here bought an AI accelerator to run their AI SaaS service from their home to customers, instead of trying to make a profit on top of OpenAI or Replicate?
Seems like an okay $8,000 - $30,000 investment, and bare metal server maintenance isn’t that complicated these days.
I wonder if someone knowledgeable could comment on OneAPI vs Cuda. I feel like if Intel is going to be a serious competitor to Nvidia, both software and hardware are going to be equally important.
I'm not familiar with the particulars of OneAPI, but it's just a matter of rewriting CUDA kernels into OneAPI. This is pretty trivial for the vast majority of small (<5 LoC) kernels. Unlike AMD, it looks like they're serious about dogfooding their own chips, and they have a much better reputation for their driver quality.
If your metric is memory bandwidth or memory size, then this announcement gives you some concrete information. But - suppose my metric for performance is matrix-multiply-add (or just matrix-multiply) bandwidth. What MMA primitives does Gaudi offer (i.e. type combinations and matrix dimension combinations), and how many of such ops per second, in practice? The linked page says "64,000 in parallel", but that does not actually tell me much.
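Absent published per-shape numbers, the practical fallback is to measure it yourself. A rough, device-agnostic sketch in PyTorch (on Gaudi you would target the "hpu" device via the Habana bridge, and an asynchronous device needs a synchronize call for honest timing):

    import time
    import torch

    def achieved_tflops(m, n, k, dtype=torch.bfloat16, device="cpu", iters=10):
        # An (m x k) @ (k x n) multiply costs ~2*m*n*k FLOPs.
        a = torch.randn(m, k, dtype=dtype, device=device)
        b = torch.randn(k, n, dtype=dtype, device=device)
        for _ in range(3):                      # warm-up
            _ = a @ b
        t0 = time.perf_counter()
        for _ in range(iters):
            _ = a @ b                           # NB: async devices need a sync here
        dt = (time.perf_counter() - t0) / iters
        return 2 * m * n * k / dt / 1e12

    print(f"{achieved_tflops(4096, 4096, 4096):.1f} TFLOPS")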
Gaudi 3 has PCIe 4.0 (vs. PCIe 5.0 on the H100, which gives the H100 2x the host bandwidth). Probably not a deal-breaker but it's strange for Intel (of all vendors) to lag behind in PCIe.
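For reference, the x16 link arithmetic (standard per-lane rates, ignoring protocol overhead):

    # Per-direction PCIe x16 throughput; the per-lane rate doubles each generation.
    lanes = 16
    for gen, gt_per_s in [("PCIe 4.0", 16), ("PCIe 5.0", 32)]:
        # With 128b/130b encoding, GT/s is nearly Gb/s per lane.
        gbytes_per_s = gt_per_s * lanes / 8
        print(f"{gen} x16: ~{gbytes_per_s:.0f} GB/s per direction")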
I liked my 5700XT. That seems to be $200 now. Ran arbitrary code on it just fine. Lots of machine learning seems to be obsessed with amount of memory though and increasing that is likely to increase the price. Also HN doesn't like ROCm much, so there's that.
What else is on the BOM? Volume? At that price you likely want to use whatever resources are on the SoC that runs the thing and work around that. Feel free to e-mail me.
>Intel Gaudi software integrates the PyTorch framework and provides optimized Hugging Face community-based models – the most-common AI framework for GenAI developers today. This allows GenAI developers to operate at a high abstraction level for ease of use and productivity and ease of model porting across hardware types.
What is the programming interface here? This is not CUDA, right? So how is this being done?
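For what it's worth, the advertised path is the PyTorch integration quoted above rather than a CUDA-style kernel language. A minimal sketch of what that looks like, based on how the Habana/SynapseAI PyTorch bridge has been documented (module and device names are from those docs and may have changed):

    import torch
    import torch.nn as nn
    # The PyTorch bridge ships with Intel's Gaudi software stack (SynapseAI).
    import habana_frameworks.torch.core as htcore

    device = torch.device("hpu")              # Gaudi devices are exposed as "hpu"

    model = nn.Linear(1024, 1024).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    x = torch.randn(8, 1024, device=device)
    loss = model(x).sum()
    loss.backward()
    opt.step()
    htcore.mark_step()                        # flush the lazily accumulated graph to the device

Custom kernels are possible through lower-level Gaudi tooling, but the pitch is that most users stay at the PyTorch / Hugging Face level.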
I feel a little misled by the speedup numbers. They are comparing lower-batch-size H100/H200 numbers to higher-batch-size Gaudi 3 numbers for throughput (which is heavily improved by increasing batch size). I feel like there are some inference scenarios where this is better, but it's really hard to tell from the numbers in the paper.
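A toy memory-bound decode model (illustrative numbers only) shows why throughput comparisons at different batch sizes are apples to oranges:

    # Toy roofline for memory-bound LLM decode: each step streams the weights once,
    # roughly independent of batch size, so tokens/s scales ~linearly with batch
    # until compute or KV-cache traffic dominates. Numbers are illustrative.
    weight_bytes = 70e9 * 2          # e.g. a 70B-parameter model held in BF16
    hbm_bandwidth = 3.7e12           # Gaudi 3's quoted ~3.7 TB/s
    step_time = weight_bytes / hbm_bandwidth
    for batch in (1, 8, 64):
        print(f"batch {batch:3d}: ~{batch / step_time:,.0f} tokens/s upper bound")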
> Twenty-four 200 gigabit (Gb) Ethernet ports are integrated into every Intel Gaudi 3 accelerator
How much does a single 200Gbit active (or inactive) fiber cable cost? Probably thousands of dollars, making even the cabling for each card Very Expensive. Never mind the network switches themselves.
Simultaneously impressive and disappointing.
Honestly, I thought the same thing upon reading the name. I'm aware of the reference to Antoni Gaudí, but having the name sound so close to gaudy seems a bit unfortunate. Surely they must've had better options? Then again I don't know how these sorts of names get decided anymore.
brookst|1 year ago
But I'm not sure how big and how expensive a 24-channel Cat 8 snake would be (!).
latchkey|1 year ago
I hope to work on this for AMD MI300x soon. My company just got added to the MLCommons organization.
1024core|1 year ago
I didn't know "terabytes (TB)" was a unit of memory bandwidth...
InvestorType|1 year ago
"The Intel Gaudi 3 accelerator, architected for efficient large-scale AI compute, is manufactured on a 5 nanometer (nm) process"
meragrin_|1 year ago
https://uxlfoundation.org/
cavisne|1 year ago
https://docs.nvidia.com/cuda/parallel-thread-execution/index...
colechristensen|1 year ago
Think prototype consumer product with total cost preferably < $500, definitely less than $1000.
Hugsun|1 year ago
I can't speak to the ease of configuration but know that some people have used these successfully.
InitEnabler|1 year ago
https://www.intel.com/content/dam/www/central-libraries/us/e...
KeplerBoy|1 year ago
Not the best of times for stuff that doesn't fit matrix processing units.
lillecarl|1 year ago
If you're going fiber instead of twinax, it's another order of magnitude and a bit for transceivers, but cables are pretty cheap still.
You seem to be loading negative energy into this release from the get-go.