
Nvidia Announces H100 NVL – Max Memory Server Card for Large Language Models

122 points | neilmovva | 2 years ago | anandtech.com

107 comments


neilmovva|2 years ago

A bit underwhelming - H100 was announced at GTC 2022, and represented a huge stride over A100. But a year later, H100 is still not generally available at any public cloud I can find, and I haven't yet seen ML researchers reporting any use of H100.

The new "NVL" variant adds ~20% more memory per GPU by enabling the sixth HBM stack (previously only five out of six were used). Additionally, GPUs now come in pairs with 600GB/s bandwidth between the paired devices. However, the pair then uses PCIe as the sole interface to the rest of the system. This topology is an interesting hybrid of the previous DGX (put all GPUs onto a unified NVLink graph), and the more traditional PCIe accelerator cards (star topology of PCIe links, host CPU is the root node). Probably not an issue, I think PCIe 5.0 x16 is already fast enough to not bottleneck multi-GPU training too much.

binarymax|2 years ago

It is interesting that Hopper isn't widely available yet.

I have seen some benchmarks from academia but nothing in the private sector.

I wonder if they thought they were moving too fast and wanted to milk Ampere/Ada for as long as possible.

Not having any competition whatsoever means Nvidia can release what they like when they like.

__anon-2023__|2 years ago

Yes, I was expecting a RAM-doubled edition of the H100; this is just a higher-binned version of the same part.

I got an email from vultr, saying that they're "officially taking reservations for the NVIDIA HGX H100", so I guess all public clouds are going to get those soon.

rerx|2 years ago

You can also join a pair of regular PCIe H100 GPUs with an NVLink bridge. So that topology is not so new either.

ksec|2 years ago

>H100 was announced at GTC 2022, and represented a huge stride over A100. But a year later, H100 is still not generally available at any public cloud I can find

You can safely assume an entity bought as many as they could.

ecshafer|2 years ago

I was wondering today if we would start to see the reverse of this. Small ASICs, or some kind of LLM-optimized GPU, for desktops or maybe even laptops or mobile. I think it is evident that LLMs are here to stay and will be a major part of computing for a while. Getting this local, so we aren't reliant on clouds, would be a huge boon for personal computing. Even if it's a "worse" experience, being able to load up an LLM on our computer and tell it to only look at this directory and help out would be cool.

ethbr0|2 years ago

Software/hardware co-evolution. Wouldn't be the first time we went down that road to good effect.

For anything that can be run remotely, it'll always be deployed and optimized server-side first. Higher utilization means better economics.

Then trickle down to local and end user devices if it makes sense.

wmf|2 years ago

Apple, Intel, AMD, Qualcomm, Samsung, etc. already have "neural engines" in their SoCs. These engines continue to evolve to better support common types of models.

Sol-|2 years ago

Why is the sentiment here so much that LLMs will somehow be decentralized and run locally at some point? Has the story of the internet so far not been that centralization has pretty much always won?

01100011|2 years ago

A couple of the big players are already looking at developing their own chips.

enlyth|2 years ago

Please give us consumer cards with more than 24GB VRAM, Nvidia.

It was a slap in the face when the 4090 had the same memory capacity as the 3090.

A6000 is 5000 dollars, ain't no hobbyist at home paying for that.

andrewstuart|2 years ago

Nvidia doesn't want consumers using consumer GPUs for business.

If you are a business user then you must pay Nvidia gargantuan amounts of money.

This is the outcome of a market leader with no real competition - you pay much more for lower power than the consumer GPUs, and you are forced into using their business GPUs through software license restrictions on the drivers.

nullc|2 years ago

Nvidia can't do a large 'consumer' card without cannibalizing their commercial ML business. ATI doesn't have that problem.

ATI seems to be holding the idiot ball.

Port Stable Diffusion and CLIP to their hardware. Train an upsized version sized for a 48GB card. Release a prosumer 48GB card... get huge uptake from artists and creators using the tech.

andrewstuart|2 years ago

GPUs are going to be weird, underconfigured and overpriced until there is real competition.

Whether or not there is real competition depends entirely on whether Intel's Arc line of GPUs stays in the market.

AMD strangely has decided not to compete. Its newest GPU, the 7900 XTX, is an extremely powerful card, close to the top-of-the-line Nvidia RTX 4090 in raster performance.

If AMD had introduced it at an aggressively low price, then they could have wedged Nvidia, which is determined to exploit its market dominance by squeezing the maximum money out of buyers.

Instead, AMD has decided to simply follow Nvidia in squeezing for maximum prices, with AMD prices slightly below Nvidia's.

It's a strange decision from AMD, which is well behind in market share and apparently disinterested in increasing that share by competing aggressively.

So a third player is needed - Intel - since it's a lot harder for three companies to sit on outrageously high prices for years rather than compete with each other for market share.

dragontamer|2 years ago

The root cause is that TSMC raised prices on everyone.

Since Intel GPUs are again TSMC manufactured, you really aren't going to see price improvements unless Intel subsidizes all of this.

enlyth|2 years ago

I suspect that the lack of CUDA is a dealbreaker for too many people when it comes to AMD, with the recent explosion in machine learning.

JonChesterfield|2 years ago

GPUs strike me as absurdly cheap given the performance they can offer. I'd just like them to be easier to program.

brucethemoose2|2 years ago

The really interesting upcoming LLM products are from AMD and Intel... with catches.

- The Intel Falcon Shores XPU is basically a big GPU that can use DDR5 DIMMs directly, hence it can fit absolutely enormous models into a single pool. But it has been delayed to 2025 :/

- AMD has not mentioned anything about the (not delayed) MI300 supporting DIMMs. If it doesn't, it's capped to 128GB, and it's being marketed as an HPC product like the MI200 anyway (which you basically cannot find on cloud services).

Nvidia also has some DDR5 Grace CPUs, but the memory is embedded and I'm not sure how much of a GPU they have. Other startups (Tenstorrent, Cerebras, Graphcore and such) seem to have underestimated the memory requirements of future models.

YetAnotherNick|2 years ago

> DDR5 DIMMs directly

That's the problem. Good DDR5 RAM tops out below 100GB/s of memory bandwidth, while Nvidia's HBM reaches up to 2TB/s, and memory bandwidth is still the bottleneck for most applications.
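
To make that concrete: for single-stream decoding, each generated token has to stream essentially all of the weights from memory, so tokens/s is bounded by roughly bandwidth divided by model size. A rough sketch (the model size and bandwidths below are assumptions for illustration):

    # Rough upper bound on single-stream decode speed:
    # every generated token reads all weights from memory once.
    # Illustrative numbers, not measured figures.
    def max_tokens_per_s(model_bytes, mem_bw_bytes_per_s):
        return mem_bw_bytes_per_s / model_bytes

    model_bytes = 70e9 * 2  # e.g. a 70B-parameter model in fp16
    print("DDR5 system (~100 GB/s):", round(max_tokens_per_s(model_bytes, 100e9), 2), "tok/s")
    print("HBM GPU (~2 TB/s):", round(max_tokens_per_s(model_bytes, 2e12), 2), "tok/s")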

virtuallynathan|2 years ago

Grace can be paired with Hopper via a 900GB/s NVLink bus (500GB/s memory bandwidth), 1TB of LPDDR5 on the CPU and 80-94GB of HBM3 on the GPU.

int_19h|2 years ago

I wonder how soon we'll see something tailored specifically for local applications. Basically just tons of VRAM to be able to load large models, but not bleeding edge perf. And eGPU form factor, ideally.

frankchn|2 years ago

The Apple M-series CPUs with unified RAM are interesting in this regard. You can get a 16-inch MBP with an M2 Max and 96GB of RAM for $4300 today, and I expect the M2 Ultra to go to 192GB.
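
As a minimal sketch of what that looks like in practice (assuming PyTorch built with MPS support; the sizes are arbitrary):

    # Minimal check that PyTorch can allocate on Apple's unified memory via MPS.
    # With unified memory the GPU shares the same pool as the CPU, so model size
    # is limited by total system RAM rather than a fixed VRAM capacity.
    import torch

    if torch.backends.mps.is_available():
        x = torch.randn(8192, 8192, device="mps")  # lives in unified memory
        print("allocated", x.numel() * x.element_size() / 1e9, "GB on", x.device)
    else:
        print("MPS backend not available")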

pixl97|2 years ago

I'm not an ML scientist by any means, but perf seems as important as RAM from what I'm reading. Running prompts with an internal chain of thought (eating up more TPU time) appears to give much better output.

aliljet|2 years ago

I'm super duper curious if there are ways to glob together VRAM between consumer-grade hardware to make this whole market more accessible to the common hacker?

rerx|2 years ago

You can, for instance, connect two RTX 3090 with an NVLink bridge. That gives you 48 GB in total. The 4090 doesn't support NVLink anymore.
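
A toy sketch of how that pooling works in practice: put different layers on different devices so the two 24 GB cards behave like one 48 GB pool. This is a hand-rolled illustration; real setups usually let a library such as Hugging Face Accelerate place layers automatically.

    # Naive model parallelism: first half of the layers on cuda:0, second half
    # on cuda:1. Activations cross the NVLink bridge (or PCIe) between halves.
    import torch
    import torch.nn as nn

    class SplitModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.first = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(8)]).to("cuda:0")
            self.second = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(8)]).to("cuda:1")

        def forward(self, x):
            x = self.first(x.to("cuda:0"))
            return self.second(x.to("cuda:1"))

    model = SplitModel()
    print(model(torch.randn(2, 4096)).shape)  # torch.Size([2, 4096])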

bick_nyers|2 years ago

I remember reading about a guy who soldered 2GB VRAM modules on his 3060 12GB (replacing the 1GB modules) and was able to attain 24GB on that card. Or something along those lines.

metadat|2 years ago

How is this card (which is really two physical cards occupying 2 PCIe slots) exposed to the OS? Does it show up as a single /dev/gfx0 device, or is the unification a driver trick?

rerx|2 years ago

The two cards show as two distinct GPUs to the host, connected via NVLink. Unification / load balancing happens via software.
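
A quick way to see this from a framework's point of view (a sketch, assuming PyTorch on a machine with such a pair installed): the devices enumerate separately, and peer-to-peer access indicates they can address each other's memory directly over the bridge.

    # The NVL pair enumerates as two ordinary CUDA devices; peer access tells
    # you whether one GPU can read the other's memory directly (NVLink here).
    import torch

    n = torch.cuda.device_count()
    print("visible GPUs:", n)
    for i in range(n):
        for j in range(n):
            if i != j:
                print(f"GPU {i} -> GPU {j} peer access:",
                      torch.cuda.can_device_access_peer(i, j))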

sargun|2 years ago

What exactly is an SXM5 socket? It sounds like a PCIe competitor, but proprietary to nvidia. Looking at it, it seems specific to nvidia DGX (mother?)boards. Is this just a "better" alternative to PCIe (with power delivery, and such), or fundamentally a new technology?

koheripbal|2 years ago

Yes to all your questions. It's specifically designed for commercial compute servers. It provides significantly more bandwidth and speed over PCIe.

It's also enormously more expensive and I'm not sure if you can buy it new without getting the nvidia compute server.

0xbadc0de5|2 years ago

It's one of those /If you have to ask, you can't afford it/ scenarios.

tromp|2 years ago

The TDP row in the comparison table must be in error. It shows the card with dual GH100 GPUs at 700W and the one with a single GH100 GPU at 700-800W ?!

rerx|2 years ago

That's the SXM version, used for instance in servers like the DGX. It's also faster than the PCIe variation.

0xbadc0de5|2 years ago

So it's essentially two H100's in a trenchcoat? (plus a sprinkling of "latest")

ipsum2|2 years ago

I would sell a kidney for one of these. It's basically impossible to train language models on a consumer 24GB card. The jump up is the A6000 ADA, at 48GB for $8,000. This one will probably be priced somewhere in the $100k+ range.
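
Rough arithmetic on why 24GB (or even 48GB) runs out so fast for training - the 16 bytes/parameter figure is the usual mixed-precision-plus-Adam rule of thumb, stated here as an assumption and excluding activations:

    # Rule of thumb: fp16 weights + fp16 grads + fp32 master weights + Adam
    # momentum and variance comes to roughly 16 bytes per parameter, before
    # counting activations. Illustrative only.
    def training_memory_gb(num_params, bytes_per_param=16):
        return num_params * bytes_per_param / 1e9

    for billions in (1, 7, 13):
        gb = training_memory_gb(billions * 1e9)
        print(f"{billions}B params: ~{gb:.0f} GB of weight/optimizer state")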

YetAnotherNick|2 years ago

Use 4 consumer-grade 4090s then. It would be much cheaper and better in almost every aspect. Even so, forget about training foundational models: Meta spent 82k GPU hours on the smallest LLaMA and 1M hours on the largest.

solarmist|2 years ago

You think? It’s double 48 GB (per card) so why wouldn’t it be in the $20k range?

eliben|2 years ago

NVIDIA is selling shovels in a gold rush. Good for them. Their P/E of 150 is frightening, though.

jiggawatts|2 years ago

I was just saying to a colleague the day before this announcement that the inevitable consequence of the popularity of large language models will be GPUs with more memory.

Previously, GPUs were designed for gamers, and no game really "needs" more than 16 GB of VRAM. I've seen reviews of the A100 and H100 cards saying that the 80GB is ample for even the most demanding usage.

Now? Suddenly GPUs with 1 TB of memory could be immediately used, at scale, by deep-pocket customers happy to throw their entire wallets at NVIDIA.

This new H100 NVL model is a Frankenstein's monster stitched together from whatever they had lying around. It's a desperate move to corner the market as early as possible. It's just the beginning, a preview of the times to come.

There will be a new digital moat, a new capitalist's empire, built upon the scarcity of cards "big enough" to run models that nobody but a handful of megacorps can afford to train.

In fact, it won't be enough to restrict access by making the models expensive to train. The real moat will be models too expensive to run. Users will have to sign up, get API keys, and stand in line.

"Safe use of AI" my ass. Safe profits, more like. Safe monopolies, safe from competition.

g42gregory|2 years ago

I wonder how this compares to AMD Instinct MI300 128GB HBM3 cards?

tpmx|2 years ago

Does AMD have a chance here in the short term (say 24 months)?

Symmetry|2 years ago

AMD seems to be focusing on traditional HPC; they've got a ton of 64-bit FLOPS in their recent commercial models. I expect their server GPUs are mostly for chasing supercomputer contracts, which can be pretty lucrative, while they cede model training to Nvidia.

garbagecoder|2 years ago

Sarah Connor is totally coming for NVIDIA.