Why is this called a whitepaper when it is more of a documentation and architecture overview of the cluster? Wow, a Clos topology for networking, very innovative.
Details on NVLink would be great. For example, the needs and problems solved by their custom cables seemingly required by NVLink would be worth a whitepaper.
Don't get me wrong, it is still great that the general public can get a glimpse into Grace Hopper. And they do a good job of simplifying while throwing around mind-boggling numbers (the NVLink bandwidth is insane, though there is no word on latency, which is crucial for remote memory access).
NVDA has spent too much time surrounded by cryptocurrency hacks who published “whitepapers” left and right with zero technical information or innovation. As they say, never get high on your own supply.
What's funny is that even though the DGX GH200 is some of the most powerful hardware available, there's such a voracious demand that it's not gonna be enough to quench it. In fact, this is one of those cases where I think the demand will always outpace supply. Exciting stuff ahead.
I heard Elon say something interesting during the discussion/launch of xAI: "My prediction is that we will go from an extreme silicon shortage today, to probably a voltage-transformer shortage in about a year, and then an electricity shortage in about a year, two years."
I'm not sure about the timeline, but it's an intriguing idea that the rate-limiting resource will soon be electricity. I wonder how true that is and whether we're prepared for it.
He’s just plain wrong about the electricity usage going up because of AI compute.
To a first approximation, the number of silicon wafers going through fabs globally is constant. We won’t suddenly increase chip manufacturing a hundredfold! There aren’t enough fabs or “tools” like the ASML EUV machines for that.
Electricity is used for lots of things, not just compute, and within compute the AI fraction is tiny. We’re ramping up a rounding error to a slightly larger rounding error.
What will increase is global energy demand for overall economic activity as manufacturing and industry are accelerated by AIs.
Anyone who’s played games like Factorio would know intuitively that the only two real inputs to the economy are raw materials and energy. Increases to manufacturing speed need matching increases to energy supply!
It seems unlikely that anyone could afford the number of A100s needed to create an electricity shortage.
If there is an electricity shortage, it is far more likely that ageing infrastructure and rising demand for air conditioning and electric car charging are to blame.
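To put rough numbers on that argument, here is a back-of-the-envelope sketch using the A100 figures quoted in this thread ($10000 and 300W per card). The 1 GW target is purely an illustrative assumption, not a figure from the thread, and the sketch ignores host servers, cooling, and networking, which would only raise the bill.

```python
# Figures quoted in the thread for a single A100; real prices and power vary.
A100_PRICE_USD = 10_000
A100_POWER_W = 300

# Hypothetical amount of extra electrical load (an assumption for illustration).
TARGET_LOAD_W = 1e9  # 1 GW


def gpus_for_load(load_watts: float, watts_per_gpu: float = A100_POWER_W) -> float:
    """Number of GPUs whose combined draw equals the given load."""
    return load_watts / watts_per_gpu


def purchase_cost(gpu_count: float, price_usd: float = A100_PRICE_USD) -> float:
    """Purchase price of that many GPUs, cards only."""
    return gpu_count * price_usd


gpu_count = gpus_for_load(TARGET_LOAD_W)
cost_usd = purchase_cost(gpu_count)

print(f"GPUs needed to draw 1 GW: {gpu_count:,.0f}")
print(f"Purchase cost: ${cost_usd / 1e9:,.1f} billion")
```

Under these assumptions it takes a bit over three million cards, costing tens of billions of dollars, to add a single gigawatt of load.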
The memory and bandwidth numbers are mind blowing. Going to be very hard to catch Nvidia. It’s as if competitors are going through the motions for participation prizes.
AMD has been shipping 128 lanes of PCIe 5.0 on chip. That's about 0.5 TB/s. Getting up to 0.9 TB/s isn't that crazy, but having a big enough fabric & switches to attach to is a huge feat.
I have hope though. CXL switching is going to give the whole industry a very fresh look at interconnect fabrics, as a simpler-to-manage, faster, more direct alternative to PCIe. Should be good.
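The 0.5 figure can be checked with a quick calculation. This is a minimal sketch assuming the standard PCIe 5.0 signalling rate of 32 GT/s per lane with 128b/130b encoding, counting one direction only and ignoring packet and protocol overhead. The variable names are my own.

```python
# PCIe 5.0 raw signalling rate per lane, in transfers (bits) per second.
PCIE5_GT_PER_S = 32e9
# 128b/130b line encoding: 128 payload bits for every 130 bits on the wire.
ENCODING_EFFICIENCY = 128 / 130
LANES = 128

# Usable bytes per second for one lane, one direction.
bytes_per_lane = PCIE5_GT_PER_S * ENCODING_EFFICIENCY / 8
# Aggregate over all lanes, one direction.
total_bytes_per_s = bytes_per_lane * LANES

# NVLink figure quoted in the thread, in bytes per second.
NVLINK_BYTES_PER_S = 0.9e12

print(f"Per lane:  {bytes_per_lane / 1e9:.2f} GB/s")
print(f"128 lanes: {total_bytes_per_s / 1e12:.2f} TB/s")
print(f"NVLink / PCIe ratio: {NVLINK_BYTES_PER_S / total_bytes_per_s:.2f}")
```

The per-direction total comes out at roughly half a terabyte per second, which matches the figure in the comment.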
Personally I worry it's flogging a dead horse with too many constraints, but Ethernet could be rumbling into action again too. The hyperscalers & others created a new Linux Foundation group, the Ultra Ethernet Consortium, to scale up much faster. Still, even at 1 Tbps per link, you'd need a bunch of links (about 7x) of that Ultra Ethernet to match NVLink's 0.9 TB/s GPU interconnect. More radical breaks with Ethernet are needed than line-speed bumps, things that make switches easier to scale out big, if this realm of tech is to be a good systems fabric. https://www.linuxfoundation.org/press/announcing-ultra-ether...
One note on the DGX GH200 architecture that is super interesting to me is that it has inverted the connectivity relationship. Typically a system would have the NIC & GPU hanging off the processor bus, and interconnect traffic would go over that bus (maybe optimizing with p2p-dma to skip going through main memory, if it's fancy). But here? GPUs have a 0.9 TB/s connection to the NVSwitch. If the CPU wants to talk to the cluster, it uses NVLink-C2C to send the data to the GPU, which then uses its NVLink connection to the NVSwitch to send it out. Interesting reversal, interesting flourish, and gee it sure makes sense to me; the GPU is the thing!
Also, past 256 GPUs, there are BlueField-3 devices for Ethernet or InfiniBand connectivity on DGX nodes. Which is a good but also pretty boring/standard SmartNIC-based scale-out strategy.
I wonder how much this thing will cost. The best I've been able to find so far is a 'low 8 digits' estimate in an AnandTech article, but nothing more specific than that.
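For the link-count estimate, the only subtlety is the bits-versus-bytes conversion. A small sketch, taking the 1 Tbps Ethernet link speed and the 0.9 TB/s NVLink figure from the comment at face value and ignoring framing overhead:

```python
import math

# NVLink GPU interconnect figure from the thread, in bytes per second.
NVLINK_BYTES_PER_S = 0.9e12
# Hypothetical Ethernet link speed from the comment, in bits per second.
ETHERNET_BITS_PER_S = 1e12

# Convert the NVLink figure to bits per second so the units match.
nvlink_bits_per_s = NVLINK_BYTES_PER_S * 8

# Fractional number of links, then rounded up to whole links.
links_exact = nvlink_bits_per_s / ETHERNET_BITS_PER_S
links_needed = math.ceil(links_exact)

print(f"NVLink in Tbps: {nvlink_bits_per_s / 1e12:.1f}")
print(f"1 Tbps links required: {links_exact:.1f} (rounded up: {links_needed})")
```

So the "7x" is about right as a ratio, though matching the bandwidth with whole links would take one more.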
Unfortunate that they don't mention the running times for any of the applications they benchmark (e.g., PageRank). Does anyone in the know have some idea how long this takes?
They claim 1.1x to 7x, depending on what you're doing. The 10% to 50% is for the ~10k GPU LLM training, where the main bottleneck tends to be networking:
> DGX GH200 enables more efficient parallel mapping and alleviates the networking communication bottleneck. As a result, up to 1.5x faster training time can be achieved over a DGX H100-based solution for LLM training at scale.
tuetuopay|2 years ago
mmaunder|2 years ago
That’s what a marketing white paper is and does. It’s not an academic paper.
callalex|2 years ago
syntaxing|2 years ago
smodad|2 years ago
jiggawatts|2 years ago
michaelt|2 years ago
An Nvidia A100 costs $10000 and consumes 300W.
callalex|2 years ago
swyx|2 years ago
mmaunder|2 years ago
jauntywundrkind|2 years ago
paskjdfparwerwe|2 years ago
The closest is Google with their TPUs.
jacquesm|2 years ago
https://www.anandtech.com/show/18877/nvidia-grace-hopper-has...
tikkun|2 years ago
[1]: (I wrote this) https://gpus.llm-utils.org/nvidia-h100-gpus-supply-and-deman...
tikkun|2 years ago
luc4sdreyer|2 years ago
LASR|2 years ago
On the LLM frontier, we’re starting to hit the limits of reasoning abilities in the current gen.
paskjdfparwerwe|2 years ago
moab|2 years ago
m3kw9|2 years ago
luc4sdreyer|2 years ago
kvetching|2 years ago