Latency and cache coherency are the other things that make this hard. Cache coherency can theoretically be resolved by CXL, so maybe we’ll get there that way.
AI models do not need coherent memory, the access pattern is regular enough that you can make do with explicit barriers.
As long as the GPU-local memory can hold a couple layers at a time, I don't think the latency to the currently-inactive layers matters very much, only the bandwidth.
Tuna-Fish|1 year ago
The bigger problem is that by the time PCIe 7.0 is actually available, 242 GB/s per direction will probably not be sufficient for anything interesting.
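For reference, the 242 GB/s figure falls out of the link math. A sketch (the ~5% overhead split is my own estimate, not a spec number):

```python
# Where ~242 GB/s comes from: PCIe 7.0 targets 128 GT/s per lane, so a x16
# link carries 256 GB/s of raw bandwidth per direction; flit/FEC/header
# overhead trims the usable payload rate to the commonly quoted ~242 GB/s.
gt_per_s = 128                      # PCIe 7.0 per-lane signaling rate
lanes = 16
raw_gbs = gt_per_s * lanes / 8      # GT/s ~ Gb/s per lane, so /8 for GB/s
usable_gbs = 242                    # the commonly quoted usable figure
overhead_pct = 100 * (1 - usable_gbs / raw_gbs)
print(f"raw: {raw_gbs:.0f} GB/s, usable: {usable_gbs} GB/s, "
      f"overhead: {overhead_pct:.1f}%")
```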
paulmd|1 year ago
response to both this and the sibling: even for training, I remember speculation that explicit synchronization might not be needed at all, especially in the middle stages of training (past the early part, before fine-tuning). It's not "correct", but gradient descent will eventually fix it anyway, as long as the error signal doesn't exceed the rate of descent. And "error signal" here isn't just the noise itself but the error introduced into the descent by the missing sync - if the average delta per step is very small, so is the error each stale update introduces, right?
actually iirc there was some thought that the noise might even help bounce the model out of local optima. little bit of a simulated-annealing idea there.
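To make the hand-wave concrete, here's a toy sketch (mine, not from the thread) of gradient descent where every update applies a gradient computed from stale parameters, standing in for workers that skip explicit synchronization:

```python
# Gradient descent on f(w) = w^2 where each update uses parameters that are
# DELAY steps stale. With a small enough learning rate, the staleness error
# shrinks along with the step size and the run still converges toward w = 0.
LR, DELAY, STEPS = 0.05, 5, 400

w = 10.0                 # start far from the optimum at w = 0
history = [w]
for _ in range(STEPS):
    stale_w = history[max(0, len(history) - 1 - DELAY)]
    grad = 2.0 * stale_w          # gradient of w^2, evaluated on stale params
    w -= LR * grad
    history.append(w)

print(f"start: {history[0]}, end: {w:.2e}")   # converges despite staleness
```

Crank `LR` up so the step size is large relative to the staleness and the same loop diverges - which is exactly the "as long as the error doesn't exceed the rate of descent" condition.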
> The bigger problem is that by the time PCIe 7.0 is actually available, 242 GB/s per direction will probably not be sufficient for anything interesting.
yea it's this - SemiAnalysis/Dylan Patel actually has been doing some great pieces on this.
background:
networks really don't scale past about 8-16 nodes. 8 is a hypercube, that's easy. You can do 16 with ringbus or xy-grid arrangements (although I don't think xy-grid has proven satisfactory for anything except systolic arrays). But as you increase the node count past 8, link count blows out tremendously, bisection bandwidth stagnates, worst-case hop distance grows, etc. So you want tiers, and you want them composed of nodes that are as large as you can make them, because you can't scale the node count infinitely. https://www.cs.cmu.edu/afs/cs/academic/class/15418-s12/www/l...
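A quick way to see the blow-out, using the standard textbook formulas (my sketch, not from the post): compare link count, diameter (worst-case hop distance), and bisection width for a hypercube, a ring, and a 2D mesh as node count grows.

```python
import math

def hypercube(n):          # n must be a power of two
    d = int(math.log2(n))
    return {"links": n * d // 2, "diameter": d, "bisection": n // 2}

def ring(n):
    return {"links": n, "diameter": n // 2, "bisection": 2}

def mesh2d(n):             # n must be a perfect square; no wraparound
    k = math.isqrt(n)
    return {"links": 2 * k * (k - 1), "diameter": 2 * (k - 1), "bisection": k}

for n in (8, 16, 64, 256):
    row = {"hypercube": hypercube(n), "ring": ring(n)}
    if math.isqrt(n) ** 2 == n:
        row["mesh"] = mesh2d(n)
    print(n, row)
```

The ring's bisection width is stuck at 2 no matter how many nodes you add, and the hypercube's per-node link count grows as log2(n) - which is exactly why you stop scaling the flat network and start tiering.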
Past about 16 nodes you just go to a switched fabric - a crossbar or a fat tree or whatever - and the switch is still a big fat chip itself; the 2nd-gen NVSwitch was about as many transistors as a Xeon E5-1650 v1. They are on gen 3 or 4 now, and they use three of these in a giant mixing network (butterfly or fat tree or something) for just titanic amounts of interconnect between 2 racks. I don't even want to know what the switching itself pulls - it's not 300 kW, but it's definitely not insignificant either.
any HPC interconnect really needs to run at a meaningful fraction of memory bandwidth if you want to treat the cluster like a "single system". It doesn't have to be 100%, but it needs to be at least 1/3 or 1/4 of normal memory BW. One of the theories around AMD's MCM patents was that the cache port and the interconnect port should be more or less the same thing - because you need to talk to the interconnect at pretty much the same rate you talk to cache. So a cache chiplet and an interconnect chiplet could be pretty much the same thing in silicon (I guess today we'd say they're both Infinity Link - which is not the same as Infinity Fabric, btw; the key difference again being coherency). But that kinda frames the discussion in terms of the requirements here.
https://hexus.net/tech/news/graphics/147643-amd-patent-outli...
https://community.amd.com/t5/general-discussions/amd-files-p...
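As a sanity check on that 1/3-1/4 rule of thumb (the figures below are approximate public numbers for an H100-class part, not from the post):

```python
# Approximate, illustrative numbers only.
hbm_bw_gbs = 3350        # ~3.35 TB/s of HBM3 bandwidth per GPU
nvlink_gbs = 900         # NVLink 4 aggregate per GPU
pcie5_x16_gbs = 64       # PCIe 5.0 x16, one direction

for name, bw in [("NVLink", nvlink_gbs), ("PCIe 5.0 x16", pcie5_x16_gbs)]:
    print(f"{name}: {bw / hbm_bw_gbs:.2f}x of local memory bandwidth")
```

NVLink lands right around the 1/4-1/3 band the rule asks for; a plain PCIe link comes in more than an order of magnitude short, which is the whole argument.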
anyway, to your point: PCIe/CXL will never be as fast as Ethernet, because the signal-integrity requirements are orders of magnitude tighter. PCIe is a very short-reach link, and it requires a comparatively larger PHY to drive it than Ethernet does for the same bandwidth.
ethernet serdes apparently have 3x the bandwidth of PCIe (and CXL) serdes per mm of beachfront, and GPU networking highly favors bandwidth above almost any other concern (and it utterly doesn't care about latency). The denser you make the bandwidth, the more links (or fatter links) you can fit on a given chip. And more links basically translates to larger networks, meaning more total capacity, better TCO, etc. Sort of a Gustafson's-law thing.
(and it goes without saying that regardless, this all burns a tremendous amount of power. data movement is expensive.)
https://www.semianalysis.com/p/cxl-is-dead-in-the-ai-era
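The beachfront point can be put in numbers (the 3x density ratio is the claim above; the absolute figures are invented for the sketch):

```python
beachfront_mm = 80                       # hypothetical usable die-edge length
pcie_gbps_per_mm = 100                   # hypothetical PCIe/CXL serdes density
eth_gbps_per_mm = 3 * pcie_gbps_per_mm   # the claimed 3x Ethernet advantage
port_gbps = 800                          # one 800G Ethernet-class port

pcie_total = beachfront_mm * pcie_gbps_per_mm
eth_total = beachfront_mm * eth_gbps_per_mm
print(f"PCIe-style edge: {pcie_total / 1000:.1f} Tb/s")
print(f"Eth-style edge:  {eth_total / 1000:.1f} Tb/s "
      f"({eth_total // port_gbps} x 800G ports)")
```

Same silicon edge, 3x the off-chip bandwidth - which you can spend on fatter links or on fanning out to more peers, i.e. a bigger network.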
the shape this is taking is basically computronium: giant chips, massive interconnects between them. It's not that chiplets are a bad idea, but what's better than lots of little chiplets fused into a single processor? Lots of big chiplets fused into a single processor.
And in fact that pattern gets repeated fractally. MI300X and B200 both take two big dies and fuse them together into what feels like a single GPU. Then you take a bunch of those GPUs and fuse those together into a local node via NVSwitch.
https://www.semianalysis.com/p/nvidias-optical-boogeyman-nvl...
Standard HPC stuff... other than the density. They are thinking it might actually scale to at least 300 kW per rack... and efficiency actually improves when you do this (just like packaging!) because data movement is hideously expensive. You absolutely want to keep everything "local" (at every level) and talk over the interconnects as little as possible.
https://www.fabricatedknowledge.com/p/the-data-center-is-the...
MLID interviewed an NVIDIA engineer after RDNA3 came out; iirc they more or less said they looked at chiplets, didn't think it was worth it yet, so they didn't do it - and that they're going to do chiplets in their own way, not constrained to chasing the approaches AMD uses. My interpretation of that quote has always been that they see what they're doing as building a giant GPU out of tons of "chiplets", where each chiplet is an H100 board or whatever. NVLink is their Infinity Link, Ethernet is their Infinity Fabric/IFOP.
The idea of a processor as a bunch of disaggregated tiny chiplets is great for yields, but it's terrible for performance and efficiency. "Tile" in the sense of having 64 tiles on your server processor is dead dead dead; tiles need to be decent-sized chunks of silicon in their own right, because that cuts data movement a ton (and network node count, etc). And while packaging of course lets you stack a couple of dies... it also blows up power consumption in other areas, because if each chiplet is slower then you are moving more data around. The chip might be more efficient, but the system isn't as efficient as just building computronium out of big chunks.
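Rough energy-per-bit arithmetic makes "data movement is expensive" concrete. These pJ/bit values are ballpark figures of the kind commonly cited for modern processes, not measurements from the post; only the ratios matter:

```python
# Ballpark energy cost of moving a bit at each level of the hierarchy.
PJ_PER_BIT = {
    "on-die SRAM":        0.1,   # short local wires
    "cross-chiplet link": 0.5,   # interposer/organic die-to-die
    "off-package HBM":    3.0,
    "board/NIC serdes":   8.0,
}

terabyte_bits = 8e12
for level, pj in PJ_PER_BIT.items():
    joules = terabyte_bits * pj * 1e-12
    print(f"{level:>18}: {joules:6.1f} J per TB moved")
```

An ~80x spread between staying on-die and going over a board-level link is why fewer, bigger chunks of silicon win on system efficiency even when each individual die yields worse.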
it's been an obvious lesson from the start, even with Ryzen, and RDNA3 should have really driven it home: data movement (cross-CCX/cross-CCD) is both performance-bottlenecking and power-intensive, so making the chiplets too small is a mistake. Navi 32 is barely a viable product, and that's without even hitting the prices most people want to see from it. Driving the 7700 XT down to $349 or $329 is really, really tough (it'll get there during clearance, but they can't do it during the prime of its life), and idle/low-load power sucks. You want the chunks to be at least medium-sized - and really as big as you can afford to make them. Yields get lower the bigger you go, of course, but does anybody care about yielded price right now? Frankly I'm surprised NVIDIA isn't pursuing Cerebras-style wafer-scale.
again, lots of words to say: you want the nodes to be as fat as possible, because you probably only get 8 or 16 nodes per tier anyway. The smaller you make the nodes, the less performance is available at each tier - and that means slower systems with more energy spent moving data. The analyst's claim (not NVIDIA's) is that water-cooled 300 kW racks would be more efficient than current systems.
it's CRAY time! (no nvidia, no!!!)
https://en.wikipedia.org/wiki/Cray-2#/media/File:Cray2.jpg
https://en.wikipedia.org/wiki/File:EPFL_CRAY-II_2.jpg
https://en.wikipedia.org/wiki/File:Cray-2_module_side_view.j...
(e: power consumption was 150-200 kW for the Cray-2, so NVIDIA has a ways to go (currently ~100 kW, rumored 200 kW) to even reach the peak of historical "make it bent so there's less data movement" hyperdense designs. Tbh that makes me suspect the analyst is probably right: it's both possible and might well improve efficiency - but due to data movement this time, rather than latency. Ironic.)
foobiekr|1 year ago
Dylan16807|1 year ago
mlyle|1 year ago