m4r1k | 3 months ago
To quote The Next Platform: "An Ironwood cluster linked with Google’s absolutely unique optical circuit switch interconnect can bring to bear 9,216 Ironwood TPUs with a combined 1.77 PB of HBM memory... This makes a rackscale Nvidia system based on 144 “Blackwell” GPU chiplets with an aggregate of 20.7 TB of HBM memory look like a joke."
Nvidia may have the superior architecture at the single-chip level, but for large-scale distributed training (and inference) they currently have nothing that rivals Google's optical switching scalability.
calaphos|3 months ago
For example, the currently very popular Mixture of Experts architectures require a lot of all-to-all traffic (for expert parallelism), which works much better on a switched NVLink fabric than on a torus, where traffic has to traverse multiple links.
zamadatix|3 months ago
markhahn|3 months ago
Bisection bandwidth is a useful metric, but is hop count? Per-hop cost tends to be pretty small.
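The hop-count question can be put in rough numbers: average shortest-path hops between chip pairs in a wraparound 3D torus versus a fully switched, single-hop fabric. The 8×8×8 geometry below is an illustrative assumption, not a real TPU pod layout.

```python
# Sketch: average hop count between chip pairs in a 3D torus
# vs. a fully switched (single-hop) fabric. Illustrative geometry only.
from itertools import product

def torus_hops(a, b, dims):
    # Shortest path per dimension wraps around the ring.
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

dims = (8, 8, 8)  # illustrative 512-chip torus
nodes = list(product(*[range(d) for d in dims]))
# Sum over all ordered pairs (self-pairs contribute 0 hops).
total = sum(torus_hops(a, b, dims) for a in nodes for b in nodes)
avg = total / (len(nodes) * (len(nodes) - 1))
print(f"avg hops in {dims} torus: {avg:.2f}  (fully switched fabric: 1)")
```

For this toy 512-chip torus the average path is about 6 hops, so even a small per-hop cost gets multiplied several-fold relative to a one-hop switched domain; whether that matters depends on how much of the message cost is per-hop versus fixed.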
benreesman|3 months ago
NVFP4 is, to put it mildly, a masterpiece: the UTF-8 of its domain, and in strikingly similar ways it is (1) general, (2) robust to gross misuse, and (3) not optional if success and cost both matter.
It's not a gap that can be closed by a process node or an architecture tweak: it's an order of magnitude where the polynomials that were killing you on the way up are now working for you.
sm_120 (what NVIDIA's quiet repos call CTA1) consumer gear does softmax attention and projection/MLP blockscaled GEMM at a bit over a petaflop at 300W and close to two (dense) at 600W.
This changes the whole game, and it's not clear anyone outside the lab even knows the new equilibrium points. It's nothing like Flash3 on Hopper: a lot of stuff looks FLOPs-bound, and GDDR7 looks like a better deal than HBM3e. The DGX Spark is in no way deficient; it has ample memory bandwidth.
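For readers unfamiliar with the format: NVFP4 stores 4-bit E2M1 elements with a fine-grained scale per small block of values (NVIDIA's write-up describes 16-element blocks with FP8 E4M3 scales, plus a tensor-level scale). A minimal NumPy sketch of the block-scaling idea, with the scale kept in FP32 for simplicity:

```python
import numpy as np

# Representable magnitudes of FP4 E2M1, the NVFP4 element format:
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(x, block=16):
    # Per-block scale chosen so the block's max maps to FP4's max (6.0).
    # Real NVFP4 stores this scale in FP8 E4M3; FP32 here for simplicity.
    xb = x.reshape(-1, block)
    scale = np.abs(xb).max(axis=1, keepdims=True) / 6.0
    scale[scale == 0] = 1.0
    scaled = xb / scale
    # Round each scaled element to the nearest representable FP4 magnitude.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    return np.sign(scaled) * FP4_GRID[idx] * scale  # dequantized

rng = np.random.default_rng(0)
w = rng.standard_normal(64).astype(np.float32)
wq = quantize_block(w)
print("max abs round-trip error:", np.abs(w.reshape(-1, 16) - wq).max())
```

The per-block scale is what makes the format robust to gross misuse: an outlier only degrades resolution for its own 16 neighbors rather than the whole tensor.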
This has been in the pipe for something like five years and even if everyone else started at the beginning of the year when this was knowable, it would still be 12-18 months until tape out. And they haven't started.
Years Until Anyone Can Compete With NVIDIA is back up to the 2-5 it was 2-5 years ago.
This was supposed to be the year ROCm and the new Intel stuff became viable.
They had a plan.
Voultapher|3 months ago
So if we look at what NVIDIA has to say about NVFP4, it sure sounds impressive [1]. But look closely: that initial graph never compares FP8 and FP4 on the same hardware. They jump from H100 to B200 while implying a 5x speedup from going to FP4, which it isn't. This is accompanied by scary words about MXFP4, like "Risk of noticeable accuracy drop compared to FP8".
Contrast that with what AMD has to say about the open MXFP4 approach, which is quite similar to NVFP4 [2]. Oh, the horror of getting 79.6 instead of 79.9 on GPQA Diamond when using MXFP4 instead of FP8.
[1] https://developer.nvidia.com/blog/introducing-nvfp4-for-effi...
[2] https://rocm.blogs.amd.com/software-tools-optimization/mxfp4...
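The main structural differences between the two formats are block size and scale encoding: MXFP4 uses 32-element blocks with power-of-two (E8M0) scales, while NVFP4 uses 16-element blocks with FP8 (E4M3) scales. A rough NumPy sketch of what that costs in round-trip error, with scale handling simplified (real implementations differ in rounding details):

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fp4_round(x):
    # Nearest representable FP4 (E2M1) value, sign handled separately.
    idx = np.abs(np.abs(x)[..., None] - FP4_GRID).argmin(axis=-1)
    return np.sign(x) * FP4_GRID[idx]

def quant_error(x, block, pow2_scale):
    xb = x.reshape(-1, block)
    s = np.abs(xb).max(axis=1, keepdims=True) / 6.0
    if pow2_scale:
        # MX-style E8M0 scale: power of two, rounded up to stay in range.
        s = 2.0 ** np.ceil(np.log2(s))
    xq = fp4_round(xb / s) * s
    return float(np.sqrt(np.mean((xb - xq) ** 2)))

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
err_mx = quant_error(x, 32, pow2_scale=True)
err_nv = quant_error(x, 16, pow2_scale=False)
print(f"RMS error, MX-style: {err_mx:.4f}  NVFP4-style: {err_nv:.4f}")
```

On this toy Gaussian data the finer blocks and non-power-of-two scales buy a modestly lower error; whether that gap matters downstream is exactly what the GPQA numbers above are arguing about.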
mrbungie|3 months ago
[1] https://x.com/nvidianewsroom/status/1993364210948936055
bigyabai|3 months ago
The tweet gives their justification: CUDA isn't an ASIC. Nvidia GPUs were popular for crypto mining and protein folding, and now AI inference too. TPUs are tensor ASICs.
FWIW I'm inclined to agree with Nvidia here. Scaling up a systolic array is impressive but nothing new.
almostgotcaught|3 months ago
a generation is 6 months
7e|3 months ago
It’s better to have a faster, smaller network for model parallelism and a larger, slower one for data parallelism than a very large, but slower, network for everything. This is why NVIDIA wins.
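A back-of-envelope sketch of that split, with all model and batch sizes as illustrative assumptions: data parallelism needs one large gradient all-reduce per optimizer step (easy to overlap, tolerant of a slower network), while tensor parallelism issues a couple of collectives per layer per pass, all on the critical path.

```python
# All model/batch sizes below are illustrative assumptions.
params = 70e9              # parameters
grad_bytes = 2             # bf16 gradients
dp_workers = 8

# Data parallel: one ring all-reduce over all gradients per optimizer
# step; a ring moves ~2*(n-1)/n of the payload per worker.
dp_volume = 2 * (dp_workers - 1) / dp_workers * params * grad_bytes
dp_collectives = 1

# Tensor parallel: ~2 all-reduces per layer per pass (forward + backward),
# each over the activations (batch_tokens x hidden, bf16).
layers, batch_tokens, hidden = 80, 16384, 8192
tp_collectives = 4 * layers
tp_volume = tp_collectives * batch_tokens * hidden * 2

print(f"DP: {dp_collectives} collective/step, {dp_volume / 1e9:.0f} GB")
print(f"TP: {tp_collectives} collectives/step, {tp_volume / 1e9:.0f} GB")
```

The total volumes end up in the same ballpark, but the tensor-parallel traffic is split across hundreds of synchronous, latency-bound collectives per step, which is why it wants the fast local fabric while the data-parallel all-reduce tolerates slower links.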
sheepscreek|3 months ago
Even now I get capacity-related error messages, so many days after the Gemini 3 launch. Also, Jules is basically unusable. Maybe Gemini 3 is a bigger resource hog than anyone outside of Google realizes.
m4r1k|3 months ago
While the B200 wins on raw FP8 throughput (~9000 vs 4614 TFLOPs), that makes sense given NVIDIA has optimized for the single-chip game for over 20 years. But the bottleneck here isn't the chip—it's the domain size.
NVIDIA's top-tier NVL72 tops out at an NVLink domain of 72 Blackwell GPUs. Meanwhile, Google is connecting 9216 chips at 9.6Tbps to deliver nearly 43 ExaFlops. NVIDIA has the ecosystem (CUDA, community, etc.), but until they can match that interconnect scale, they simply don't compete in this weight class.
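The aggregate numbers check out arithmetically, taking the per-chip figures quoted in this thread at face value (~9000 TFLOPS FP8 for B200, 4614 for Ironwood):

```python
# Per-chip figures as quoted in this thread; treat them as assumptions.
ironwood_chips, ironwood_tflops = 9216, 4614
nvl72_gpus, b200_tflops = 72, 9000

pod_ef = ironwood_chips * ironwood_tflops / 1e6   # TFLOPS -> ExaFLOPS
rack_ef = nvl72_gpus * b200_tflops / 1e6

print(f"Ironwood pod: {pod_ef:.1f} EF, NVL72 rack: {rack_ef:.3f} EF, "
      f"ratio: {pod_ef / rack_ef:.0f}x")
```

That is the "nearly 43 ExaFlops" figure above, roughly 65x an NVL72 rack in a single coherent domain, per-chip throughput deficit notwithstanding.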