I wonder how they manage to keep the FP64 units busy. This seems to be an HPC product, but many HPC apps are memory bound. So to improve FP64 perf by 4x one might need to improve DRAM bandwidth by 8-16x. Otherwise the units would just stall waiting for memory.
But it seems they did not improve bandwidth by much?
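To put a number on "stalled waiting for memory": a kernel is memory bound whenever its arithmetic intensity (FLOPs per byte of DRAM traffic) falls below the machine balance (peak FLOP/s divided by peak bandwidth). A minimal roofline sketch, with purely illustrative hardware figures (not from any spec sheet):

```python
# Roofline back-of-the-envelope: memory bound means the kernel's
# arithmetic intensity is below the machine balance.

def machine_balance(peak_flops, peak_bw_bytes):
    """FLOPs the machine can sustain per byte of DRAM traffic."""
    return peak_flops / peak_bw_bytes

def is_memory_bound(kernel_intensity, peak_flops, peak_bw_bytes):
    return kernel_intensity < machine_balance(peak_flops, peak_bw_bytes)

# Hypothetical accelerator: 20 TFLOP/s FP64, 2 TB/s DRAM.
print(machine_balance(20e12, 2e12))            # 10.0 FLOPs/byte required

# A stream-like FP64 kernel (e.g. daxpy) does ~2 FLOPs per 24 bytes moved.
print(is_memory_bound(2 / 24, 20e12, 2e12))    # True: stalled on memory

# Quadrupling FP64 compute without touching bandwidth only raises the bar:
print(machine_balance(4 * 20e12, 2e12))        # 40.0 FLOPs/byte required
```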
I don't know anything about the details here, but with the usual linear algebra stuff, bandwidth depends on the size of the kernel that fits into the local memory inside whatever IC you use for your floating-point computation.
E.g. matrix multiplication of n×n square matrices has a computational cost of n³ but a bandwidth cost of n². Usually a big m×m matrix is split into many blocks of n×n matrices (with m = k×n). If an n×n matrix fits into the local store of your CPU (cache or registers), then the bandwidth cost for the m×m matrix product is k³×n² = m³/n, so the bigger the block size n that you can process inside the CPU, the less bandwidth you need.
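Made concrete, that counting looks like this (a sketch; the function name is mine, and constant factors for the second operand and the output block are dropped since they don't change the scaling):

```python
# Traffic count for a blocked m×m matrix product: k³ block-multiplies,
# each touching an n×n block, so traffic is k³·n² = m³/n elements.
def blocked_matmul_traffic(m, n):
    k = m // n                  # blocks per dimension (m = k*n)
    flops = 2 * m**3            # multiply-adds for the full product
    traffic = k**3 * n * n      # = m³/n elements moved from DRAM
    return flops, traffic

for n in (32, 128, 512):
    flops, traffic = blocked_matmul_traffic(4096, n)
    print(n, traffic, flops / traffic)  # FLOPs per element grows with n
```

Doubling the block size n halves the DRAM traffic for the same amount of arithmetic, which is exactly why a bigger local store reduces the bandwidth requirement.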
I disagree. The website you linked to shows speed-ups on the MI250X between 1.6x and 3x over the A100. The theoretical memory bandwidth advantage of the MI250X over the A100 is only 1.6x (3.2 TB/s vs 2.0 TB/s). Thus, I'd say they are seeing the advantage of higher FP64 compute in those applications.
> Seems this is an HPC product, but many HPC apps are memory bound.
The point of a supercomputer is to throw so much compute at a problem, that everything else is the bottleneck.
If an HPC app is memory-bound, then the GPU / Supercomputer was successful at its job. So many HPC apps are memory bound because... well... turns out our machines are actually quite good.
In any case, the MI200 has 1.6x the bandwidth of the A100. So if you have a massively parallel use case that is memory bound, the MI200 line should have an advantage.
-------
The main issue IMO is that the MI200's 1.6x bandwidth is really 80% of the A100's bandwidth per die, across two dies connected with an incredible number of "Infinity Fabric" links to share the data. I have to imagine that the A100's larger monolithic design wins in some cases over the MI200's chiplet design.
I agree with you, which is why I don't really understand what the point of improving FP64 perf by 4x is, if that is not the bottleneck for many apps.
A 4x MI250X node has more or less the same bandwidth as a DGX-A100 (8x A100).
It has 2x more FP64 compute, but for most science and engineering apps, which are memory bound, 2x more FP64 compute does not make these apps any faster.
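For the record, the per-node arithmetic behind that claim, using the public spec-sheet peaks (3.2 TB/s per MI250X, 2.0 TB/s per A100 80GB); the traffic figure is illustrative:

```python
# Per-node peak HBM bandwidth from the public spec sheets.
mi250x_node_bw = 4 * 3.2e12   # 4x MI250X node: 12.8 TB/s
dgx_a100_bw    = 8 * 2.0e12   # DGX-A100 (8x A100 80GB): 16.0 TB/s

# For a memory-bound app, runtime is just bytes moved / bandwidth;
# peak FP64 FLOP/s never enters the expression.
bytes_moved = 100e12          # 100 TB of DRAM traffic, illustrative
print(bytes_moved / mi250x_node_bw)  # 7.8125 s
print(bytes_moved / dgx_a100_bw)     # 6.25 s
```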
my123|4 years ago
They don’t. See https://www.amd.com/en/graphics/server-accelerators-benchmar....
The MI250X, despite packing two big dies, doesn’t do especially well.