johnklos|1 year ago
"Cloud" doesn't mean much more than "computer connected to the Internet".
CaliforniaKarl|1 year ago
That's a bit of an apples-and-oranges comparison. Cloud services normally have different design goals.
HPC workloads are often focused on highly-parallel jobs, with high-speed and (especially) low-latency communication between nodes. Fun fact: In the NVIDIA DGX SuperPOD Reference Architecture[1], each DGX H100 system (which has eight H100 GPUs per system) has four InfiniBand NDR OSFP ports dedicated to GPU traffic. IIRC, each OSFP cage carries two NDR ports (400 Gb/s each), so each of the eight GPUs effectively has its own IB port for GPU-to-GPU traffic.
(NVIDIA's not the only group doing that, BTW: Stanford's Sherlock 4.0 HPC environment[2] also uses multiple NDR ports per system in its GPU-heavy servers.)
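For a rough sense of scale, the arithmetic works out as below. (These figures are from memory, not pulled from [1], so treat them as assumptions.)

    # Back-of-the-envelope compute-fabric bandwidth for one DGX H100 system,
    # assuming: 4 OSFP cages, 2 NDR ports per cage, 400 Gb/s per NDR port,
    # 8 GPUs per system. All figures are assumptions, not from the linked doc.
    osfp_cages = 4
    ndr_ports_per_cage = 2
    gbps_per_ndr_port = 400
    gpus_per_system = 8

    ib_ports = osfp_cages * ndr_ports_per_cage       # 8 IB ports per system
    system_gbps = ib_ports * gbps_per_ndr_port       # 3200 Gb/s per system
    per_gpu_gbps = system_gbps / gpus_per_system     # 400 Gb/s per GPU

    print(f"{ib_ports} IB ports -> {system_gbps} Gb/s per system, "
          f"{per_gpu_gbps:.0f} Gb/s per GPU")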
Solutions like that are not something you'll find at your typical cloud provider.
Early cloud-based HPC-focused solutions centered on workload locality: keeping a job not just within a particular zone but within a particular part of a zone, using things like AWS Placement Groups[3]. More modern Ethernet-based providers publish guides like [4] on supplementing placement groups with directly-accessible high-bandwidth network adapters, in particular ones supporting RDMA or RoCE (RDMA over Converged Ethernet), which aims to provide IB-like functionality over Ethernet.
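To make that concrete, here's a minimal, hypothetical boto3 sketch of that pattern: create a cluster placement group, then launch instances into it with an EFA network interface attached. The AMI, instance type, subnet, and security group IDs are placeholders, not anything from the linked docs.

    import boto3

    ec2 = boto3.client("ec2")

    # A "cluster" placement group packs instances into the same
    # low-latency network segment inside one Availability Zone.
    ec2.create_placement_group(GroupName="hpc-demo", Strategy="cluster")

    # Launch two EFA-capable instances into that group; the EFA interface
    # is what exposes the RDMA-style, OS-bypass networking path.
    ec2.run_instances(
        ImageId="ami-xxxxxxxxxxxxxxxxx",      # placeholder AMI
        InstanceType="hpc7a.96xlarge",        # placeholder EFA-capable type
        MinCount=2,
        MaxCount=2,
        Placement={"GroupName": "hpc-demo"},
        NetworkInterfaces=[{
            "DeviceIndex": 0,
            "SubnetId": "subnet-xxxxxxxx",    # placeholder subnet
            "Groups": ["sg-xxxxxxxx"],        # placeholder security group
            "InterfaceType": "efa",           # request an Elastic Fabric Adapter
        }],
    )

The point being: in the cloud you have to ask for locality and for the fast NIC explicitly; it isn't the default fabric the way it is on an IB-wired cluster.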
IMO, the closest analog to environments like Frontier that you'll find in the cloud is going to be IB-based offerings from Azure HPC ('general' cloud) [5] and specialty-cloud folks like Lambda Labs [6].
[1]: https://docs.nvidia.com/dgx-superpod/reference-architecture-...
[2]: https://news.sherlock.stanford.edu/publications/sherlock-4-0...
[3]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placemen...
[4]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html
[5]: https://azure.microsoft.com/en-us/solutions/high-performance...
[6]: https://lambdalabs.com/nvidia/dgx-systems