Yeah, OP's comment makes it seem like they're building racks of RTX 4090s, when this isn't remotely true. Tensor Core performance on data-center-class devices is far different from consumer ones.
RTX 4090s are terrible for this task. Off the top of my head:
- VRAM (obviously). Isn't that where the racks come in? Not really. Nvidia famously removed something as basic as NVLink between two cards going from the 3090 to the 4090. When it comes to bandwidth between cards (crucial), even 16 lanes of PCIe 4 isn't fast enough. And once you start talking about "racks", unless you're running server-grade CPUs (adding to cost vs power vs density vs perf) you're not going to have nearly enough PCIe lanes to get very far. Even P2P over PCIe requires a hack geohot developed[0], which is, umm, less than confidence-inspiring given what you'd lay out ($$$) in hardware, space, cooling, and power. The lack of ECC is a real issue as well.
- Form factor. Remember PCIe lanes, etc.? The RTX 4090 is a ~three-slot beast when air-cooled, and rigging up something like the dual-slot water-cooled 4090s I have at scale is another challenge altogether... How are people going to wire this up? What do the enclosures/racks/etc. look like? This isn't like crypto mining, where cheap 1x PCIe risers can be used without dramatically limiting performance to the point of uselessness.
- Performance. As the grandparent comment noted, 4090s are not designed for this workload. In typical training usage I see them as 10-20% faster than an RTX 3090 at much higher cost. Compared to my SXM H100 they're ridiculously slow.
- Market segmentation. Nvidia really knows what they're doing here... There are all kinds of limitations you run into from how the hardware is designed (Tensor Core performance for inference, especially).
- Issues at scale. Look at the Meta post - their biggest issues are things that are dramatically worse with consumer cards like the RTX 4090, especially when you're running some kind of goofy PCIe cabling (like risers).
- Power. No matter what power limiting you employ, an RTX 4090 has a pretty bad power/performance ratio for these tasks. The card isn't fundamentally designed for them - it's designed to run screaming for a few hours a day so gamers can push as many FPS at high res as possible. Training, inference, etc. are a different beast, and the performance-vs-power ratio is terrible compared to an A100/H100. Then there are the physical cabling, PSU, etc. issues. Yes, miners had hacks for this as well, but it's yet another problem.
- Fan design. There isn't a single "blower"-style RTX 4090 on the market. There was a dual-slot RTX 3090 at one point (I have a bunch of them), but Nvidia made Gigabyte pull it from the market because people were using them for this. Figuring out an air-cooling setup with the fan and cooler designs of the available RTX 4090 cards sounds like a complete nightmare...
- Licensing issues. Again, laying out the $$$ for a deployment that almost certainly violates the Nvidia EULA is a risky investment.
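To put the interconnect point in rough numbers - here's a back-of-envelope sketch. The link speeds are nominal spec figures and the 7B-parameter model is a hypothetical example, not a measurement of anything:

```python
# Back-of-envelope: per-step gradient all-reduce time over different links.
# Nominal per-direction bandwidths (assumptions, not benchmarks):
PCIE4_X16_GBPS = 32      # PCIe 4.0 x16, ~32 GB/s per direction
NVLINK_H100_GBPS = 450   # H100 SXM NVLink, ~450 GB/s per direction

def allreduce_seconds(param_count, bytes_per_param, link_gbps, n_gpus):
    """Ring all-reduce moves roughly 2*(n-1)/n of the gradient bytes
    through each GPU's links every optimizer step."""
    grad_gb = param_count * bytes_per_param / 1e9
    traffic_gb = 2 * (n_gpus - 1) / n_gpus * grad_gb
    return traffic_gb / link_gbps

params = 7e9  # hypothetical 7B-parameter model, fp16 gradients (2 bytes)
for name, bw in [("PCIe 4.0 x16", PCIE4_X16_GBPS),
                 ("NVLink (H100 SXM)", NVLINK_H100_GBPS)]:
    t = allreduce_seconds(params, 2, bw, 3)
    print(f"{name}: ~{t:.2f} s of gradient traffic per step")
```

Over PCIe you're looking at more than half a second of pure communication per step; over NVLink it's an order of magnitude less - and that's before the P2P hack, riser, and Ethernet problems above.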
Three RTX 4090s (nine slots) to get "only" 72GB of VRAM, talking over PCIe, eating 48 PCIe lanes, going multi-node over sloooow ethernet (which hits the CPU - slower still, and yet more power), while drawing what likely ends up around ~900 watts (power-limited) for significantly reduced throughput and less VRAM, is ridiculous. Scaling the kind of ethernet you'd need for this (100 gig) comes at a very high per-port cost, and with all of these issues the performance would still be terrible.
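A quick sanity check on that tally, using only the figures from above (the 300W per-card power limit is my assumption for how ~900W total is reached):

```python
# Totals for the hypothetical three-4090 node described above.
N_GPUS = 3
VRAM_GB_PER_4090 = 24       # 24GB per card
SLOTS_PER_4090 = 3          # air-cooled cards are ~triple-slot
PCIE_LANES_PER_4090 = 16    # x16 per card
WATTS_POWER_LIMITED = 300   # assumed per-card power limit

total_vram = N_GPUS * VRAM_GB_PER_4090      # 72 GB
total_slots = N_GPUS * SLOTS_PER_4090       # 9 slots
total_lanes = N_GPUS * PCIE_LANES_PER_4090  # 48 PCIe lanes
total_watts = N_GPUS * WATTS_POWER_LIMITED  # ~900 W

print(total_vram, total_slots, total_lanes, total_watts)
```

Nine slots and 48 lanes already exceed what most non-server boards can physically provide, which is the whole point.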
I'm all for creativity but deploying "racks" of 4090s for AI tasks is (frankly) flat-out stupid.
mike_d|1 year ago
Hell, Microsoft is renting GPUs from Oracle Cloud to get enough capacity to run Bing.
kkielhofner|1 year ago
[0] - https://github.com/tinygrad/open-gpu-kernel-modules