jcranmer | 11 days ago
I'm not a hardware guy, but an explanation I've seen from someone who is says that it doesn't take much extra hardware to give a 2×f32 FMA unit the ability to do 1×f64. You already have all of the per-bit logic; you mostly just need to add an extra control line to make a few carries propagate. So the size overhead of adding FP64 to the SIMD units is more like 10-50%, not 100-300%.
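As a rough software analogy for the "reuse the per-bit logic and let the carries propagate" idea, here is a minimal sketch of a wide integer multiply built only from narrow partial products. It is not a model of any real FMA unit: the function name `mul_wide_from_halves` and the 27-bit split (chosen only because two 27-bit halves cover a 53-bit f64 significand) are assumptions made for illustration.

```python
def mul_wide_from_halves(a: int, b: int, half_bits: int = 27) -> int:
    """Multiply two unsigned integers of up to 2*half_bits bits
    using only half_bits x half_bits partial products.

    Hypothetical illustration only, not an actual hardware design.
    """
    mask = (1 << half_bits) - 1

    # Split each operand into a low and a high half.
    a_lo, a_hi = a & mask, a >> half_bits
    b_lo, b_hi = b & mask, b >> half_bits

    # Four narrow products, the kind of work narrow multipliers already do.
    ll = a_lo * b_lo
    lh = a_lo * b_hi
    hl = a_hi * b_lo
    hh = a_hi * b_hi

    # Summing the shifted partial products is where carries must cross
    # the boundary between the narrow lanes.
    return ll + ((lh + hl) << half_bits) + (hh << (2 * half_bits))


if __name__ == "__main__":
    x = (1 << 53) - 1        # largest 53-bit significand
    y = (1 << 52) + 12345    # another 53-bit value
    print(mul_wide_from_halves(x, y) == x * y)
```

The sketch needs four narrow products for one wide one, so it says nothing about the actual area cost; it only shows that the wide result is the narrow results plus carry propagation across the lane boundary.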
adrian_b | 11 days ago
Even so, the multipliers and shifters occupy only a small fraction of the total area, a fraction that is smaller than their gate count implies, because they have very regular layouts.
A reduction from the ideal 1:2 FP64/FP32 throughput to 1:4, or in the worst case to 1:8, should be enough to make the additional cost of supporting FP64 negligible, while still keeping the throughput of a GPU competitive with a CPU.
The current NVIDIA and AMD GPUs cannot compete in FP64 performance per dollar or per watt with Zen 5 Ryzen 9 CPUs. Only the Intel B580 beats every CPU in FP64 performance per dollar, though its total performance is exceeded by CPUs like the 9950X.