I don't think there is a generally optimal design. There are cons and pros to using the same homogeneous FMAs units for adds, multiplies and fmas, even at the cost of making adds slower (simpler design, and having all instructions of the same latency greatly simplifies scheduling). IIRC intel cycled through 4 cycles fma, add and mul, then to 4 cycles add and mul and 5 cycles fmas, then with a dedicated 3 cycles add.
The optimal design depends a lot on the rest of the microarchitecture, the loads the core is being optimized for, the target frequency, the memory latency, etc.
Adds are cheaper only for fixed-point computations. Floating point addition needs to denormalize one of its' arguments, perform an (integer) addition and then normalize the result.
Usually FP adds take a cycle or two longer than FP multiplication.
gpderetta|1 year ago
The optimal design depends a lot on the rest of the microarchitecture, the loads the core is being optimized for, the target frequency, the memory latency, etc.
bonzini|1 year ago
thesz|1 year ago
Usually FP adds take a cycle or two longer than FP multiplication.