She's right about squaring on each iteration, though. And granted, a square root is much more expensive, though done only once. Which method is faster would depend on the number of iterations.
The latency for sqrtss on broadwell is 11 cycles with a throughput of 4, where mul is a latency of 3 and throughput of 1. So, using some concrete numbers, sqrt is more expensive, but not polynomially or even an order of magnitude.
uxcn|10 years ago
JoeAltmaier|10 years ago