top | item 43734992

(no title)

olaulaja | 10 months ago

I'd say `x * a + (x * x) * b + (x * x * x) * c` is likely faster (subject to the compiler being reasonable) than `((((x * c) * x + b) * x) + a) * x` because it has a shorter longest instruction dependency chain. Add/Mul have higher throughput than latency, so the latency chain dominates performance and a few extra instructions will just get hidden away by instruction level parallelism.

Also x/6 vs x(1/6) is not as bad as it used to be, fdiv keeps getting faster. On my zen2 its 10 cycles latency and 0.33/cycle throughput for (vector) div, and 3 latency and 2/cycle throughput for (vector) add. So about 1/3 speed, worse if you have a lot of divs and the pipeline fills up. Going back to Pentium the difference is ~10x and you don't get to hide it with instruction parallelism.

* The first expression has a chain of 4 instructions that cannot be started before the last one finished `(((x * x) * x) * c) + the rest` vs the entire expression being a such a chain in the second version. Using fma instructions changes this a bit, making all the adds in both expressions 'free' but this changes precision and needs -ffast-math or such, which I agree is dangerous and generally ill advised.

discuss

No comments yet.