One important point that the article doesn't touch on is determinism. rsqrtps is implemented differently on different CPUs, so if reproducible floating-point results are a requirement (there are many use cases for this), you simply cannot use it. Your only remaining option is a fully IEEE-754-compliant algorithm that is guaranteed to behave the same on every CPU that implements IEEE-754 floats, and for that there's still no better approach than the Q_rsqrt idea, of course with some modifications for the "modern age".
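For reference, the Q_rsqrt idea looks roughly like this in modern C (a sketch, assuming IEEE-754 binary32 floats; memcpy replaces the original's pointer cast, which is undefined behaviour today):

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the classic Q_rsqrt bit trick, modernized.
   Assumes float is IEEE-754 binary32. */
static float q_rsqrt(float x) {
    float half = 0.5f * x;
    uint32_t i;
    memcpy(&i, &x, sizeof i);      /* reinterpret float bits as integer */
    i = 0x5f3759df - (i >> 1);     /* magic-constant initial guess */
    float y;
    memcpy(&y, &i, sizeof y);
    y = y * (1.5f - half * y * y); /* one Newton-Raphson refinement step */
    return y;
}
```

Because it only uses integer operations and basic float multiply/subtract, every IEEE-754 machine produces bit-identical results.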
> rsqrtps is implemented differently on different CPUs
That applies to x86 for sure. But the designers of ARM and RISC-V had the foresight to standardise the implementation of rsqrt to make it deterministic ... on each respective platform. But on either, the precision is only 7 bits.
> for that there's still no better approach than using the Q_rsqrt idea
Or you can just take a square root and divide; these are (partially-)pipelined operations on modern CPUs, with _much_ shorter latency than they had when "fast inverse square root" was a thing.
It still has a niche, but that niche is very, very narrow today.
This is a good point; I updated the article to include a comparison where the naive method uses only standardized floating-point operations. When not using -funsafe-math-optimizations, the compiler emits sqrtps followed by divps (sqrtps appears to implement IEEE-754 sqrt).
In this case, the Q_rsqrt actually seems to provide a 2-4x speedup compared to the reproducible naive method.
> Your only remaining option is to use a totally IEEE-754 compliant algorithm that is guaranteed to work the same on every CPU that implements IEEE-754 floats
OK, but doesn't Intel internally use 80 bits of precision for float64 computations? If that's the case, you can't even guarantee that float64 multiplication and addition behave the same on, say, x86 vs ARM.
I wonder to what extent the Newton-Raphson strategy plays nicer with big fancy reordering/pipelining/superscalar chips. It has more little instructions to shuffle around, so my gut says it should be beneficial, but the gut can be misleading for this kind of stuff.
Also,
-funsafe-math-optimizations
Fun, safe math optimizations should be turned on by default! ;)
If you have to do this in software, Goldschmidt's algorithm parallelizes a lot better than Newton-Raphson, but isn't always general-purpose (IIRC). It uses a multiplicative update rule instead of an addition like NR. Division, square root, and inverse square root all use that algorithm under the hood (at least in AMD and IBM processors).
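A rough software sketch of Goldschmidt's inverse-square-root iteration (seeded here with the Q_rsqrt magic constant for convenience; a hardware implementation would use a lookup table, and the seed choice is my assumption, not something from the thread):

```c
#include <stdint.h>
#include <string.h>

/* Sketch of Goldschmidt's rsqrt iteration. The three updates in each
   step (r, x, h) are multiplies/adds with little serial dependency,
   which is why it parallelizes better than Newton-Raphson. */
static float goldschmidt_rsqrt(float s) {
    /* Rough seed, ~few-bit accuracy (magic constant used for brevity). */
    uint32_t i;
    memcpy(&i, &s, sizeof i);
    i = 0x5f3759df - (i >> 1);
    float y;
    memcpy(&y, &i, sizeof y);

    float x = s * y;    /* converges to sqrt(s)       */
    float h = 0.5f * y; /* converges to 0.5/sqrt(s)   */
    for (int n = 0; n < 3; n++) {
        float r = 0.5f - x * h; /* residual                 */
        x = x + x * r;          /* multiplicative update    */
        h = h + h * r;
    }
    return 2.0f * h;            /* 1/sqrt(s) */
}
```

Note that x converges to sqrt(s) at the same time, which is why the same datapath can serve sqrt, rsqrt, and (with a different setup) division.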
You also need to randomise the object code location across multiple builds.
I could not give a better explanation than Emery Berger does in this talk: https://youtu.be/r-TLSBdHe1A?t=10m41s Key quote: “Layout changes can shift performance by +/-40%”
Not sure it matters so much on small in-cache programs, but worth thinking about when profiling larger programs.
> It is not possible, in a normal distribution, to be multiple times the standard deviation away from the mean. However, it happens much more easily with a log normal.
Shouldn't that read "it is not _impossible_"? Otherwise it is quite wrong: it's a Gaussian probability distribution, so you certainly can be several standard deviations away from the mean, admittedly with smaller and smaller probability.
A colleague and I were once discussing the fast inverse square root and joked that we need to make a (neural net) activation function that uses an inverse square root as an excuse to use the fast inverse square root. At any rate, I did come up with an activation function that is very similar to Swish/GELU but uses an inverse square root:
Haha, I remember that the internal name for it was DoomSwish for a while, or something like that, due to the fast inverse square root being often (I think wrongly?) attributed to John Carmack. But it was used in Quake anyway, right? XD
It's a shame that -fno-math-errno isn't the default. It pessimizes many common operations, as can be seen in the article, and even a simple sqrt() call is affected.
One can see that with -fno-math-errno the function can be entirely inlined. But if errno is enabled, it has to first check whether the input is negative, and in that case call the libc sqrt() function which sets errno.
As for why it's not the default, I guess it's historical. The errno approach was common back in the days before IEEE 754, with its exception model, provided another way.
I believe you are correct, thanks for pointing that out. clock_gettime gives us ns and I multiply it by 1000 which gives us ps. Not sure how I initially got it to be fs. Updated it now.
Findecanor | 2 years ago:
Recent AMD and Intel x86-64 processors use 11 input bits, and the results are similar enough that only 22 results over the whole input range are different. Source: <https://robert.ocallahan.org/2021/09/emulating-amd-rsqrtss-e...>
stephc_int13 | 2 years ago:
I could not write a better explanation than Daniel Lemire does on his blog:
https://lemire.me/blog/2023/04/06/are-your-memory-bound-benc...
microtonal | 2 years ago:
https://twitter.com/danieldekok/status/1484898130441166853?s...
It's quite a bit cheaper, because it doesn't need expensive elementary functions like exp or erf.
(We did add it to Thinc: https://thinc.ai/docs/api-layers#dish)
jabl | 2 years ago:
#include <math.h>
double mysqrt(double d) { return sqrt(d); }
with and without -fno-math-errno: https://godbolt.org/z/bvrz9r8ce
For glibc's errno behaviour, see: https://man7.org/linux/man-pages/man7/math_error.7.html
Musl libc, being newer, does away with that and never sets errno in libm functions: https://wiki.musl-libc.org/mathematical-library.html
geertj | 2 years ago:
I believe the benchmark program outputs the wrong units? It should be picoseconds (ps) instead of femtoseconds (fs)?