For min/max on x86 if any operand is NaN the instruction copies the second operand into the first. So the compiler can't reorder the second case to look like the first (to leave the result in xmm0 for the return value).
The reason for this NaN behavior is that minsd is implemented to look like `(a < b) ? a : b`, where if any of a or b is NaN the condition is false, and the expression evaluates to b.
Possibly std::clamp has the comparisons ordered like the second case?
I think the libstdc++ implementation does indeed have the comparisons ordered in the way that you describe. I stepped into the std::clamp() call in gdb and got this:
The various code snippets in the article don't compute the same "function".
The order between the min() and max() matters even when done "by hand".
This is apparent when min is greater than max as the results differ in the choice of the boundaries.
Funny that for such simple functions the discussion can become quickly so difficult/interesting.
Some toying around with the various implementations in C [1]:
I did a double take on this because I wrote a blog post about this topic a few months ago and came to a very different conclusion, that the results are effectively identical on clang and gcc is just weird.
Then I realized that I was writing about compiling for ARM and this post is about x86. Which is extra weird! Why is the compiler better tuned for ARM than x86 in this case?
On gcc 13, the difference in assembly between the min(max()) version and std::clamp is eliminated when I add the -ffast-math flag. I suspect that the two implementations handle one of the arguments being NaN a bit differently.
You (celegans25) probably know this but here is a PSA that -ffast-math is really -finaccurate-math. The knowledgeable developer will know when to use it (almost never) while the naive user will have bugs.
Clang generates the shortest of these if you target sandybridge, or x86-64-v3, or later. The real article that's buried in this article is that compilers target k8-generic unless you tell them otherwise, and the features and cost model of opteron are obsolete.
Yep. Adding "-C target-cpu=native" to rustc on my desktop computer consistently gets a ~10-15% performance boost compared to the default target. The default target is extremely conservative. As far as I can tell, it doesn't take advantage of any CPU features added in the last 20 years. (The k8 came out in 2003.)
Even with -march=x86-64-v4 at -O3 the compiler still generates fewer lines of assembly for the incorrect clamp compared to the correct clamp for this "realistic" code:
I'm a heavy std::clamp user, but I'm considering replacing it with min+max because of the uncertainty about what will happen when lo > hi. On windows it triggers an assertion, while other platforms just do a min+max in one or the other order. Of course, this should never happen but can be difficult to guarantee when the limits are derived from user inputs.
> Of course, this should never happen but can be difficult to guarantee when the limits are derived from user inputs.
Sounds to me like you are missing a validation step before calling your logic. When it comes to parsing, trusting user input is a recipe for disaster in the form of buffer overruns and potential exploits.
As they used to say in the Soviet Union: "trust, but verify".
Will min+max help you? What do you expect the answer to be when lo > hi? What certainty should std::clamp have? Using min+max on a number that’s between lo+hi when lo>hi will always return either lo or hi, and never your input value.
Both recent GCC and Clang are able to generate the most optimal version for std::clamp() if you add something like -march=znver1, even at -O1 [0]. Interesting!
Even with -march=znver1 at -O3 the compiler still generates fewer lines of assembly for the incorrect clamp compared to the correct clamp for this "realistic" code:
I see that the assembly instructions are different, but what's the performance difference? Personally, I don't care about the number of instructions used, as long as it's faster. With things like store forwarding and register files, a lot of those movs might be treated as noops.
The only times I worry about min/max/clamp performance is when I need to do thousands or millions of them. And in that case, I’d suggest intrinsics. You get to choose how NaN is handled, it’s branchless, and you can do multiple in parallel.
It feels backwards that you need to order your comparisons so as to generate optimal assembly.
Using this microbenchmark on an Intel Sapphire Rapids CPU, compiled with march=k8 to get the older form, takes ~980ns, while compiling with march=native gives ~570ns. It's not at all clear that the imperfection the article describes is really relevant in context, because the compiler transforms this function into something quite different.
Compilers often under-generate conditional instructions. They implicitly assume (correctly) that most branches you write are 90/10 (ie very predictable), not 50/50. The branches that actually are 50/50 suffer from being treated as being 90/10.
cmovq|2 years ago
std::min(max, std::max(min, v));
std::min(std::max(v, min), max); For min/max on x86 if any operand is NaN the instruction copies the second operand into the first. So the compiler can't reorder the second case to look like the first (to leave the result in xmm0 for the return value).The reason for this NaN behavior is that minsd is implemented to look like `(a < b) ? a : b`, where if any of a or b is NaN the condition is false, and the expression evaluates to b.
Possibly std::clamp has the comparisons ordered like the second case?
[1]: https://godbolt.org/z/coes8Gdhz
x1f604|2 years ago
miohtama|2 years ago
lebubule|2 years ago
The various code snippets in the article don't compute the same "function". The order between the min() and max() matters even when done "by hand". This is apparent when min is greater than max as the results differ in the choice of the boundaries.
Funny that for such simple functions the discussion can become quickly so difficult/interesting.
Some toying around with the various implementations in C [1]:
[1]: https://godbolt.org/z/d4Tcdojx3
vitorsr|2 years ago
https://godbolt.org/z/q7e3MrE66
camblomquist|2 years ago
Then I realized that I was writing about compiling for ARM and this post is about x86. Which is extra weird! Why is the compiler better tuned for ARM than x86 in this case?
Never did figure out what gcc's problem was.
https://godbolt.org/z/Y75qnTGdr
frozenport|2 years ago
celegans25|2 years ago
https://gcc.godbolt.org/z/fGaP6roe9
I see the same behavior on clang 17 as well
https://gcc.godbolt.org/z/6jvnoxWhb
gumby|2 years ago
unknown|2 years ago
[deleted]
jeffbee|2 years ago
Always specify your target.
josephg|2 years ago
x1f604|2 years ago
https://godbolt.org/z/hd44KjMMn
svantana|2 years ago
lpapez|2 years ago
Sounds to me like you are missing a validation step before calling your logic. When it comes to parsing, trusting user input is a recipe for disaster in the form of buffer overruns and potential exploits.
As they used to say in the Soviet Union: "trust, but verify".
lifthrasiir|2 years ago
wegfawefgawefg|2 years ago
dahart|2 years ago
tambre|2 years ago
[0] https://godbolt.org/z/YsMMo7Kjz
GrumpySloth|2 years ago
When AVX isn’t enabled, the std::min + std::max example still uses fewer instructions. Looks like a random register allocation failure.
x1f604|2 years ago
https://godbolt.org/z/WMKbeq5TY
planede|2 years ago
https://godbolt.org/z/hzrG3s6T4
CountHackulus|2 years ago
superjan|2 years ago
It feels backwards that you need to order your comparisons so as to generate optimal assembly.
nickysielicki|2 years ago
This specific test (click the godbolt links) does not reproduce the issue.
unknown|2 years ago
[deleted]
fooker|2 years ago
jeffbee|2 years ago
Using this microbenchmark on an Intel Sapphire Rapids CPU, compiled with march=k8 to get the older form, takes ~980ns, while compiling with march=native gives ~570ns. It's not at all clear that the imperfection the article describes is really relevant in context, because the compiler transforms this function into something quite different.
pclmulqdq|2 years ago
svantana|2 years ago