Std: Clamp generates less efficient assembly than std:min(max,std:max(min,v))

cmovq|2 years ago

Depending on the order of the arguments to min max you'll get an extra move instruction [1]:

std::min(max, std::max(min, v));

        maxsd   xmm0, xmm1
        minsd   xmm0, xmm2

std::min(std::max(v, min), max);

        maxsd   xmm1, xmm0
        minsd   xmm2, xmm1
        movapd  xmm0, xmm2

For min/max on x86 if any operand is NaN the instruction copies the second operand into the first. So the compiler can't reorder the second case to look like the first (to leave the result in xmm0 for the return value).

The reason for this NaN behavior is that minsd is implemented to look like `(a < b) ? a : b`, where if any of a or b is NaN the condition is false, and the expression evaluates to b.

Possibly std::clamp has the comparisons ordered like the second case?

[1]: https://godbolt.org/z/coes8Gdhz

x1f604|2 years ago

I think the libstdc++ implementation does indeed have the comparisons ordered in the way that you describe. I stepped into the std::clamp() call in gdb and got this:

    ┌─/usr/include/c++/12/bits/stl_algo.h──────────────────────────────────────────────────────────────────────────────────────
    │     3617     \*  @pre `_Tp` is LessThanComparable and `(__hi < __lo)` is false.
    │     3618     \*/
    │     3619    template<typename _Tp>
    │     3620      constexpr const _Tp&
    │     3621      clamp(const _Tp& __val, const _Tp& __lo, const _Tp& __hi)
    │     3622      {
    │     3623        __glibcxx_assert(!(__hi < __lo));
    │  >  3624        return std::min(std::max(__val, __lo), __hi);
    │     3625      }
    │     3626

miohtama|2 years ago

Sir cmovq, you have deserved your username.

lebubule|2 years ago

Yes, I arrived at the same conclusion.

The various code snippets in the article don't compute the same "function". The order between the min() and max() matters even when done "by hand". This is apparent when min is greater than max as the results differ in the choice of the boundaries.

Funny that for such simple functions the discussion can become quickly so difficult/interesting.

Some toying around with the various implementations in C [1]:

[1]: https://godbolt.org/z/d4Tcdojx3

vitorsr|2 years ago

It seems that this is close to the most likely reason. See also:

https://godbolt.org/z/q7e3MrE66

camblomquist|2 years ago

I did a double take on this because I wrote a blog post about this topic a few months ago and came to a very different conclusion, that the results are effectively identical on clang and gcc is just weird.

Then I realized that I was writing about compiling for ARM and this post is about x86. Which is extra weird! Why is the compiler better tuned for ARM than x86 in this case?

Never did figure out what gcc's problem was.

https://godbolt.org/z/Y75qnTGdr

frozenport|2 years ago

Try switching to -Ofast it produces different ASM

celegans25|2 years ago

On gcc 13, the difference in assembly between the min(max()) version and std::clamp is eliminated when I add the -ffast-math flag. I suspect that the two implementations handle one of the arguments being NaN a bit differently.

https://gcc.godbolt.org/z/fGaP6roe9

I see the same behavior on clang 17 as well

https://gcc.godbolt.org/z/6jvnoxWhb

gumby|2 years ago

You (celegans25) probably know this but here is a PSA that -ffast-math is really -finaccurate-math. The knowledgeable developer will know when to use it (almost never) while the naive user will have bugs.

unknown|2 years ago

[deleted]

jeffbee|2 years ago

Clang generates the shortest of these if you target sandybridge, or x86-64-v3, or later. The real article that's buried in this article is that compilers target k8-generic unless you tell them otherwise, and the features and cost model of opteron are obsolete.

Always specify your target.

josephg|2 years ago

Yep. Adding "-C target-cpu=native" to rustc on my desktop computer consistently gets a ~10-15% performance boost compared to the default target. The default target is extremely conservative. As far as I can tell, it doesn't take advantage of any CPU features added in the last 20 years. (The k8 came out in 2003.)

x1f604|2 years ago

Even with -march=x86-64-v4 at -O3 the compiler still generates fewer lines of assembly for the incorrect clamp compared to the correct clamp for this "realistic" code:

https://godbolt.org/z/hd44KjMMn

svantana|2 years ago

I'm a heavy std::clamp user, but I'm considering replacing it with min+max because of the uncertainty about what will happen when lo > hi. On windows it triggers an assertion, while other platforms just do a min+max in one or the other order. Of course, this should never happen but can be difficult to guarantee when the limits are derived from user inputs.

lpapez|2 years ago

> Of course, this should never happen but can be difficult to guarantee when the limits are derived from user inputs.

Sounds to me like you are missing a validation step before calling your logic. When it comes to parsing, trusting user input is a recipe for disaster in the form of buffer overruns and potential exploits.

As they used to say in the Soviet Union: "trust, but verify".

lifthrasiir|2 years ago

Pretty sure that their behaviors on NaN arguments will also differ.

wegfawefgawefg|2 years ago

I hope they fix it. Thats quite a basic functional unit for it to be a footgun all on its own.

dahart|2 years ago

Will min+max help you? What do you expect the answer to be when lo > hi? What certainty should std::clamp have? Using min+max on a number that’s between lo+hi when lo>hi will always return either lo or hi, and never your input value.

tambre|2 years ago

Both recent GCC and Clang are able to generate the most optimal version for std::clamp() if you add something like -march=znver1, even at -O1 [0]. Interesting!

[0] https://godbolt.org/z/YsMMo7Kjz

GrumpySloth|2 years ago

But then it uses AVX instructions. (You can replace -march=znver1 with just -mavx.)

When AVX isn’t enabled, the std::min + std::max example still uses fewer instructions. Looks like a random register allocation failure.

x1f604|2 years ago

Even with -march=znver1 at -O3 the compiler still generates fewer lines of assembly for the incorrect clamp compared to the correct clamp for this "realistic" code:

https://godbolt.org/z/WMKbeq5TY

planede|2 years ago

On a somewhat similar note, don't use std::lerp if you don't need its strong guarantees around rounding (monotonicity among other things).

https://godbolt.org/z/hzrG3s6T4

CountHackulus|2 years ago

I see that the assembly instructions are different, but what's the performance difference? Personally, I don't care about the number of instructions used, as long as it's faster. With things like store forwarding and register files, a lot of those movs might be treated as noops.

superjan|2 years ago

The only times I worry about min/max/clamp performance is when I need to do thousands or millions of them. And in that case, I’d suggest intrinsics. You get to choose how NaN is handled, it’s branchless, and you can do multiple in parallel.

It feels backwards that you need to order your comparisons so as to generate optimal assembly.

nickysielicki|2 years ago

https://bugs.llvm.org/show_bug.cgi?id=47271

This specific test (click the godbolt links) does not reproduce the issue.

unknown|2 years ago

[deleted]

fooker|2 years ago

If you benchmark these, you'll likely find the version with the jump edges out the one with the conditional instruction in practice.

jeffbee|2 years ago

FYI. https://quick-bench.com/q/sK9t9GoFDRkx9XxloUUbB8Q3ht4'

Using this microbenchmark on an Intel Sapphire Rapids CPU, compiled with march=k8 to get the older form, takes ~980ns, while compiling with march=native gives ~570ns. It's not at all clear that the imperfection the article describes is really relevant in context, because the compiler transforms this function into something quite different.

pclmulqdq|2 years ago

Compilers often under-generate conditional instructions. They implicitly assume (correctly) that most branches you write are 90/10 (ie very predictable), not 50/50. The branches that actually are 50/50 suffer from being treated as being 90/10.

svantana|2 years ago

That must depend on the platform and the surrounding code, no?

142 comments