
The Performance Impact of C++'s `final` Keyword

247 points| hasheddan | 1 year ago |16bpp.net

267 comments

[+] mgaunard|1 year ago|reply
What final enables is devirtualization in certain cases. The main advantage of devirtualization is that it is necessary for inlining.

Inlining has other requirements as well -- LTO pretty much covers it.

The article doesn't have sufficient data to tell whether the testcase is built in such a way that any of these optimizations can happen or is beneficial.

[+] chipdart|1 year ago|reply
> What final enables is devirtualization in certain cases. The main advantage of devirtualization is that it is necessary for inlining.

I think that enabling inlining is just one of the indirect consequences of devirtualization, and perhaps one that is largely irrelevant for performance improvements.

The whole point of devirtualization is eliminating the need to resort to pointer dereferencing when calling virtual member functions. The defining trait of a polymorphic class is its use of a vtable, which requires an indirect lookup for each and every virtual call.

In classes with larger inheritance chains, you can easily have more than one pointer dereference taking place before you call a virtual member function.

Once a class is final, none of that is required anymore. When a member is called, no dereferencing takes place.

Devirtualization helps performance because you are able to benefit from inheritance without paying a performance penalty for it. Without the final keyword, a performance-oriented project would need to be architected to not use inheritance at all, or at the very least not in code on the hot path, because it sneaks gratuitous pointer dereferences all over the place, which require running extra operations and have a negative impact on caching.

The whole purpose of the final keyword is that compilers can then eliminate the pointer dereferencing used for virtual member calls. What normally stops them from applying this optimization is that they have no way of knowing whether the class will be inherited from somewhere else, with its virtual members overridden or its parent-class implementations invoked through the derived type.

With the introduction of the final keyword, you are now able to tell the compiler "from here on, this is exactly what you get" and the compiler can trim out anything loose.

[+] Negitivefrags|1 year ago|reply
See, this is why I find this odd.

Is there a theory as to how devirtualisation could hurt performance?

[+] i80and|1 year ago|reply
If you already have LTO, can't the compiler determine this information for devirtualization purposes on its own?
[+] bdjsiqoocwk|1 year ago|reply
Whats devirtualization in C++?

Funny how things work. From working with Julia I've built a good intuition for guessing when functions will be inlined. And yet, I'd never heard the word devirtualization until now.

[+] tombert|1 year ago|reply
I don't do much C++, but I have definitely found that engineers will just assert that something is "faster" without any evidence to back that up.

Quick example: I got in an argument with someone a few years ago who claimed, in C#, that a `switch` was better than an `if(x==1) elseif(x==2)...` because switch was "faster", and rejected my PR. I mentioned that that doesn't appear to be true; we went back and forth until I did a compile-then-decompile of a minimal test with equality-based ifs, and showed that the compiler actually converts equality-based ifs to `switch` behind the scenes. The guy accepted my PR after that.

But there's tons of stuff like this in CS, and I kind of blame professors for a lot of it [1]. A large part of becoming a decent engineer [2] for me was learning to stop trusting what professors taught me in college. Most of what they said was fine, but you can't assume that; what they tell you could be out of date, or simply never correct to begin with, and as far as I can tell you have to always test these things.

It doesn't help that a lot of these "it's faster" arguments are often reductive, because they're only faster in extremely minimal tests. Sometimes a microbenchmark will show that something is faster, and there's value in that, but that difference can also be a tiny percentage of the total program; compilers are obscenely good at optimizing nowadays, it can be difficult to determine when something will be optimized, and your assertion that something is "faster" might not actually be true in a non-trivial program.

This is why I don't really like doing any kind of major optimizations before the program actually works. I try to keep the program in a reasonable Big-O and I try and minimize network calls cuz of latency, but I don't bother with any kind of micro-optimizations in the first draft. I don't mess with bitwise, I don't concern myself on which version of a particular data structure is a millisecond faster, I don't focus too much on whether I can get away with a smaller sized float, etc. Once I know that the program is correct, then I benchmark to see if any kind of micro-optimizations will actually matter, and often they really don't.

[1] That includes me up to about a year ago.

[2] At least I like to pretend I am.

[+] jandrewrogers|1 year ago|reply
A significant part of it is that what engineers believe was effectively true at one time. They simply haven't revisited those beliefs or verified their relevance in a long time. It isn't a terrible heuristic for life in general to assume that what worked ten years ago will work today. The rate at which the equilibriums shift due to changes in hardware and software environments when designing for system performance is so rapid that you need to make a continuous habit of checking that your understanding of how the world works maps to reality.

I've solved a lot of arguments with godbolt and simple performance tests. Some topics are recurring themes among software engineers e.g.:

- compilers are almost always better at micro-optimizations than you are

- disk I/O is almost never a bottleneck in competent designs

- brute-force sequential scans are often optimal algorithms

- memory is best treated as a block device

- vectorization can offer large performance gains

- etc...

No one is immune to this. I am sometimes surprised at the extent to which assumptions are no longer true when I revisit optimization work I did 10+ years ago.

Most performance these days is architectural, so getting the initial design right often has a bigger impact than micro-optimizations and localized Big-O tweaks. You can always go back and tweak algorithms or codegen later but architecture is permanent.

[+] wvenable|1 year ago|reply
In my opinion, the only things that really matter are algorithmic complexity and readability. And even algorithmic complexity is usually only an issue at certain scales. Whether or not an 'if' is faster than a 'switch' is the micro of micro-optimizations -- you'd better have a good reason to care. The question I would have for you is: was your bunch of ifs more readable than a switch would have been?
[+] leetcrew|1 year ago|reply
agreed, especially in cases like this. final is primarily a way to prohibit overriding methods and extending classes, and it indicates to the reader that they should not be doing this. use it when it makes conceptual sense.

that said, c++ is usually a language you use when you care about performance, at least to an extent. it's worth understanding features like nrvo and rewriting functions to allow the compiler to pick the optimization if it doesn't hurt readability too much.

[+] BurningFrog|1 year ago|reply
Even if one of these constructs is faster it doesn't matter 99% of the time.

Writing well structured readable code is typically far more important than making it twice as fast. And those times can rarely be predicted beforehand, so you should mostly not worry about it until you see real performance problems.

[+] klyrs|1 year ago|reply
> A large part of becoming a decent engineer [2] for me was learning to stop trusting what professors taught me in college

When I was taught about performance, it was all about benchmarking and profiling. I never needed to trust what my professors taught, because they taught me to dig in and find the truth for myself. This was taught alongside the big-O stuff, with several examples where "fast" algorithms are slower on small inputs.

[+] jollyllama|1 year ago|reply
I've encountered similar situations before. It's insane to me when people hold up PRs over that kind of thing.
[+] ot1138|1 year ago|reply
>I don't do much C++, but I have definitely found that engineers will just assert that something is "faster" without any evidence to back that up.

Very true, though there is one case where one can be highly confident that this is the case: code elimination.

You can't get any faster than not doing something in the first place.

[+] KerrAvon|1 year ago|reply
> `if(x==1) elseif(x==2)...` because switch was "faster" and rejected my PR

Yeah, that's never been true. Old compilers would often compile a switch to __slower__ code because they'd tend to always go to a jump table implementation.

A better reason to use the switch is because it's better style in C-like languages. Using an if statement for that sort of thing looks like Python; it makes the code harder to maintain.

[+] mynameisnoone|1 year ago|reply
Yep. "Profiling or it didn't happen." The issue is that it's essentially impossible for even the most neckbearded of us to predict with a high degree of accuracy and precision the performance impact on modern systems of change A vs. change B, due to the many unpredictable variables that are difficult to control, including compiler optimization passes, architecture gotchas (caches, branch misses), and the interplay of quirks on various platforms. Therefore, the irreducible and necessary work of profiling the differences becomes the primary viable path to resolving engineering decision points. Hopefully, LLMs now and in the future will be able to help build out boilerplate in the direction of creating such profiling benchmarks and fixtures.

PS: I'm presently revisiting C++14 because it's the most universal statically-compiled language for quickly answering interview problems. It would be unfair to impose Rust, Go, Elixir, or Haskell on an interviewing software engineer.

[+] trueismywork|1 year ago|reply
There's not yet a culture of writing reproducible benchmarks to gauge these effects.
[+] andrewla|1 year ago|reply
I'm surprised that it has any impact on performance at all, and I'd love to see the codegen differences between the applications.

Mostly the `final` keyword serves as a compile-time assertion. The compiler (sometimes linker) is perfectly capable of seeing that a class has no derived classes, but what `final` assures is that if you attempt to derive from such a class, you will raise a compile-time error.

This is similar to how `inline` works in practice -- rather than providing a useful hint to the compiler (though the compiler is free to treat it that way), it provides an assertion that if you do non-inlinable operations (e.g. non-tail recursion), the compiler can flag that.

All of this is to say that `final` can speed up runtimes -- but it does so by forcing you to organize your code such that the guarantees apply. By using `final` classes, in places where dynamic dispatch can be reduced to static dispatch, you force the developer to not introduce patterns that would prevent static dispatch.

[+] mgraczyk|1 year ago|reply
The main case where I use final and where I would expect benefits (not covered well by the article) is when you are using an external library with pure virtual interfaces that you implement.

For example, the AWS C++ SDK uses virtual functions for everything. When you subclass their classes, marking your classes as final allows the compiler to devirtualize your own calls to your own functions (GCC does this reliably).

I'm curious to understand better how clang is producing worse code in these cases. The code used for the blog post is a bit too complicated for me to look at, but I would love to see some microbenchmarks. My guess is that there is some kind of icache or code size problem, where inlining more produces worse code.

[+] cogman10|1 year ago|reply
Could easily just be a bad optimization pathway.

`final` tells the compiler that nothing extends this class. That means the compiler can theoretically do things like inlining class methods and eliminate virtual method calls (perhaps duplicating the method)?

However, it's quite possible that one of those optimizations makes the code bigger or misaligns things with the cache in unexpected ways. Sometimes a method call can be faster than inlining, especially with hot loops.

All this being said, I'd expect final to offer very little benefit over PGO. Its main value is the constraint it imposes and not the optimization it might enable.

[+] lpapez|1 year ago|reply
> For example, the AWS C++ SDK uses virtual functions for everything. When you subclass their classes, marking your classes as final allows the compiler to devirtualize your own calls to your own functions (GCC does this reliably).

I want to ask, and I sincerely mean no snark, what is the point?

When working with AWS through an SDK your code will spend most of the time waiting on network calls.

What is the point of devirtualizing your function calls to save an indirection when you will be spending several orders of magnitude more time just waiting for the RPC to resolve?

It just doesn't seem like something even worth thinking about at all.

[+] lionkor|1 year ago|reply
The only thing worse than no benchmark is a bad benchmark.

I don't think this really shows what `final` does, not to code generation, not to performance, not to the actual semantics of the program. There is no magic bullet - if putting `final` on every single class would always make it faster, it wouldn't be a keyword, it'd be a compiler optimization.

`final` does one specific thing: It tells a compiler that it can be sure that the given object is not going to have anything derive from it.

[+] Nevermark|1 year ago|reply
'Final' cannot be assumed without complete knowledge of all final linking cases, and knowledge that this will not change in the future. The latter can never be assumed by a compiler without indication.

"In theory" adding 'final' only gives a compiler more information, so should only result in same or faster code.

In practice, some optimizations improve performance for the cases the compiler writers considered more expected or important, with worse outcomes in other, less expected cases. Without a clear understanding of the when and how of these 'final' optimizations, it isn't clear, without benchmarking after the fact, when to use it or not.

That makes any given test much less helpful, since all we know is that 'final' was not helpful in this case. We have no basis to know how general these results are.

But it would be deeply strange if 'final' was generally unhelpful. Informationally it does only one purely helpful thing: reduce the number of linking/runtime contexts the compiler needs to worry about.

[+] opticfluorine|1 year ago|reply
Not disagreeing with your point, but it couldn't be a compiler optimization, could it? The compiler isn't able to infer that the class will not be inherited anywhere else, since another compilation unit unknown to the class could inherit.
[+] paulddraper|1 year ago|reply
> `final` does one specific thing: It tells a compiler that it can be sure that the given object is not going to have anything derive from it.

...and the compiler can optimize using that information.

(It could also do the same without the keyword, with LTO.)

[+] akoboldfrying|1 year ago|reply
I would expect "final" to have no effect on this type of code at all. That it does in some cases cause measurable differences I put down to randomly hitting internal compiler thresholds (perhaps one of the inlining heuristics is "Don't inline a function with more than 100 tokens", and the "final" keyword pushes a couple of functions to 101).

Why would I expect no performance difference? I haven't looked at the code, but I would expect that for each pixel, it iterates through an array/vector/list etc. of objects that implement some common interface, and calls one or more methods (probably something called intersectRay() or similar) on that interface. By design, that interface cannot be made final, and that's what counts. Whether the concrete derived classes are final or not makes no difference.

In order to make this a good test of "final", the pointer type of that container should be constrained to a concrete object type, like Sphere. Of course, this means the scene is limited to spheres.

The only case where final can make a difference, by devirtualising a call that couldn't otherwise be devirtualised, is when you hold a pointer to that type, and the object it points at was allocated "uncertainly", e.g., by the caller. (If the object was allocated in the same basic block where the method call later occurs, the compiler already knows its runtime type and will devirtualise the call anyway, even without "final".)

[+] koyote|1 year ago|reply
> (perhaps one of the inlining heuristics is "Don't inline a function with more than 100 tokens", and the "final" keyword pushes a couple of functions to 101).

That definitely is one of the heuristics in MSVC++.

We have some performance critical code and at one point we noticed a slowdown of around ~4% in a couple of our performance tests. I investigated but the only change to that code base involved fixing up an error message (i.e. no logic difference and not even on the direct code path of the test as it would not hit that error).

Turns out that:

    int some_func() {
      if (bad)
        throw std::exception("Error");
    
      return some_int;
    }
Inlined just fine, but after adding more text to the exception error message it no longer inlined, causing the slow-down. You could either fix it with __forceinline or by moving the exception to a function call.
[+] simonask|1 year ago|reply
Actually, the compiler can only implicitly devirtualize under very specific circumstances. For example, it cannot devirtualize if there was previously a non-inlined call through the same pointer.

The reason is placement new. It is legal (given that certain invariants are upheld) in C++ to say `new(this) DerivedClass`, and compilers must assume that each method could potentially have done this, changing the vtable pointer of the object.

The `final` keyword somewhat counteracts this, but even GCC still only opportunistically honors it - i.e. it inserts a check if the vtable is the expected value before calling the devirtualized function, falling back on the indirect call.

[+] ein0p|1 year ago|reply
You should use final to express design intent. In fact I’d rather it were the default in C++, and there was some sort of an opposite (‘derivable’?) keyword instead, but that ship has sailed long time ago. Any measurable negative perf impact should be filed as a bug and fixed.
[+] leni536|1 year ago|reply
C++ doesn't have the fragile base class problem, as members aren't virtual by default. The only concern with unintended inheritance is with polymorphic deletion. "final" on a class definition disables some tricks that you can do with private inheritance.

Having said that, "final" on member functions is great, and I like to see it instead of "override".

[+] josefx|1 year ago|reply
Intent is nice and all that, but I would like a "nonwithstanding" keyword instead that just lets me bypass that kind of "intent" without having to copy paste the entire implementation just to remove a pointless keyword or make a destructor public when I need it.
[+] cesarb|1 year ago|reply
> In fact I’d rather it were the default in C++, and there was some sort of an opposite (‘derivable’?) keyword instead

Kotlin (which uses the equivalent of the Java "final" keyword by default) uses the "open" keyword for that purpose.

[+] jbverschoor|1 year ago|reply
In general, I think things should be strict by default. Way easier to optimize and less error prone.
[+] ndesaulniers|1 year ago|reply
As an LLVM developer, I really wish the author filed a bug report and waited for some analysis BEFORE publishing an article (that may never get amended) that recommends not using this keyword with clang for performance reasons. I suspect there's just a bug in clang.
[+] fransje26|1 year ago|reply
Is there any logical reason why Clang is 50% slower than GCC on Ubuntu?
[+] saagarjha|1 year ago|reply
Bug, misunderstanding, weird edge case…
[+] mastax|1 year ago|reply
Changes in the layout of the binary can have large impacts on the program performance [0] so it's possible that the unexpected performance decrease is caused by unpredictable changes in the layout of the binary between compilations. I think there is some tool which helps ensure layout is consistent for benchmarking, but I can't remember what it's called.

[0]: https://research.facebook.com/publications/bolt-a-practical-...

[+] jeffbee|1 year ago|reply
I profiled this project and there are abundant opportunities for devirtualization. The virtual interface `IHittable` is the hot one. However, the WITH_FINAL define is not sufficient, because the hot call is still virtual. At `hit_object |= _objects[node->object_index()]->hit` I am still seeing ` mov (%rdi),%rax; call *0x18(%rax)` so the application of final here was not sufficient to do the job. Whatever differences are being measured are caused by bogons.
[+] bluGill|1 year ago|reply
I use final more for communication: don't look for deeper derived classes, as there are none. That it results in slower code is an annoying surprise.
[+] alex_smart|1 year ago|reply
One thing that wasn't mentioned in the article that I wish it had covered is the size of the compiled binary with and without final. The only reason I would expect the final version to be slower is that we are emitting more code because of inlining, and that is resulting in a larger portion of instruction cache misses.

Also, now that I think of it, they should have run the code under perf and compared the stats.

[+] magnat|1 year ago|reply
> I created a "large test suite" to be more intensive. On my dev machine it needed to run for 8 hours.

During such long and compute-intensive tests, how are thermal considerations mitigated? Not saying that this was the case here, but I can see how, after saturating all cores for 8 hours, the whole PC might get hot to the point the CPU starts throttling, so when you reboot to the next OS or start another batch, overall performance could be a bit lower.

[+] JackYoustra|1 year ago|reply
I really wish he'd listed all the flags he used. To add on to the flags already listed by some other commenters, `-mcpu` and related flags are really crucial in these microbenchmarks: over such a small change and such a small set of tight loops, you could just be regressing on coincidences in the microarchitecture scheduler vs. higher-level assumptions.
[+] gpderetta|1 year ago|reply
1% is nothing to scoff at. But I suspect that the variability of compilation (specifically quirks of instruction selection, register allocation and function alignment) more than masks any gains.

The clang regression might be explainable by final allowing some additional inlining and clang making a hash of it.

[+] fransje26|1 year ago|reply
I'm actually more worried about Clang being close to 100% slower than GCC on Linux. That doesn't seem right.

I am prepared to believe that there is some performance difference between the two, varying per case, but I would expect a few percent difference, not twice the run time.

[+] lanza|1 year ago|reply
If you're measuring a compiler you need to post the flags and version used. Otherwise the entire experiment is in the noise.
[+] sfink|1 year ago|reply
tldr: sprinkled a keyword around in the hopes that it "does something" to speed things up, tested it, got noisy results but no miraculous speedup.

I started skimming this article after a while, because it seemed to be going into the weeds of performance comparison without ever backing up to look at what the change might be doing. Which meant that I couldn't tell if I was going to be looking at the usual random noise of performance testing or something real.

For `final`, I'd want to at least see if it's changing the generated code by replacing indirect vtable calls with direct or inlined calls. It might be that the compiler is already figuring it out and the keyword isn't doing anything. It might be that the compiler is changing code, but the target address was already well-predicted and it's perturbing code layout enough that it gets slower (or faster). There could be something interesting here, but I can't tell without at least a little assembly output (or perhaps a relevant portion of some intermediate representation, not that I would know which one to look at).

If it's not changing anything, then perhaps there could be an interesting investigation into the variance of performance testing in this scenario. If it's changing something, then there could be an interesting investigation into when that makes things faster vs slower. As it is, I can't tell what I should be looking for.

[+] pklausler|1 year ago|reply
Mildly related programming language trivia:

Fortran has virtual functions ("type bound procedures"), and supports a NON_OVERRIDABLE attribute on them that is basically "final". (FINAL exists but means something else.) But it also has a means for localizing the non-overridable property.

If a type bound procedure is declared in a module, and is PRIVATE, then overrides in subtypes ("extended derived types") work as usual for subtypes in the same module, but can't be affected by overrides that appear in other modules. This allows a compiler to notice when a type has no subtypes in the same module, and basically infer that it is non-overridable locally, and thus resolve calls at compilation time.

Or it would, if compilers implemented this feature correctly. It's not well described in the standard, and only half of the Fortran compilers in the wild actually support it. So like too many things in the Fortran world, it might be useful, but it's not portable.