In their comparisons they explain that the existing hot/cold splitting pass creates new functions for the cold parts, which introduces overhead due to calling conventions and exception handling. Their new method avoids that by using simple jumps.
I wish compilers were better at optimizing across the boundaries of functions that are visible but not inlined, so they could optimize those function call overheads away automatically. There is no reason for a compiler to limit itself to the system ABI for completely internal functions.
Besides leaving values in the registers they are already in instead of moving them to where the ABI says they should be, this could enable passing C++ objects such as std::unique_ptr in registers, where the current ABI forbids this. Or eliminating no-op destructors for objects that are always moved from in the called function.
In general, I think compilers should see function boundaries in the source code as a mere hint when it comes to code generation, just like they already do with the register keyword.
Is there any existing compiler work to optimize function calls where the implementation and all uses are visible in the translation unit?
> There is no reason for a compiler to limit itself to the system ABI for completely internal functions.
As far as I understand it, this limitation is also the only thing preventing tail-call elimination in C/C++. Tail-recursive functions are essentially the same as 'while' loops (execute some instructions then conditionally go back to the start); they just have different scope and calling conventions. In particular, 'while' loops are always inline and can't be referenced from other functions, but tail-recursive functions can. Hence compilers can apply more aggressive optimisations to 'while' loops, e.g. re-using the same stack frame and iterating with a conditional jump.
However, as you say, functions-as-subroutines don't necessarily need to coincide with functions-as-interface. Tail-recursive functions which are only used internally can also be optimised to re-use the same stack frame and iterate with a conditional jump. If such functions are also exposed in the interface, it's pretty trivial to expose a dummy entry point which invokes the optimised version (if we want to); although cross-unit tail calls wouldn't be eliminated in that case.
Of course, this is more useful than just writing 'while' loops in a different style (although I personally prefer recursion https://news.ycombinator.com/item?id=11150346 ), since tail calls don't have to be recursive: we can split our code up into modular, composable, self-contained functions without having to worry about stack usage (or code bloat caused by inlining into multiple locations).
I've seen a poster presentation at the student research competition of a PL conference where the student created a tool to do this after the fact on binaries. It looked quite promising, but I think preliminary results were single digit percentage speedup for most benchmarks. I'll have to see if I can find it again.
Functions by default have external linkage, and as such can be called by anyone, so of course this could only apply to functions that are static or otherwise internal. Once you have that down, you have to deal with the fact that what you would want to optimize may differ at each call site (one might like to keep rdx free, another may be able to give up rdi), so I am unsure it would be worth the effort to deviate from the ABI.
Or at least any that supports gathering and utilizing profiling information. Last time I checked, that didn't include Rust yet for instance, but my information could be out of date there.
Note that LLVM already has a function outliner; you may have seen its output if you've ever looked inside a binary on macOS and noticed *.cold.* functions, which contain the unlikely-to-execute code cut out from the functions they came from. This appears to be an improvement on that effort.
Right. The improvement that this work brings is that it performs the function split very late and at a low level.
Basically, while the previous outliner split a function into two functions (the hot one literally calling the cold one as needed), this new thing takes a single function and splits it into two parts connected to each other by jumps. The cold part of the function isn't really a function: it's just a group of basic blocks that happen to be located far away from the other group of basic blocks.
By avoiding the call into the cold function and the return to the hot function, the generated code can be smaller and more register-efficient.
It always surprises me how many people are familiar with LLVM internals. Just looking at the 60 unique commenters at the time of posting this is blowing my mind.
I wouldn't have expected any functions to be 5KiB or larger. That seems like a lot of assembly to me, and so it makes sense to me that these functions would be "accidentally" generated by inlining.
I'm curious what percent of functions are this large?
This plus multi-return (pushing multiple return addresses into the stack frame) together mean, I think, that we could just use Result/Either and never unwind, and leave no performance behind.
2% better perf isn't anything to sneeze at but also not super drastic. Will be curious to see if this unlocks more optimizations over time.
I'm also curious whether there's been any work to organize code layout so that paging code out is more efficient, to reduce memory pressure (so that you're using all the code in the pages you bring in) and to reduce paging thrash (so you're not evicting a code page only to bring it back for a few hundred bytes).
Given the greatly improved TLB hit rate, I sort of wonder if the test isn't abusing the code enough. Does this change the inflection point where throughput falls from trying to run too many tasks in parallel or sequentially? 2% response time won't do much, but 10% higher peak load would change quite a few priorities on a lot of backlogs.
So, to use this new optimization pass, you have to specify it explicitly with an additional command-line flag instead of it simply being turned on as an ordinary part of PGO?
It seems like most people who just want to squeeze the last drop of speed out of their code will probably never become aware of it then, unless they are specifically seeking it out.
It's an experimental feature. If it works well for a while, it may become part of the default. But it should definitely not be forced on everyone while it's not yet clear how well it works.
One issue with PGO is you are optimizing for the particular subset of use cases which are profiled, on the exact machine you profile on. The 2% quoted improvement will not necessarily generalize across all use cases for the software and all machines it runs on. In fact, you may often be taking that 2% or more away from other use cases.
> One issue with PGO is you are optimizing for the particular subset of use cases which are profiled, on the exact machine you profile on.
The "exact machine" part is not true. There's nothing about this particular optimization that's machine-specific: as the original post explains, it gives a performance boost on both Intel and AMD, on Intel due to a reduction in iTLB misses, and on AMD due to a reduction in L1 and L2 icache misses. That is, this kind of working-set reduction translates to any platform.
> In fact, you may often be taking that 2% or more away from other use cases.
In general, it is correct that profile-guided optimization can theoretically reduce performance, as some aggressive optimizations are only done with a profile because of the inherent trade-offs they carry (e.g. aggressive inlining, which can be detrimental to performance if the actual hot functions turn out to be entirely different).
However, empirically this is not true in most cases, unless you picked really bad training input and your code has extremely different behavior under different inputs. Moreover, nowadays with sampled profiles, which you can collect from real production runs, it's extremely unlikely for this to happen.
Yes, taking advantage of PGO is hard. But with the right infrastructure you get very accurate profiles. For example, many large companies gather profiles from the previous version running in production. I have even seen cases where the new version was fed replayed requests and then recompiled with the recorded profiles.
So yes, PGO is hard, but if you collect representative profiles it is a massive improvement.
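For anyone looking for the flag mentioned upthread: as of recent Clang releases it is -fsplit-machine-functions, and it needs a profile to act on. A typical workflow looks roughly like this (file names and the training input are illustrative):

```shell
# 1. Build an instrumented binary and run a representative workload.
clang++ -O2 -fprofile-generate app.cpp -o app
./app < training-input            # writes default_*.profraw
llvm-profdata merge -o app.profdata default_*.profraw

# 2. Rebuild with the profile, opting in to the machine function splitter.
clang++ -O2 -fprofile-use=app.profdata -fsplit-machine-functions app.cpp -o app
```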
jeffbee | 5 years ago:
https://lists.llvm.org/pipermail/llvm-dev/2020-August/144012...
mleonhard | 5 years ago:
https://groups.google.com/d/msg/llvm-dev/RUegaMg-iqc/wFAVxa6...
snehasish | 5 years ago:
https://groups.google.com/g/llvm-dev/c/RUegaMg-iqc/m/VFyV9cX...
MaxBarraclough | 5 years ago:
Not bad. Presumably this will benefit every compiler that uses LLVM, not just Clang.
repsilat | 5 years ago:
The difference is sampling prod vs sampling separately in test. The arguments for FDO:
- Prod behaviour/data is always "representative", whereas synthetic or recorded data can go out of date quickly.
- PGO test fixtures can contain sensitive user data. Instrumenting production processes doesn't put data in more places.
The benefits of both are huge though. The rule of thumb I've seen is a 20% improvement for FDO over -O3.
IvyMike | 5 years ago:
Ok, I would not have expected that.