, or increase the inlining heuristics..., or just create your own optimization strategy like rustc does [1], or...
LLVM is super configurable, so you can make it do whatever you want.
Clang defaults are tuned for the optimizations that give you the most bang for the time you put in, while still being able to compile a Web browser like Chrome or Firefox, or all the packages of a Linux distribution, in a reasonable amount of time.
If you don't care about how long compilation takes, then you are not a "target" clang user, but you can just pass clang extra arguments like those mentioned above, or even fork it to add your own -Oeternity option that compiles your project for a millennium on a supercomputer for that little extra 0.00001% reduction in code size, at best.
Because often, code compiled with -O3 is slower than code compiled with -O2. Because "more optimizations" do not necessarily mean "faster code" like you seem to be suggesting.
Yeah this seems like two very different use cases. Stating the obvious: when debugging I want as fast builds as possible. When shipping I want as fast software as possible.
This should be an optimisation that you choose to enable, though, right? It might be prohibitively expensive to run the compile for other reasons (time to do a deploy for a fix, time and cost to run builds for CI etc.) so it has to be something that can be dialled up/down.
For fast compiles, in clang, -fsyntax-only parses as quickly as possible. Then you have -O{0,1,2,3,s} for various levels of optimization.
As for the long-baking compile, there is a concept of supercompilation, which will check the branches, determine if there are any places where a function is called with constants, and partially evaluate the function with those constants frozen (similar to partial application / parameter binding, except that you get a newly compiled function). But then it has to determine whether the branch elision makes it worth dropping the other versions of the function. It's a cool topic that I researched a lot about a decade ago, but I think it's not an area with a lot of active interest in AOT compilers.
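A toy sketch of that specialization step, in Python rather than compiler IR (the function and the frozen constant are made up for illustration):

```python
# Toy illustration of the idea (nothing like a real supercompiler): freeze
# a constant argument so the branch on it can be elided in a specialized
# version of the function.

def render(pixel, mode):
    # generic version: branches on `mode` on every call
    if mode == "grayscale":
        r, g, b = pixel
        v = (r + g + b) // 3
        return (v, v, v)
    return pixel

def specialize_grayscale():
    # the "supercompiled" variant: with `mode` frozen to "grayscale",
    # the branch disappears and only the taken path remains
    def render_grayscale(pixel):
        r, g, b = pixel
        v = (r + g + b) // 3
        return (v, v, v)
    return render_grayscale

fast = specialize_grayscale()
assert fast((30, 60, 90)) == render((30, 60, 90), "grayscale")  # (60, 60, 60)
```

The hard part, as described above, is not producing the specialized copy but deciding whether keeping both the generic and specialized versions is worth the code-size cost.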
Guile Scheme is getting a "baseline" compiler for that reason: it will compile fast, but not do any of the fancy CPS things that the regular compiler does.
It all depends on which optimizations you enable and LLVM is very flexible, albeit sometimes you still spend 20% of your time in ISel (Instruction Selection)...
As the author of one of the changes which could have unknowingly caused a 1% regression, I really appreciate this work measuring and monitoring compile times.
Thanks to nikic for noticing the regression and finding a solution to avoid it.
I really hope this type of infrastructure gets moved into LLVM itself, that people start adding more benchmarks for all the frontends, and that it somehow gets integrated into the CI infrastructure, so that merging PRs can be blocked on changes that accidentally impact LLVM's performance, as is currently the case for rustc.
But I guess the LLVM project should probably start by making code-reviews mandatory, gating PRs on passing tests so that master doesn't get broken all the time, etc. I really hate it when I update my LLVM locally from git master, and it won't even build because somebody pushed to master without even testing that their changes compile...
For Rust, I hope Cranelift really takes off someday, so we can ditch LLVM as the default and make it opt-in, only for those cases in which you are willing to trade huge compile times for that last 1% run-time reduction.
> Waymarking was previously employed to avoid explicitly storing the user (or “parent”) corresponding to a use. Instead, the position of the user was encoded in the alignment bits of the use-list pointers (across multiple pointers). This was a space-time tradeoff and reportedly resulted in major memory usage reduction when it was originally introduced. Nowadays, the memory usage saving appears to be much smaller, resulting in the removal of this mechanism. (The cynic in me thinks that the impact is lower now, because everything else uses much more memory.)
Any seasoned programmer will remember a few such things - you undo a decision made years ago because the assumptions have changed.
Programmers often make these kinds of trade-off choices based on the current state: the typical machines the program runs on, the typical inputs it deals with, and the current version of everything else in the program. But all of those environmental factors change over time, which can make the inputs to the trade-off quite different. Yet it's difficult to revisit all those decisions systematically, as they require too much human analysis. If we could encode those trade-offs in the code itself, in a form accessible to a programmatic API, one can imagine a machine learning system making those trade-off decisions automatically over time, as everything else changes, by traversing the search space of all those parameters. Today's programming languages unfortunately don't allow encoding such high-level semantics, but maybe it's possible to start small - e.g. which associative data structure to use could be chosen relatively easily, and the initial size of a data structure could be chosen automatically based on benchmarks or even metrics from the real world.
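One way to picture the "start small" version of this - choosing a container empirically instead of hard-coding the choice - is a sketch like the following (all names and workload sizes are hypothetical):

```python
# Hypothetical sketch: pick a container by timing the actual membership
# workload, so the trade-off is re-evaluated as data sizes, hardware,
# and runtime versions change.
import timeit

def pick_membership_container(items, probes):
    candidates = {"list": list(items), "set": set(items)}
    timings = {}
    for name, container in candidates.items():
        # time the real workload against each candidate structure
        timings[name] = timeit.timeit(
            lambda: [p in container for p in probes], number=20
        )
    best = min(timings, key=timings.get)
    return best, candidates[best]

name, container = pick_membership_container(range(2000), list(range(0, 2000, 13)))
# for a membership-heavy workload of this size, the set is expected to win
```

A real system would cache the decision and re-run the measurement only occasionally, but the principle - the program measures rather than assumes - is the same.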
I don't think that exploding the state space of your program by making the history of your design decisions programmatically accessible (and changing them regularly to reflect new assumptions) would be good for the quality of the result.
> I’m not sure whether this has been true in the past
Phoronix.com has a lot of Clang benchmarks over the years.
I recall seeing some benchmark that showed that as Clang approached GCC in performance of compiled output, the compile speed also went down to approach GCC levels.
But I haven't managed to find that exact benchmark yet.
We plan on starting to track compile times for Linux kernel builds with LLVM. If you have ideas for low-hanging fruit in LLVM, we'd love to collaborate.
Shameless plug: https://github.com/dandavison/chronologer runs a benchmark (using hyperfine) over every commit in a repository (or specified git revision range) and produces a boxplot-time-series graph using vega-lite. It works but is rough and I haven't tried to polish it -- does another tool exist that does this?
This is interesting. I'm working in epidemiological modelling atm and something like this would be pretty useful to run in a GitHub Action, CI-style, to find performance regressions over time.
Well this is interesting.
I thought I was the only one who noticed things getting slower.
For a couple of releases now I've been thinking I was going crazy, as if something was only ever getting slower on my own machines. Glad to see someone else putting together data to prove it.
Thanks, I'll definitely watch this conversation play out as others realise the obvious...
Pretty cool improvements. For any large project, profiling it, making it faster, and preventing or reverting regressions can be a full-time job. Perhaps the LLVM project needs a person in such a role. Still, I question the utility of timing optimized builds. Usually when I have to wait for the compiler it's an incremental fastbuild to execute unit tests. Optimized builds usually happen while I'm busy doing something else.
Building llvm+clang from source is also ludicrous: 70 GB of disk space and an hour of build time. It's the static linking which is the culprit here; binaries hundreds of MB in size are a catastrophe for the cache and memory subsystem. The funny thing is that my project also uses modules in D. Building the D compiler takes 10 seconds, including unpacking of the tarball.
> I can’t say a 10% improvement is making LLVM fast again, we would need a 10x improvement for it to deserve that label. But it’s a start…
It’s a shame; one of the standout features of llvm/clang used to be that it was faster than GCC. Today, an optimized build with gcc is faster than a debug build with clang. I don’t know if a 10x improvement is feasible, though; tcc is between 10-20x faster than gcc and clang, and part of the reason is that it does a lot less. The architecture of such a compiler may, by necessity, be too generic.
Here’s a table listing build times for one of my projects with and without optimizations in gcc, clang, and tcc. Tcc w/optimizations shown only for completeness; the time isn’t appreciably different. 20 runs each.
I think this is a worthy effort :-) I find Rust's compile times to be quite a big negative point.
However:
> For every tested commit, the programs are compiled in three different configurations: O3, ReleaseThinLTO and ReleaseLTO-g. All of these use -O3 in three different LTO configurations (none, thin and fat), with the last one also enabling debuginfo generation.
I would have thought that for tracking developer productivity, -O1 compile times would be better, wouldn't they?
I'm happy for the CI to spend ages crunching out the best possible binary, but taking time out of the edit-compile-test loop would really help developers.
Hmm. This is one of the unexpected upsides to systems using JIT compilation that I guess we tend to take for granted. The very fact that a JITC runs in parallel to the app means the compiler developers care intensely about the performance of the compiler itself - any regression increases warmup time which is a closely tracked metric.
As long as you can tolerate the warmup, and at least for Java it's not really a big deal for many apps these days because C1/C2 are just so fast, you get fast iteration speeds with pretty good code generation too. The remaining performance pain points in Java apps are things like the lack of explicit vectorisation, value types etc, which are all being worked on.
I would greatly greatly appreciate an effort to benchmark builds without optimizations too. We've seen some LLVM-related slowdowns in Crystal, and --release compile times are far less important than non-release builds to us.
The root cause of the issue is that it should be made mandatory for each pull request merged into LLVM to (almost) never regress performance. The CI should have a bunch of canonical performance tests.
If this had been mandatory from the start, LLVM could have been far faster. It is not too late, but it's time to put an end to this mediocrity.
1) LLVM does not use pull requests, and code review isn't mandatory for the main contributors.
2) Until a few months ago, LLVM didn't even have any testing system before code is pushed to the master branch: folks just push code directly after (hopefully) building/testing locally.
The real root cause is that no one cares enough to really invest deeply into this. There is also not a clear community guideline on what is acceptable (can I regress O2 compile time by 1% if I improve "some" benchmarks by 1%? Who defines the benchmarks suite? etc.)
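The kind of gate being asked for might, at its core, look something like this sketch (the benchmark names and the 1% threshold are invented for illustration — defining the real suite and threshold is exactly the open community question):

```python
# Hypothetical sketch: fail a PR when any tracked benchmark's compile time
# regresses past a noise threshold relative to a baseline run.

def check_regressions(baseline, candidate, threshold=0.01):
    """Return (benchmark, relative_change) pairs exceeding the threshold."""
    failures = []
    for bench, base_time in baseline.items():
        change = (candidate[bench] - base_time) / base_time
        if change > threshold:
            failures.append((bench, change))
    return failures

baseline  = {"sqlite3-O3": 100.0, "kernel-O2": 250.0}   # seconds, invented
candidate = {"sqlite3-O3": 100.5, "kernel-O2": 255.0}   # +0.5% and +2.0%
failures = check_regressions(baseline, candidate)
# only the +2.0% regression trips the 1% gate
assert [bench for bench, _ in failures] == ["kernel-O2"]
```

The hard part isn't the comparison - it's agreeing on the benchmark suite, the noise threshold, and who gets to trade a compile-time regression against a run-time win.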
Although tracking it would help, I see 2 issues (in opposite directions): I think there are times when slower performance is an acceptable cost. And, I think that if you allow tiny slowdowns, over time we'll get back here. There's judgment involved.
LLVM has devolved into complete garbage in Xcode for large Swift projects. Slowness aside, at least half the time it won't display values for variables after hitting a breakpoint, and my team has to resort to using print statements to debug issues.
The "Rust is 10% slower" metric is unfair, I think. If you look on godbolt, the LLVM IR that rustc emits isn't that great, so LLVM has to take some extra time to optimize it, compared to the output of clang.
It's not 10% slower compared to clang but slower compared to the prior version of LLVM. That comparison IS fair as LLVM specifically invites people to target it.
londons_explore | 5 years ago
Most code is executed a lot more frequently than it is compiled, so if I can get a 1% speed increase with a 100x compile slowdown, I'll take it.
I don't want to see good PRs that improve LLVM delayed simply because they cause a speed regression.
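A quick back-of-envelope version of that trade-off, with invented numbers:

```python
# Back-of-envelope arithmetic for the trade-off above; all numbers are made
# up for illustration. A 100x compile slowdown pays off only once the binary
# runs long enough that a 1% runtime saving covers the extra compile time.

compile_time = 60.0                                # seconds for a normal build
extra_compile = compile_time * 100 - compile_time  # 5940 s of extra compiling
runtime_saving = 0.01                              # code runs 1% faster

def breakeven_runtime(extra_compile, saving):
    # total CPU-seconds the binary must run before the saving has repaid
    # the extra compile time: runtime * saving == extra_compile
    return extra_compile / saving

# roughly 594,000 s, i.e. about a week of CPU time, per build
print(breakeven_runtime(extra_compile, runtime_saving))
```

For widely deployed code that's an easy win; for code recompiled on every edit-compile-test cycle, it clearly isn't - which is why the trade-off only makes sense as an opt-in.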
fluffything | 5 years ago
[0]: https://github.com/google/souper [1]: https://github.com/rust-lang/rust/blob/master/src/rustllvm/P...
fnord123 | 5 years ago
https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.50....
http://wiki.c2.com/?SuperCompiler
https://en.wikipedia.org/wiki/Stalin_(Scheme_implementation)
pfdietz | 5 years ago
Is that really true? I'd have thought most code outside inner loops benefits almost negligibly from optimization.
embrassingstuff | 5 years ago
Let's set aside technicalities and assume it's a real 5x improvement and all the files are mirrored seamlessly.
jeffdavis | 5 years ago
Tooling, runtime sampling, or just code review could reveal when the assumptions go awry.
baybal2 | 5 years ago
Pretty much the sole point of Clang/LLVM to the corporate sponsors is to get GCC, but without the GPL.
The_Amp_Walrus | 5 years ago
I did a quick Google and found this: https://github.com/marketplace/actions/continuous-benchmark
judofyr | 5 years ago
Did you mean "an optimized build with gcc is faster than a debug build with clang"?
NCG_Mike | 5 years ago
"#pragma once" in the header files helps as does using a pre-compiled header file.
Obviously, removing header files that aren't needed makes a difference too.