
Make LLVM Fast Again

582 points | notriddle | 5 years ago | nikic.github.io

235 comments

[+] londons_explore|5 years ago|reply
Am I the only one who wants to see a split into a "fast compile" mode and a "spend hours making every optimization possible" mode?

Most code is executed a lot more frequently than it is compiled, so if I can get a 1% speed increase with a 100x compile slowdown, I'll take it.

I don't want to see good PRs that improve LLVM delayed simply because they cause a speed regression.

[+] fluffything|5 years ago|reply
You can already spend as much as you'd like on optimizations if you are using LLVM. Just use a superoptimizer [0]:

    clang -Xclang -load -Xclang libsouperPass.so -mllvm -z3-path=/usr/bin/z3
, or increase the inlining heuristics..., or just create your own optimization strategy like rustc does [1], or...

LLVM is super configurable, so you can make it do whatever you want.

Clang defaults are tuned for the optimizations that give you the most bang for the time you put in, while still being able to compile a Web browser like Chrome or Firefox, or a whole Linux distribution's packages, in a reasonable amount of time.

If you don't care about how long compile-times take, then you are not a "target" clang user, but you can just pass clang extra arguments like those mentioned above, or even fork it to add your own -Oeternity option that takes your project and tries to compile it for a millennium on a supercomputer, for that little extra 0.00001% reduction in code size, at best.

Often, code compiled with -O3 is slower than code compiled with -O2, because "more optimizations" does not necessarily mean "faster code" like you seem to be suggesting.

[0]: https://github.com/google/souper [1]: https://github.com/rust-lang/rust/blob/master/src/rustllvm/P...

[+] worldsayshi|5 years ago|reply
Yeah this seems like two very different use cases. Stating the obvious: when debugging I want as fast builds as possible. When shipping I want as fast software as possible.
[+] egwor|5 years ago|reply
This should be an optimisation that you choose to enable, though, right? It might be prohibitively expensive to run the compile for other reasons (time to do a deploy for a fix, time and cost to run builds for CI etc.) so it has to be something that can be dialled up/down.
[+] fnord123|5 years ago|reply
For fast compiles, in clang, -fsyntax-only parses as quickly as possible. Then you have -O{0,1,2,3,s} for various levels of optimization.

As for the long baking compile, there is the concept of supercompilation, which checks the branches, determines whether there are any places where a function is called with constants, and partially evaluates the function with those constants frozen (similar to partial application / parameter binding, but you get a newly compiled function). It then has to determine whether the branch elision makes it worth dropping the other versions of the function. It's a cool topic that I researched a lot about a decade ago, but I think it's not an area with a lot of active interest in AOT compilers.

https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.50....

http://wiki.c2.com/?SuperCompiler

https://en.wikipedia.org/wiki/Stalin_(Scheme_implementation)

[+] bjoli|5 years ago|reply
Guile Scheme is getting a "baseline" compiler for that reason: it will compile fast, but not do any of the fancy CPS things that the regular compiler does.
[+] wallnuss|5 years ago|reply
It all depends on which optimizations you enable and LLVM is very flexible, albeit sometimes you still spend 20% of your time in ISel (Instruction Selection)...
[+] pfdietz|5 years ago|reply
> Most code is executed a lot more frequently than it is compiled, so if I can get a 1% speed increase with a 100x compile slowdown, I'll take it.

Is that really true? I'd have thought most code outside inner loops benefits almost negligibly from optimization.

[+] embrassingstuff|5 years ago|reply
How many people would use a cloud compiler?

Let's set aside technicalities and assume it's a real 5x improvement and all the files are mirrored seamlessly.

[+] DerSaidin|5 years ago|reply
This work is critical to compile times improving.

As the author of one of the changes that could have unknowingly caused a 1% regression, I really appreciate this work measuring and monitoring compile times. Thanks to nikic for noticing the regression and finding a solution to avoid it.

[+] fluffything|5 years ago|reply
I really hope this type of infrastructure gets moved into LLVM itself, and people start adding more benchmarks for all the frontends, and somehow integrating this into the CI infrastructure, to be able to block merging PRs on changes that accidentally impact LLVM's performance, like is currently the case for rustc.

But I guess the LLVM project should probably start by making code-reviews mandatory, gating PRs on passing tests so that master doesn't get broken all the time, etc. I really hate it when I update my LLVM locally from git master, and it won't even build because somebody pushed to master without even testing that their changes compile...

For Rust, I hope Cranelift really takes off someday, and we can start to completely ditch LLVM and make it opt-in, only for those cases in which you are willing to trade-off huge compile-times for that last 1% run-time reduction.

[+] drivebycomment|5 years ago|reply
Good read. One thought-provoking bit for me was:

> Waymarking was previously employed to avoid explicitly storing the user (or “parent”) corresponding to a use. Instead, the position of the user was encoded in the alignment bits of the use-list pointers (across multiple pointers). This was a space-time tradeoff and reportedly resulted in major memory usage reduction when it was originally introduced. Nowadays, the memory usage saving appears to be much smaller, resulting in the removal of this mechanism. (The cynic in me thinks that the impact is lower now, because everything else uses much more memory.)

Any seasoned programmers would remember a few of such things - you undo a decision made years ago because the assumptions have changed.

Programmers often make these kinds of trade-off choices based on the current state: the typical machines the program runs on, the typical inputs it deals with, and the current version of everything else in the program. But all of those environmental factors change over time, which can make the inputs to the trade-off quite different. Yet it's difficult to revisit all those decisions systematically, as they require too much human analysis.

If we could encode those trade-offs in the code itself, in a form accessible to a programmatic API, one could imagine a machine learning system that makes these trade-off decisions automatically over time, as everything else changes, by traversing the search space of those parameters. The programming languages of today unfortunately don't allow encoding such high-level semantics, but maybe it's possible to start small: which associative data structure to use can be chosen relatively easily, the initial size of a data structure could be chosen automatically based on benchmarks or even metrics from the real world, etc.

[+] adrianN|5 years ago|reply
I don't think that exploding the state space of your program by making the history of your design decisions programmatically accessible (and changing them regularly to reflect new assumptions) would be good for the quality of the result.
[+] jeffdavis|5 years ago|reply
Setting aside the AI angle, perhaps just recording the assumptions in a way that can be measured would be enough.

Tooling, runtime sampling, or just code review could reveal when the assumptions go awry.

[+] nyanpasu64|5 years ago|reply
Does FFTW automatically generate implementations optimized for each machine?
[+] nh2|5 years ago|reply
> I’m not sure whether this has been true in the past

Phoronix.com has a lot of Clang benchmarks over the years.

I recall seeing some benchmark that showed that as Clang approached GCC in performance of compiled output, the compile speed also went down to approach GCC levels.

But I haven't managed to find that exact benchmark yet.

[+] baybal2|5 years ago|reply
An expected result when they copy GCC's features and functionality, isn't it?

Pretty much the sole point of Clang/LLVM to the corporate sponsors is to get GCC, but without the GPL.

[+] ndesaulniers|5 years ago|reply
We plan on starting to track compile times for Linux kernel builds with llvm. If you have ideas for low hanging fruit in LLVM, we'd love to collaborate.
[+] Myrmornis|5 years ago|reply
Shameless plug: https://github.com/dandavison/chronologer runs a benchmark (using hyperfine) over every commit in a repository (or specified git revision range) and produces a boxplot-time-series graph using vega-lite. It works but is rough and I haven't tried to polish it -- does another tool exist that does this?
[+] aogl|5 years ago|reply
Well this is interesting. I thought I was the only one who noticed things getting slower. For a couple of releases now I've been thinking I was going crazy, as if something was only getting slower on my own machines. Glad to see someone else has the data to prove it. Thanks, I'll definitely watch this conversation play out as others realise the obvious.
[+] jeffbee|5 years ago|reply
Pretty cool improvements. For any large project, profiling it, making it faster, and preventing or reverting regressions can be a full-time job. Perhaps the LLVM project needs a person in such a role. Still, I question the utility of timing optimized builds. Usually when I have to wait for the compiler it's an incremental fastbuild to execute unit tests. Optimized builds usually happen while I'm busy doing something else.
[+] tbodt|5 years ago|reply
The problem with that is LLVM non-optimized codegen is so bad that many projects build with -O2 even in debug mode.
[+] schlupa|5 years ago|reply
Building llvm+clang from source is also ludicrous: 70 GB of disk space usage and an hour to build. It's the static linking that is the culprit here; hundreds-of-MB binaries are a catastrophe for the cache and memory subsystem. The funny thing is that my project also uses modules in D, and building the D compiler takes 10 seconds, including unpacking of the tarball.
[+] jcelerier|5 years ago|reply
I build LLVM+Clang regularly and it definitely does not take 70GB.
[+] moonchild|5 years ago|reply
> I can’t say a 10% improvement is making LLVM fast again, we would need a 10x improvement for it to deserve that label. But it’s a start…

It’s a shame; one of the standout features of llvm/clang used to be that it was faster than GCC. Today, an optimized build with gcc is faster than a debug build with clang. I don’t know if a 10x improvement is feasible, though; tcc is 10-20x faster than gcc and clang, and part of the reason is that it does a lot less. The architecture of such a compiler may by necessity be too generic.

Here’s a table listing build times for one of my projects with and without optimizations in gcc, clang, and tcc. Tcc w/optimizations shown only for completeness; the time isn’t appreciably different. 20 runs each.

  ┌─────────────────────────────┬──────────┬──────────┬──────────┬─────────┬────────────┬────────────┐
  │                             │Clang -O2 │Clang -O0 │GCC -O2   │GCC -O0  │TCC -O2     │TCC -O0     │
  ├─────────────────────────────┼──────────┼──────────┼──────────┼─────────┼────────────┼────────────┤
  │Average time (s)             │1.49 ±0.11│1.24 ±0.08│1.06 ±0.08│0.8 ±0.04│0.072 ±0.011│0.072 ±0.014│
  ├─────────────────────────────┼──────────┼──────────┼──────────┼─────────┼────────────┼────────────┤
  │Speedup compared to clang -O2│        - │     1.20 │     1.40 │    1.86 │      20.59 │      20.69 │
  ├─────────────────────────────┼──────────┼──────────┼──────────┼─────────┼────────────┼────────────┤
  │Slowdown compared to TCC     │    20.68 │    17.20 │    17.72 │   11.12 │          - │          - │
  └─────────────────────────────┴──────────┴──────────┴──────────┴─────────┴────────────┴────────────┘
[+] judofyr|5 years ago|reply
> Today, an optimized build with gcc is slower than a debug build with clang.

Did you mean "an optimized build with gcc is faster than a debug build with clang"?

[+] underdeserver|5 years ago|reply
If you're comparing optimizations, I'd also want to see the runtime of a reference program.
[+] oblio|5 years ago|reply
How did you make the table in your comment?
[+] nickcw|5 years ago|reply
I think this is a worthy effort :-) I find the compile times of rust to be quite a big negative point.

However:

> For every tested commit, the programs are compiled in three different configurations: O3, ReleaseThinLTO and ReleaseLTO-g. All of these use -O3 in three different LTO configurations (none, thin and fat), with the last one also enabling debuginfo generation.

I would have thought that for developer productivity, tracking -O1 compile times would be better, wouldn't it?

I'm happy for the CI to spend ages crunching out the best possible binary, but taking time out of the edit-compile-test loop would really help developers.

[+] bluGill|5 years ago|reply
Both are worth tracking. If O3 is doing useless work then I'll take the speed up. If it is twice as long for a 1% improvement I'll take it.
[+] thu2111|5 years ago|reply
Hmm. This is one of the unexpected upsides to systems using JIT compilation that I guess we tend to take for granted. The very fact that a JITC runs in parallel to the app means the compiler developers care intensely about the performance of the compiler itself - any regression increases warmup time which is a closely tracked metric.

As long as you can tolerate the warmup, and at least for Java it's not really a big deal for many apps these days because C1/C2 are just so fast, you get fast iteration speeds with pretty good code generation too. The remaining performance pain points in Java apps are things like the lack of explicit vectorisation, value types etc, which are all being worked on.

[+] RX14|5 years ago|reply
I would greatly greatly appreciate an effort to benchmark builds without optimizations too. We've seen some LLVM-related slowdowns in Crystal, and --release compile times are far less important than non-release builds to us.
[+] NCG_Mike|5 years ago|reply
A couple of things a C++ developer can do is to put template instantiation code into a .cpp file, where possible.

"#pragma once" in the header files helps as does using a pre-compiled header file.

Obviously, removing header files that aren't needed makes a difference too.

[+] brandmeyer|5 years ago|reply
`pragma once` doesn't do anything that a well-written header guard does.
[+] The_rationalist|5 years ago|reply
The root cause of the issue is that it should be mandatory for each pull request merged into LLVM to (almost) not regress performance. The CI should have a bunch of canonical performance tests. If this had been mandatory from the start, LLVM could have been far faster. It is not too late, but it's time to put an end to this mediocrity.
[+] Joky|5 years ago|reply
You may not know, but:

1) LLVM does not use pull requests, and code review isn't mandatory for the main contributors.

2) until a few months ago LLVM didn't even have any testing system before the code is pushed to the master branch: folks just push code directly after (hopefully) building/testing locally.

The real root cause is that no one cares enough to really invest deeply into this. There is also not a clear community guideline on what is acceptable (can I regress O2 compile time by 1% if I improve "some" benchmarks by 1%? Who defines the benchmarks suite? etc.)

[+] yjftsjthsd-h|5 years ago|reply
Although tracking it would help, I see 2 issues (in opposite directions): I think there are times when slower performance is an acceptable cost. And, I think that if you allow tiny slowdowns, over time we'll get back here. There's judgment involved.
[+] zelphirkalt|5 years ago|reply
Wouldn't that lead to hitting a local maximum / minimum and getting stuck there?
[+] renewiltord|5 years ago|reply
This is just armchair quarterbacking. Utterly useless.
[+] xvilka|5 years ago|reply
Reimplementing LLVM in Rust could make a big difference as well.
[+] adrianN|5 years ago|reply
Why do you think that? Rust and C++ are reasonably close in performance.
[+] fakename11|5 years ago|reply
I doubt this, and the amount of effort it would take would be a huge waste.
[+] dangwu|5 years ago|reply
LLVM has devolved into complete garbage in Xcode for large Swift projects. Slowness aside, at least half the time it won't display values for variables after hitting a breakpoint, and my team has to resort to using print statements to debug issues.
[+] dgentile|5 years ago|reply
The "Rust is 10% slower" metric is unfair, I think. If you look on godbolt, the LLVM IR that rustc emits isn't that great, so LLVM has to take some extra time to optimize it, compared to the output of clang.
[+] est31|5 years ago|reply
It's not 10% slower compared to clang but slower compared to the prior version of LLVM. That comparison IS fair as LLVM specifically invites people to target it.