
Exploring Clang/LLVM optimization on programming horror

362 points | maattdd | 4 years ago | blog.matthieud.me

68 comments

[+] woodruffw|4 years ago|reply
Great writeup! This is a nice overview of how LLVM's optimization passes compose to produce nicely optimized assembly.

One nitpicky observation: mem2reg doesn't "move values from memory [...] to registers (directly inside the CPU)." It's an SSA expansion pass, taking the minimal SSA form produced by the C/C++ frontend and turning it into a more aggressive SSA form by eliminating as many `alloca`s as possible. SSA is characterized by the presence of infinitely many "fresh" abstract registers, none of which correspond to actual CPU registers -- LLVM is responsible at a later stage for performing efficient lowering and allocation of registers/stack slots for the SSA registers.

[+] CyberShadow|4 years ago|reply
Not nitpicky at all - I was confused as to why one would want to do register allocation before control/data flow analysis. Thanks for the explanation!
[+] slavik81|4 years ago|reply
Could you give an example of an alloca mem2reg would eliminate?
[+] dragontamer|4 years ago|reply
https://llvm.org/docs/Passes.html#passes-mem2reg

> -mem2reg: Promote Memory to Register

> This file promotes memory references to be register references. It promotes alloca instructions which only have loads and stores as uses. An alloca is transformed by using dominator frontiers to place phi nodes, then traversing the function in depth-first order to rewrite loads and stores as appropriate. This is just the standard SSA construction algorithm to construct “pruned” SSA form.

-----------

https://llvm.org/docs/LangRef.html#alloca-instruction

> The ‘alloca’ instruction allocates memory on the stack frame of the currently executing function, to be automatically released when this function returns to its caller. If the address space is not explicitly specified, the object is allocated in the alloca address space from the datalayout string.

--------

Based on my reading and understanding of alloca and mem2reg above, I have to disagree with your assessment. It seems like alloca roughly corresponds to a "push" in your typical x86 assembly language, or maybe a "sub esp, #ofBytes".

By removing alloca instructions, the mem2reg step is turning stack-memory into registers.

[+] eatonphil|4 years ago|reply
The core simplification is due to LLVM's induction-detection step. Even after doing induction-based proofs in school, until reading this I had the sense that induction is kinda magical and kinda not real.

I still cannot fathom how you can encode induction detection into an algorithm, any pointers welcome (keeping it simple, please, and ideally with working code).

The only case that makes sense to me is if you did numeric computation against some fixed number of cases and if that worked out then you assume it's right.

I guess this is what proof assistants do (among other things). Maybe I should look into how to write a basic proof assistant.

[+] jcranmer|4 years ago|reply
Induction detection is relatively simple in LLVM IR. The first thing to remember is that the compiler doesn't work on the same source code that the human does; it works on a different representation. The canonical loop structure would look something like this:

  loop.preheader:
    ; This is the block that executes before the loop does.
    br label %loop.body

  loop.body:
    %i = phi i32 [0, %loop.preheader], [%i.next, %loop.body]
    %val = phi i1 [0, %loop.preheader], [%val.next, %loop.body]
    %val.next = xor i1 %val, -1
    %i.next = add i32 %i, 1
    %cmp = icmp slt i32 %i.next, %N
    br i1 %cmp, label %loop.body, label %loop.exit

  loop.exit:
    ; After the loop is done
Because of how phis work, in constructing this form, we immediately realize how to describe a value either in terms of its value in the first iteration (this is the value it takes coming from the preheader), or in terms solely of its value in the previous iteration. In other words, %val.next = f(%val) for some function f, which we can easily identify in the source code.

Now, if we can compute the repeated composition of f easily, we can replace the recurrence in the loop with repeated composition. Repeated addition of the same value is multiplication, and repeated multiplication of the same value is exponentiation -- those are two very easy examples. We recognize that there is no use of the value except its final occurrence, and so instead of running the loop, we can generate the composition itself. Thus, after N iterations of the loop, we can conclude that %val.next = f^N(0), and if we can identify the loop iteration count (that's %N in this case), we can simply write out f^N instead of having to compute every individual value.

[+] contravariant|4 years ago|reply
Well in this case it seems reasonable that they simply added a rule for the specific line:

    even = !even
and the line

    numberCompare++
resulting in

    even = (indvar & 1)
    numberCompare = indvar
You can then solve from the exit condition that number == numberCompare, which gives you the indvar, which can then be substituted into 'even'.

I'm not saying it isn't magical, and certainly requires some symbolic solving, but it's doable.

Of course the real goal is to eventually get it to optimize

    while (number != 1) {
        if (number & 1)
            number = number * 3 + 1;
        number >>= 1;
    }
[+] vnorilo|4 years ago|reply
Well, basically it's about converting iteratively computed state into a function of the iteration count.

Consider indvar[n] = indvar[n-1] + k, indvar[0] = a. It follows that indvar[n] = a + n * k.

This is the type of indvar canonicalization that can zap loops such as the one in the article. Suddenly the final state is just a function of the loop trip count.

The return value is replaced with canonicalized indvar. That makes the loop itself dead (nobody observes any of its effects) and removed by subsequent dead code elimination pass.

My example transforms addition to multiplication. Another common one is multiplication into power. That's close to what's going on in the article example: xor is just a multiplication of signed one-bit integers.

[+] geofft|4 years ago|reply
This appears to be the source code: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Tran... It's long, but it looks well-commented! Some Googling also finds https://llvm.org/devmtg/2009-10/ScalarEvolutionAndLoopOptimi... which looks relevant.

I think "induction" here isn't exactly the sense of proofs by mathematical induction as you learn in school (as in proving something for all positive integers given a base case and a recursive case) - I think all they mean by it is "a loop, over a single integer." Quoting https://en.wikipedia.org/wiki/Induction_variable :

> In computer science, an induction variable is a variable that gets increased or decreased by a fixed amount on every iteration of a loop or is a linear function of another induction variable.

The description at https://www.cs.cornell.edu/courses/cs6120/2019fa/blog/ive/ makes it clearer:

> An induction variable is any variable whose value can be represented as a function of: loop invariants; the number of loop iterations that have executed; and other induction variables.

That is to say, the link to proofs-by-induction is that if you have a loop like "for (i = 0; i < 10; i++)", you can prove the statement "after N loops, i = N". Why? Because after 0 loops, i = 0 (the first part), and on each loop, i is one greater (the i++ part), and nothing else that happens in the loop changes i (which is what that post means by "loop invariants").

In other words, it's not that the compiler is detecting that the programmer is doing induction (the programmer is not, in fact the programmer is simply writing a for loop and not thinking about it), it's that the compiler is performing induction on the loop.

So a simple case is if you have a loop like

    int i;
    for (i = 0; i < 10; i++)
        printf("Hi!\n");
    printf("i = %d\n", i);
You can prove that at the end of the loop, i = 10, and you don't need to actually increment i during each iteration to get there. And if you didn't have that printf (i.e., if the loop body was empty), you could optimize the loop out entirely and just print out 10.

So it's not terribly magical that LLVM is able to turn

    while (number != numberCompare)
    {
        even = !even;
        numberCompare++;
    }
into, in pseudocode,

    numberCompare = number;
    even = !!!...!!!even (numberCompare times);
i.e., that it's able to get numberCompare out of the loop, to conclude that even in turn is only a function of numberCompare, and then that there's no loop left.

What's more magical to me, I think, is that it's able to make sense of "run the ! operation numberCompare times" and turn that into "if (numberCompare % 2 == 1) even = !even". I don't think that's too surprising either - a compiler should know that, for booleans, !!foo can be simplified to foo. But repeatedly applying that rule across an abstract number of calls to !!, plus possibly one more, seems pretty clever.

If I'm skimming the code right, that part of the simplification is a different optimization pass entirely. I don't know off hand where that's implemented. I'd be curious if someone can find it.

[+] benmmurphy|4 years ago|reply
the induction step is pretty cool. it will remove the loop calculating the sum of an arithmetic progression and replace it with just multiplies/shifts/subtracts and adds.
[+] CalChris|4 years ago|reply
These optimizations go from LLVM IR to LLVM IR. So basically every backend benefits. I don't think most backend engineers would even understand them. I don't.
[+] woodruffw|4 years ago|reply
> I don't think most backend engineers would even understand them. I don't.

Most engineers are intimidated by compilers, but they really shouldn't be! LLVM's source code is very approachable and well-isolated by concern, and most of the fundamental concepts in an optimizing compiler are understandable with a beginner's understanding of data structures, assembly, and resource allocation.

One slightly funny thing, though: LLVM's optimizations are on IR, so every frontend can theoretically benefit from them. However, many assume patterns originating from the C/C++ frontend or heavy canonicalization, so you need to be a little careful as a language frontend developer to take full advantage of all optimizations!

[+] dragontamer|4 years ago|reply
None of these steps are too hard to think about in SSA form (a specific transformation of code that LLVM does near the beginning of its analysis).

So the key is to first understand SSA form. After that, a lot of optimization passes become obvious to think about.

---------

SSA has two bits to understand:

1. Static single assignment -- variables can be assigned, but once assigned they can never change. (This part is easy).

2. Graph-based, and therefore requires the 'Phi' function to understand -- at a control-flow join, a variable may take on two different values depending on which block you came from. EX: the value called "r0" in the original code may have been renamed %1 in Node#1 and %2 in Node#2. "Phi" is a "fake function" that resolves the ambiguity and kinda makes SSA code "real" again.

Hmm, maybe that's a bad explanation. But whatever, my point is that understanding "Phi" is the concept you want to focus on in SSA. Phi is the harder part, but it's still not that hard to understand.

Once you get it, the whole form "clicks" in your brain and you suddenly understand how optimizations can become possible.

[+] thrasumachos|4 years ago|reply
Nice that it even provides the right answer for negative numbers, where the loop relies on undefined signed overflow, but only when optimizations are enabled!
[+] titzer|4 years ago|reply
Nice catch. One of many instances where C/C++ compilers rely on signed integer wraparound being UB.
[+] ufo|4 years ago|reply
I'm very curious how it optimizes that recursive version at the end, because it is not tail recursive.

Does anyone know? Perhaps it becomes tail recursive after one round of inlining?

[+] pertymcpert|4 years ago|reply
It seems the TailRecursionElimination.cpp pass is responsible for that. It has some smarts to know that the xor instruction between the call and return can be turned into an "accumulator", although what exactly that means I'm unsure.
[+] martincmartin|4 years ago|reply
> you get a linear time O(n) isEven function

In complexity theory, the size of a problem is the number of bits needed to represent the input, so if the input integer is i, then the size is n = log_2(i), and the algorithm is actually exponential in the number of bits it takes to represent the input.

[+] dragontamer|4 years ago|reply
In a strict sense, you're right.

But in practice, you're not. Case in point: matrix multiplication is commonly quoted as O(n^3) (naive), when in fact the amount of data used is O(n^2), and therefore it should be quoted as O(size^1.5) with respect to data-size. (EDIT: was bad at math for a second).

But no, people mean "n-by-n matrix" has "O(n^3) naive" implementation. In practice, the "n" is just whatever is most convenient for a problem.

In many cases, n is proportional to the input size. But in other cases (such as the famous matrix multiplication examples), n is the sqrt(input-size).

[+] trias|4 years ago|reply
Is complexity theory constrained to binary representations? Why should it be?
[+] jagrsw|4 years ago|reply

  O(n) isEven function compared to the obvious constant time O(1) modulo algorithm
In the holy name of nitpickiness: the modulo operation is not always O(1), as div/mod CPU instructions have complex internal implementations and require a varying number of cycles depending on what the input args are.

However, for x%2 it'll probably be O(1), both at the CPU level and b/c it'll get optimized to something like x&1 by most compilers (at least for unsigned values)

[+] 8192kjshad09-|4 years ago|reply
I will nitpick your nitpick. In a language with arbitrary-size integers you're right. This is a C++ int, therefore 32/64 bits (in all sane cases, I know the spec doesn't specify), so the maximum number of cycles is upper bounded. Therefore modulo on C++ ints is O(1).
[+] athrowaway3z|4 years ago|reply
It states "constant time O(1) modulo _algorithm_" for implementing an isEven function.