top | item 21267445

Mispredicted branches can multiply your running times

114 points| ingve | 6 years ago |lemire.me | reply

110 comments

[+] kstenerud|6 years ago|reply
This is the main reason why I've codified a bunch of bit tricks so I don't forget them [1]. They're often not worth using, but can sometimes work as a last-mile optimization after you've done all your algorithmic changes, cache locality, sizing, alignment, etc.

[1] https://github.com/kstenerud/bit-tricks
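For flavor, here is a classic example of the kind of branchless bit trick such collections contain (a sketch from memory, not taken from the linked repo; function names are made up):

```c
#include <stdint.h>

/* Branchless absolute value. The arithmetic right shift yields a mask
   of all ones when v is negative and all zeros otherwise, so
   (v + mask) ^ mask negates v exactly when it is negative. */
int32_t branchless_abs(int32_t v) {
    int32_t mask = v >> 31;   /* 0 or -1 (implementation-defined in C,
                                 but arithmetic on mainstream compilers) */
    return (v + mask) ^ mask;
}

/* Branchless minimum: the sign of (a - b) selects between the two.
   Note that a - b may overflow for extreme inputs. */
int32_t branchless_min(int32_t a, int32_t b) {
    int32_t diff = a - b;
    return b + (diff & (diff >> 31));
}
```

As the parent says, these are rarely worth using up front: measure first, and reach for them only after the algorithmic and cache-level work is done.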

[+] mac01021|6 years ago|reply
I am not a computer engineer and I always wondered:

What if we had a modern, pipelined CPU architecture with no branch predictor, which instead unconditionally executed the X instructions immediately following every branch before (possibly) proceeding with the branch's target instruction?

Would compilers and programmers be able to find most of the efficiencies that we currently rely on the branch predictor to find? What would be the common cases where it would be hard to do that?

How much circuitry would we save by leaving out the predictor? Enough to allow a measurable speedup in the CPU clock speed?

[+] gpderetta|6 years ago|reply
You just reinvented delay slots [1].

Two issues:

- It is not easy for the compiler to always fill even a single delay slot. Filling dozens of them (as required for a deeply pipelined modern processor) would be significantly harder.

- The number of delay slots would depend on the depth of the CPU pipeline. If you do not want to expose microarchitectural details in your ISA, you need either JIT or install-time specialization. edit: or stick with a potentially suboptimal number.

[1] https://en.wikipedia.org/wiki/Delay_slot

[+] ThePadawan|6 years ago|reply
Not a computer engineer either, but I did do some GPGPU in college (~6 years ago).

Notably, branching on a flag f on the GPU was so slow that most of the time it was faster to (manually) compute both branches b1 and b2 and then calculate the result as r = f * b1 + (1-f) * b2 (where f is either 0 or 1).
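In scalar C terms, that GPU-style trick looks like this (a minimal sketch; names are illustrative, and the flag must be exactly 0 or 1):

```c
#include <stdint.h>

/* Branchless select: compute both candidate results, then blend them
   using a 0-or-1 flag instead of branching on it. */
int32_t select_blend(int32_t f, int32_t b1, int32_t b2) {
    return f * b1 + (1 - f) * b2;   /* f == 1 picks b1, f == 0 picks b2 */
}
```

On GPUs this avoids divergence within a warp; on CPUs the same shape is what compilers often turn into a conditional move.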

[+] ericpauley|6 years ago|reply
Aside from other comments about weaknesses of delay slots, modern processors already do this automatically in the form of out of order execution. A branch can be moved up and assigned priority to an execution unit if it doesn't depend on some of the previous instructions. These instructions can be queued or assigned to other execution units.
[+] marcosdumay|6 years ago|reply
> How much circuitry would we save by leaving out the predictor? Enough to allow a measurable speedup in the CPU clock speed?

Considering the complex predictors on current PCs, we would save quite a lot of circuitry. But that circuitry is there because it is the most effective place to spend transistors to increase CPU speed; if you used them for something else, speed would go down, not up (though power consumption would improve).

Also, clock speed by itself isn't really what matters, and it has a complex relation to overall CPU performance. The predictor circuitry isn't what limits the clock, so removing it wouldn't change the clock speed.

[+] saagarjha|6 years ago|reply
> If you are accessing the content of an array, many languages will add “bound checking”: before accessing the array value, there will be a hidden check to see whether the index is valid. If the index is not valid, then an error is generated, otherwise the code proceeds normally. Bound checks are predictable since all accesses should (normally) be valid. Consequently, most processors should be able to predict the outcome nearly perfectly.

Warning! This is just a stone's throw from the kind of thing that got us Spectre, so keep in mind this sort of optimization can occasionally come back to bite you…

[+] rdc12|6 years ago|reply
At the software level you don't get a choice about this, though: the CPU is going to speculate past a bounds check regardless; the optimization is at a different abstraction layer.

Pretty much anything on the market with the compute power of a cellphone is going to have a feature like this, too.

[+] Const-me|6 years ago|reply
> However, there are tricks that compilers cannot use safely.

Why not? VC++ emits cmov quite often, for the ternary operator (?:) and similar code.
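For illustration, this is the kind of code VC++ (and other compilers) will often lower to cmov, though whether it actually does depends on the compiler, flags, and target:

```c
#include <stdint.h>

/* Both operands are cheap and side-effect free, so the compiler can
   evaluate both and select the result with a conditional move instead
   of a conditional jump. */
int64_t max64(int64_t a, int64_t b) {
    return (a > b) ? a : b;
}
```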

[+] ncmncm|6 years ago|reply
> most loops are actually implemented as branches.

No kidding!

But the article does call attention to a very important technique. It merits serious thought over all the ways it can be applied, that may not look much like this one, on the surface.

[+] ithkuil|6 years ago|reply
> no kidding

Well, that also means that some loops are fully unrolled :-)

[+] sifar|6 years ago|reply
Most DSPs have zero-overhead loops (still branches) which don't incur a branch penalty. So something like the loop below may take 1-2 cycles at most, per iteration.

    for (i = 0; i < howmany; i++) {
        out[i] = random();
    }

Of course, variable-length loops would still have the branches.
[+] XJ6|6 years ago|reply
And just like that, a micro-optimisation has introduced a bug.

If the last number generated is even, it will still appear in the result set in the new code, and not in the old code.

[+] SloopJon|6 years ago|reply
It strikes me that the example is a bit contrived: why is howmany decremented unconditionally? The first loop populates out with howmany numbers. The second loop may set none (other than the ignored value that you point out). I suppose you can use the same trick:

  howmany -= (val bitand 1);
But that might complicate the benchmark.

Coincidentally, this was the subject of the first Stack Overflow question Bjarne addressed in an interview the other day:

https://stackoverflow.com/questions/11227809/why-is-processi...

[+] lifthrasiir|6 years ago|reply
The code assumes that, once the loop finishes, out[0] through out[index-1] contains the desired output. out[index] is not a part of that output.
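For context, the two loops under discussion look roughly like this in C (reconstructed from the article's description; names and signatures are assumptions):

```c
#include <stddef.h>
#include <stdint.h>

/* Branchy version: append only odd values. The 'if' is hard to predict
   when odd/even is effectively random, so it mispredicts often. */
size_t keep_odd_branchy(const uint32_t *in, size_t n, uint32_t *out) {
    size_t index = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t val = in[i];
        if (val & 1) {
            out[index] = val;
            index++;
        }
    }
    return index;
}

/* Branchless version: always store, but advance the index only for odd
   values. An even value lands at out[index] and is simply overwritten
   later; as the comment above notes, out[index] onward is not part of
   the output. 'out' must therefore have room for n elements. */
size_t keep_odd_branchless(const uint32_t *in, size_t n, uint32_t *out) {
    size_t index = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t val = in[i];
        out[index] = val;
        index += (val & 1);
    }
    return index;
}
```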
[+] inetknght|6 years ago|reply
> And just like that, a micro-optimisation has introduced a bug.

If only the function had a unit test to catch such bugs

[+] naikrovek|6 years ago|reply
This is one of those things that is completely lost on someone who has never written in a low level language. I automatically assume JavaScript developers to be completely oblivious to this entire class of software development knowledge.

It is important to understand your platform all the way down to the CPU, including things like branch prediction and caches if you want to have performant software.

Software has been getting slower more rapidly than hardware has been getting faster for nearly a decade, and the free performance gains that software people have taken advantage of when better hardware is made available are just about exhausted.

It's time to learn your platforms, software people.

[later addition] This kind of thing is also one of the reasons that teaching OOP principles as they are taught today is so bad for software performance. Modeling object relationships to match the real world will, in every non-toy program, produce object structures that are actively unfriendly to cache efficiency, and will therefore produce software which performs very poorly when compared to software that was written with the hardware platform in mind.
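The standard illustration of that cache point is array-of-structs versus struct-of-arrays (a sketch, not from the comment above; names are made up):

```c
#include <stddef.h>

/* OOP-style layout: one record per entity. A pass that only needs x
   still drags health, id, and name through the cache. */
struct EntityAoS {
    float x, y, z;
    float health;
    int   id;
    char  name[32];
};

/* Data-oriented layout: each field lives in its own dense array, so a
   pass over x touches nothing but x values. */
struct EntitiesSoA {
    float *x;
    float *y;
    float *z;
};

float sum_x_aos(const struct EntityAoS *e, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) s += e[i].x;
    return s;
}

float sum_x_soa(const struct EntitiesSoA *e, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) s += e->x[i];
    return s;
}
```

Both functions compute the same sum; the SoA version simply reads far fewer cache lines per element when the struct is large.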

[+] meheleventyone|6 years ago|reply
> This is one of those things that is completely lost on someone who has never written in a low level language. I automatically assume JavaScript developers to be completely oblivious to this entire class of software development knowledge.

This is a shame because not only are there lots of developers who write JS that have low-level backgrounds there are also a lot who haven't and are still interested. It seems rather snobby to jump in and write off a bunch of people because of the language they use on the assumption that they use the language without being aware of the implications.

On OOP versus more cache friendly approaches and branch prediction this talk from CPPCon 2019 is great: https://www.youtube.com/watch?v=HG6c4Kwbv4I

The interesting result is that although the DoD approach is eventually faster, a branch misprediction causes havoc until uncovered.

[+] gildas|6 years ago|reply
> I automatically assume JavaScript developers to be completely oblivious to this entire class of software development knowledge.

Surprisingly, in JavaScript the technique described in the article is quite effective on Firefox, whereas the gain is almost negligible on Chrome.

https://jsperf.com/mispredicted-branches

Edit: Jsperf seems to be down. Here are the 2 snippets of code I tested:

    // Unoptimized
    let howmany = 5000000;
    let index = 0;
    const out = [];
    while (howmany) {
        const val = Math.floor(Number.MAX_SAFE_INTEGER * Math.random());
        if (val % 2) {
          out[index] = val;
          index += 1;
        }
        howmany--;
    }


    // Optimized for branch prediction
    let howmany = 5000000;
    let index = 0;
    const out = [];
    while (howmany) {
        const val = Math.floor(Number.MAX_SAFE_INTEGER * Math.random());
        out[index] = val;
        index += (val & 1);
        howmany--;
    }
[+] vsareto|6 years ago|reply
>It is important to understand your platform all the way down to the CPU, including things like branch prediction and caches if you want to have performant software.

Let's not drastically increase job requirements for no good reason.

>It's time to learn your platforms, software people.

Many of these platforms have undocumented CPU instructions, so until you get a full accounting of that, what's the point? You can't learn the platform fully if they keep that a secret.

Secondly, we've had CPU-level issues like Spectre and Meltdown, whose fixes hurt performance in some cases. We can't even trust the platform makers to get it right!

[+] DrScientist|6 years ago|reply
And if your platform is a virtual machine or interpreter running across a number of chip archs?

Perhaps it would be better to simply have appropriate tests and benchmarks to see what's slow on what platform?

Otherwise it's guess work based on incomplete understanding of the many layers below.

[+] pjmlp|6 years ago|reply
And when you outsmart the compiler, the CPU gets a microcode update and those clever micro-optimizations are thrown out of the window.
[+] dspillett|6 years ago|reply
> It is important to understand your platform all the way down to the CPU, including things like branch prediction and caches if you want to have performant software.

While I would agree it can be useful to understand the lower levels, for the majority of developers macro-optimisations are far, far more important. Things like not forcing unnecessary redraws client-side (or excess DOM manipulation in general), caching values and references instead of rederiving them on each iteration of a loop, avoiding excess database hits, not using a naive sorting method, being aware of network latency issues, properly indexing the database, and so on, will dwarf the effect of micro-optimisations for branch prediction and CPU caching if you get them wrong.

> understand your platform all the way down to the CPU

Which platform, though? Not everyone is a back-end or embedded dev with constrained hardware to support. Are you targeting amd64 or ARM or something else? Any particular generation of CPU? In many cases an optimisation for one will have no effect elsewhere, or worse, will make others slower. Unless you are doing much tight-loop number crunching, is faffing around at this level really worth your time? Some understanding of cache concepts will help with overall algorithm design, but most of what that knowledge helps with generally (rather than for specific CPUs) is good practice for other reasons too.

> JavaScript developers

JS and other JIT compiled languages make tweaks for specific platforms even less important. For the most part let the optimising compiler worry about that for the platform it is compiling for at the time or consider a lower level language.

> never written in a low level language

I've written in assembly in the distant past (6502 and related, Z80, early x86, and a little of later x86). I did once use knowledge of cache sizes and population behaviour to optimise a little image processing, having the routine work on blocks of the pixel data small enough to stay in cache longer than they would if the work weren't divided into smaller chunks or jumped around more randomly. It was done in inline assembly because Delphi's compiler produced a massively less optimal result otherwise.

I have enough low-level knowledge to feel safe saying that most developers (especially junior devs away from embedded or other low-resource systems) don't really need it, at least not in as much detail, in the current multi-platform world. In the code they work on, the benefit of optimisations enabled by that knowledge is dwarfed by many other factors.

[+] rhinoceraptor|6 years ago|reply
How is it not a good thing that you don't need to worry about CPU branch prediction when you're writing Javascript to POST a form to your server?

In most cases you don't need to go that deep to fix performance issues. Your SPA is probably slow because your app bundle is 10 MB, not because you have an inner loop that is tripping the branch predictor.

[+] amelius|6 years ago|reply
Instead of peephole optimizing your code, I think more can be gained by ensuring that your code doesn't perform useless computations. For example, a game that recomputes the entire scene every frame, or a UI that recomputes an entire virtual-dom tree after every user action.
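A minimal version of that idea is a dirty flag: recompute the expensive result only when the underlying state has actually changed (a sketch with made-up names):

```c
#include <stdbool.h>

struct Scene {
    int  state;   /* stand-in for the real scene data */
    int  cached;  /* last computed result */
    bool dirty;   /* has state changed since cached was computed? */
};

static int expensive_compute(int state) {
    return state * 2;   /* stand-in for the real work */
}

void scene_set_state(struct Scene *s, int state) {
    if (s->state != state) {
        s->state = state;
        s->dirty = true;
    }
}

/* Callers always go through this accessor, so the expensive work runs
   at most once per state change, not once per frame. */
int scene_result(struct Scene *s) {
    if (s->dirty) {
        s->cached = expensive_compute(s->state);
        s->dirty = false;
    }
    return s->cached;
}
```

Virtual-DOM diffing and render-graph invalidation are elaborations of the same principle.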
[+] hamilyon2|6 years ago|reply
What good software modeling techniques could one learn instead of OOP?