
The Weird Concept of Branchless Programming

170 points | judicious | 5 months ago | sanixdk.xyz

90 comments


Joker_vD|5 months ago

Just so you know, the "return x < 0 ? -x : x;" compiles into

    abs_branch:
        mov     eax, edi
        neg     eax
        cmovs   eax, edi
        ret
on x64, and into

    abs_branch:
        srai    a5,a0,31
        xor     a0,a5,a0
        sub     a0,a0,a5
        ret
on RISC-V if you use a C compiler with a half-decent codegen. And "branchy" clamp() translates into

    clamp:
        cmp     edi, edx
        mov     eax, esi
        cmovle  edx, edi
        cmp     edi, esi
        cmovge  eax, edx
        ret
Seriously, the automatic transformation between ?: and if-then-else (in both directions) is quite well studied by now. And if you try to benchmark the difference between branching and branchless implementations, please make sure that the branches you expect are actually there in the compiler's output.

rcxdude|5 months ago

Yeah, if you care about branches for any reason then you need to at least verify the compiler's output, if not drop down to assembly completely.

zeckalpha|5 months ago

The inverse problem is here too:

> int partition_branchless(int* arr, int low, int high) {... for (int j = low; j < high; j++) { ... }... }

That for loop is just a sugared while loop which is just a sugared cmp and jmp
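
Schematically, the sugar lowers to something like (illustrative x86, not compiler output):

```asm
        mov   eax, esi        ; j = low
    .loop:
        cmp   eax, edx        ; j < high ?
        jge   .done           ; conditional jump: this is the branch
        ; ... loop body ...
        add   eax, 1          ; j++
        jmp   .loop
    .done:
```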

TuxSH|5 months ago

In fact, I have seen gcc optimize clever hacks that tried to use multiplication into conditional moves (on aarch32).

There is a common misconception that conditional moves == branching. On actually relevant architectures, they very much are not. Replacing a p=0.5 branch with a conditional move is in itself a significant optimization.

MontyCarloHall|5 months ago

This great video [0] demonstrates how CPU performance has only increased 1.5-2x over the past 15(!) years when executing extremely branchy code. Really shows you just how deep modern CPU pipelines have become.

The video also showcases a much more impressive branchless speedup: computing CRC checksums. Doing this naïvely with an if-statement for each bit is ~10x slower than doing it branchless with a single bitwise operation for each bit. The author of the article should consider showcasing this too, since it's a lot more impressive than the measly 1.2x speedup highlighted in the article. I assume the minimal/nonexistent speedups in the article are due to modern CPU branch prediction being quite good. But branch predictors inherently fail miserably at CRC because the conditional is on whether the input bit is 1 or 0, which is essentially random.

[0] https://www.youtube.com/watch?v=m7PVZixO35c

hinkley|5 months ago

Branchless is also useful for cryptographic transforms as it frustrates timing attacks. And that’s a situation where it only needs to be relatively fast compared to the branching alternative because we are trying to improve security while limiting the overhead of doing so.

SynasterBeiter|5 months ago

The linked video doesn't take into account power consumption of these CPUs. He seems to be comparing laptop CPUs, NUC CPUs and desktop CPUs. If you compare a 100W CPU and a 30W CPU that's a couple of years newer, you shouldn't be surprised there isn't much of a difference in performance.

sfilmeyer|5 months ago

I enjoyed reading the article, but I'm pretty thrown by the benchmarks and conclusion. All of the times are reported to a single digit of precision, but then the summary claims that one function shows an improvement while the other two are described as negligible. When all the numbers presented are "~5ms" or "~6ms", it leaves me wondering whether small changes to the benchmarking might have substantially changed that conclusion.

gizmo686|5 months ago

Yeah. When your timing results are a single digit multiple of your timing precision, that is a good indication you either need a longer test, or a more precise clock.

At a 5ms baseline with millisecond precision, the smallest improvement you can measure is 20%. And you cannot distinguish a 20% speedup from a 20% slowdown that happened to get lucky with clock ticks.

For what it is worth, I ran the provided test code on my machine with a 100x increase in iterations and got the following:

  == Benchmarking ABS ==
  ABS (branch):     0.260 sec
  ABS (branchless): 0.264 sec

  == Benchmarking CLAMP ==
  CLAMP (branch):     0.332 sec 
  CLAMP (branchless): 0.538 sec

  == Benchmarking PARTITION ==
  PARTITION (branch):     0.043 sec
  PARTITION (branchless): 0.091 sec
Which is not exactly encouraging (gcc 13.3.0, -ffast-math -march=native. I did not use the -fomit-this-entire-function flag, which my compiler does not understand).

I had to drop down to O0 to see branchless be faster in any case:

  == Benchmarking ABS ==
  ABS (branch):     0.743 sec
  ABS (branchless): 0.948 sec

  == Benchmarking CLAMP ==
  CLAMP (branch):     4.275 sec
  CLAMP (branchless): 1.429 sec

  == Benchmarking PARTITION ==
  PARTITION (branch):     0.156 sec
  PARTITION (branchless): 0.164 sec

Joel_Mckay|5 months ago

In general, modern compilers will often unroll or inline functions without people even noticing. This often helps with cache level state localization and parallelism.

Most code should focus on readability, then profile for busy areas under use, and finally refactor the busy areas through hand optimization or register hints as required.

If one creates something that looks suspect (inline Assembly macro), a peer or llvm build will come along and ruin it later for sure. Have a great day =3

FooBarBizBazz|5 months ago

I used to do stuff like this (ok, not half as smart), but stopped around 2013 or so, as the distinction between "implementation defined" behavior (ok) and "undefined" behavior (not ok) started to matter and bite.

After thinking through this carefully, though, I do not see UB (except for signed overflow in a corner case):

Step 1, bit shift.

I understand that, until C++20, left shift of a negative signed int was UB. But this right shift appears to be implementation defined, which is ok if documented in the code.

Step 2: Add.

Then, (x + mask) is defined behavior (a) for positive x, since then mask=0, and (b) for almost all negative x, since then mask=-1. However, if x is numeric_limits::lowest, then you get signed integer overflow, which is UB.

Step 3, XOR.

Then the final XOR doesn't have UB AFAICT. It wouldn't be UB as of C++20, when signed integers became two's complement officially. It might have been implementation defined before then, which would be almost as good for something that ubiquitous, but I'm not sure.

Ok, so I think this does not involve UB except for an input of numeric_limits::lowest().

Sound about right?

To fix this, perhaps you would need to make that + an unsigned one?

It bothers me how hard you need to think to do this language lawyering. C++ is a system of rules. Computers interpret rules. The language lawyer should be a piece of software. I get it, not everything can be done statically, so, fine, do it dynamically? UBSan comes closest in practice, but doesn't detect everything. I understand formally modeled versions of C and C++ have been developed commercially, but these are not open source so they effectively do not exist. It's a strange situation.

Just the other day I considered writing something branchless and said, "nah", because of uncertainty around UB. How much performance is being left on the table around the world because of similar thinking, driven by the existence of UB?

Maybe I was supposed to write OCaml or Pascal or Rust.

teo_zero|5 months ago

Well, I would say that implementation defined is ok only if you have full control over the entire compilation process. If your code aims at universality you should find better tricks.

The UB on the add happens in cases where all incarnations of abs() would fail as well, because there simply isn't a correct return value.

wucke13|5 months ago

Rust in particular with miri is quite impressive at catching them. You just run your testcases via

    cargo miri run
And if your code actually touches UB, miri will most likely point out exactly where and why.

adrianmonk|5 months ago

I've always wondered if any CPUs have tried to reduce the branch penalty by speculatively executing both ways at once in parallel. You'd have two of everything (two pipelines, two ALUs, two sets of registers, etc.) and when you hit a conditional branch, instead of guessing which way to go, you'd essentially fork.

Obviously that requires a lot of extra transistors and you are doing computation that will be thrown away, so it's not free in terms of space or power/heat/energy. But perhaps it could handle cases that other approaches can't.

Even more of a wild idea is to pair up two cores and have them work together this way. When you have a core that would have been idle anyway, it can shadow an active core and be its doppelganger that takes the other branch. You'd need to have very fast communication between cores so the shadow core can spring into action instantly when you hit a branch.

My gut instinct is it's not worth it overall, but I'm curious whether it's been tried in the real world.

jasonwatkinspdx|5 months ago

Yes, it's been looked at. If you wanna skim the research use "Eager Execution" and "Disjoint Eager Execution" as jumping off points.

It doesn't require duplicating everything. You just need to add some additional bookkeeping of dependencies and what to retire vs kill at the end of the pipeline.

In practice branch predictors are so good that speculating anything beyond the "spine" of the most likely path just isn't worth it.

In fact there were a lot of interesting microarchitectural ideas from the late 90s to early 00s that just ended up being moot because the combination of OoO speculation, branch predictors, and big caches proved so effective.

hawk_|5 months ago

What has worked out very well in practice is hyper-threading. So you take instructions from two threads and if one of them is waiting on a branch the units of the CPU don't go unused.

recursivecaveat|5 months ago

They do this on FPGA a lot. Since you know statically the content of the branches, and you need to have resources there to run either of them, it is pretty low overhead to set them up to run in parallel and select the appropriate result afterwards.

Someone|5 months ago

> You'd have two of everything (two pipelines, two ALUs, two sets of registers, etc.)

As others said: yes, it has been tried and it works, but it costs a lot in hardware and power usage. A problem is that lots of code has a branch every 10 or so instructions. Fast high-end CPUs (the only realistic target for this feature) can dispatch multiple instructions per cycle. Combined that means you will hit a branch every two or three cycles. Because of that, you do not end up with two of everything but with way more.

So, you’re throwing away not 50% of your work but easily 80%.

Some code has fewer branches, but that often can easily be parallelized or vectorized.

mshockwave|5 months ago

yes, it has been done for at least a decade if not more

> Even more of a wild idea is to pair up two cores and have them work together this way

I don't think that'll be profitable, because...

> When you have a core that would have been idle anyway

...you'll just schedule in another process. A modern OS rarely runs short of tasks to run

okaleniuk|5 months ago

Some safety-critical real-time systems have strict time predictability requirements. This means that the whole loop should pass in exactly X microseconds, no more, no less. For this, all the programming should be pretty much branchless.

For instance, all sorting algorithms turn into, effectively, bubble sort, since without branches you always go with the worst case, and the complexity is always O(n^2). But that's okay, since the algorithm becomes predictable. You just replace a conditional swap:

    if a < b then swap(a, b)
with arithmetic swap:

    s = a + b
    d = |a - b|
    a = (s + d) / 2
    b = (s - d) / 2
and that's it.

ashdnazg|5 months ago

O(n^2) isn't required. One could do an in-place merge sort, which also always runs its worst case, but in O(n*log(n)).

I suspect everyone turns to bubble sort since the inputs are small enough that it doesn't matter (as evidenced by the fact that they fit within microseconds).

Legend2440|5 months ago

Article doesn’t mention this, but I’d consider neural networks a form of branchless programming. It’s all a bunch of multiply and threshold operations.

david-gpu|5 months ago

> It’s all a bunch of multiply and threshold operations.

Real-world high-performance matrix multiplication functions do contain branches internally, even on GPUs. If you are ever curious about what that looks like, NVidia maintains an open-source library called CUTLASS.

abyesilyurt|5 months ago

Thresholding requires branching no?

starchild3001|5 months ago

I’m a big advocate of branchless programming — keeping configurations to a minimum and maintaining as much linear flow as possible, with little to no cfg-driven branching.

Why? I once took over a massive statistics codebase with hundreds of configuration variables. That meant, in theory, upwards of 2^100 possible execution paths, a combinatorial explosion that turned testing into a nightmare. After I linearized the system, removing the exponential branching and reducing it to a straightforward flow, things became dramatically simpler. A messy codebase that had once taken years to stabilize became easy to reason about and, in practice, effectively bug-free.

Some people dismissed the result as "monolithic," which is a meaningless label if you think about it. Yes, the code did one thing and only one thing, but it did that thing perfectly, every single time. It wasn't pretending to be a bloated, half-tested "jack of all trades" statistics library with hidden modes and brittle edge cases.

I’m proud of writing branchless (or “monolithic” code if you prefer). To me, it’s a hallmark of programming maturity -- choosing correctness and clarity over endless configurability, complexity and hidden modes.

zackmorris|5 months ago

I need something like this for a switch() command (technically a list of arbitrary functions). Sort of like up to N branches in one step.

The idea is to take some number of inputs A, B, C, ... and conceptually perform all of the possible functions simultaneously, then keep the one that's desired and throw the rest away. For any arbitrary logic. Ideally using fewer operations than all of that, but that's optional. Driven by one variable, it would look like:

  // branch version
  switch(var1)
  {
    case 1:
      var4 = var2 + var3;
      break;
    case 2:
      var4 = var2 - var3;
      break;
    case 3:
      var4 = var2 * var3;
      break;
    case 4:
      var4 = var2 / var3;
      break;
    // ...
    default:
      var4 = 0; // optional and arbitrary
      break;
  }
  
  // branchless version
  var4 = magic(var1, var2, var3);
I don't know how to do this outside of programmable hardware like an FPGA. The problem is that it's extremely challenging to write/solve the formulas that map ordinary arithmetic and bitwise functions to the larger set of functions.

So for now, I may just use a switch() command in a shader and let it figure out any reusable logic internally. I don't know if shaders have come far enough to allow that performantly, or if they end up just calculating everything and throwing away the paths not taken. Which would suggest that the max speed would be O(1/N) for the number of cases.

Does anyone know? Maybe truth tables? Or a SAT solver? I'm also not sure how this would work for floating point, but that's optional too.

Edit: I updated the cases to show how var1 controls the math performed on var2, var3 and var4.

AnotherGoodName|5 months ago

Think of a way you can make each case's selector 0 or 1; then you have a single sum over all possibilities, where that 0 or 1 multiplies the portion of the switch statement it represents (0 means not active, 1 includes it).

E.g., if we had just 1 or 2 as options for the case statements, note that test1 = xor(var1 & 0x1, var1>>1 & 0x1) & var1 will, with integer rounding, equal 1 if var1 is 1, and 0 if var1 is 2, 3 or 4. (If it goes to 5 you need another xor at the next highest bit.)

Likewise test2 = xor(var1 & 0x1, var1>>1 & 0x1) & var1<<1 equals ‘0’ if var1 is 1, 3 or 4 but it equals ‘1’ if it’s two.

So test1 * (value for var1 == 1) + test2 * (value for var1 == 2) will get you a branchless version for the simple 1-or-2 case. This will pretty much always be much slower than the branching version, but some types of systems out there don't really do branching and are linear. This type of zeroing trick is really common on matrix multipliers.

Essentially you're creating binary logic gates that give 0 or 1 for the exact value you want and blindly multiplying that by the value it would have produced if 1.

yccs27|5 months ago

The most general approach is to calculate all the different cases, and then do a branchless selection:

    var4 = (var1==1)*(var2+var3) | (var1==2)*(var2-var3) | ...
Of course this is basically the slowest possible option, but it might work as a starting point for a general-purpose optimizer to find a faster solution. If it doesn't, this will likely compile to conditional move instructions.

It might help if you replace the comparison and multiplication with an equivalent expression made from bitwise operations, but I believe most compilers already know how to do this transformation.

fwip|5 months ago

Maybe I'm missing something obvious, but how about:

    vars[1] = var2 + var3;
    vars[2] = var2 - var3;
    ...
    vars[0] = 0;
    var4 = vars[var1];

judicious|5 months ago

I’m by no means an expert on this topic, but what about using predefined hash map of value to function call/closure unless that also yields branched programming underneath? Additionally, maybe you want to use macros of some kind to generate the “magic” function you’re looking for, where it constructs a single branchless statement up to N terms, whether it’s procedural macros in Rust or C macros.

zackmorris|5 months ago

Thank you everyone for your answers, they got me thinking in new ways.

In case anyone ever finds this, I realized that a brute force solution would be a lookup table of all combinations. For three 8 bit variables, that's 2^24=16,777,216 or 16 MB of EPROM.

Then we could apply truth table simplification techniques like Karnaugh maps, sum of products (SOP), products of sums (POS), etc:

https://workforce.libretexts.org/Bookshelves/Electronics_Tec...

https://tma.main.jp/logic/index_en.html

https://karnaughmapsolver.com/

I asked AI, and if the bits approximate a pseudorandom pattern, then we would need approximately log2(2^24)=24 levels of 2-input boolean logic gates. For NAND or NOR gates, apparently that's something like 3x2^27=400 million gates. If we could manage 24-input gates somehow, that might be reduced to ceil(log2(24))=5 levels, but internally that would occupy about the same amount of die area. Perhaps using a hybrid of gates, multiplexers and EPROMS could result in a much smaller circuit at the cost of higher propagation delay (latency), which is the goal of FPGAs.

We could then play with the ordering of the control variable var1 to explore all 256 possibilities and see if any result in a reduced number of logic levels (by maximizing the size of Karnaugh map loops). I suspect that might only save a few levels at most though, but even that could cut the number of gates by 50-90%.

Thankfully the control variable var1 might only have 16-64 possibilities, which lowers it to 4-6 bits or just 2^20=1,048,576 to 2^22=4,194,304 or 1-4 MB of EPROM, but still about 3*2^25=100 million gates worst case.

For 16, 32 or 64 bit integers or floats, it might be better to just calculate all cases with dedicated hardware and select the right one.

rmnclmnt|5 months ago

Great article, triggers some memories.

When you get to think about branchless programming, especially for SIMD optimizations in the real world, you always learn a lot and it's as if you get a +1 level on your algorithmic skills. The hardest part then is making sure the tricks are clearly laid out so that someone else can take it from here next time.

kccqzy|5 months ago

It triggered some memories for me too. A college professor wanted to teach all the bit-manipulation stuff and gave an assignment where students had to transform branchy code into branchless code using shifts and bit operators. Had a lot of fun doing that.

stevefan1999|5 months ago

I kind of like branchless programming, because SIMD programming borrows some of its concepts directly from branchless techniques, for example using bit masks to decide which lanes a scatter or gather touches.

For example, I can give it an array A, a vector V and a mask, which outputs another vector O; then if position i in the bit mask is 1, O[i] will pick the element A[V[i]], otherwise just A[i].

In Python this may sound like [A[V[i]] if M[i] else A[i] for i in range(len(V))], so it is very branchy, but in SIMD this would just be a bunch of SIMD operations without branches!

Speaking of which, this is particularly informative about what "branchless programming" really is: https://en.m.wikipedia.org/wiki/Predication_(computer_archit...

If you want to learn about the ultimate form of branchless programming, check out Church encoding: https://en.m.wikipedia.org/wiki/Church_encoding and https://gautier.difolco.dev/2025-09/extreme-branchless-expr-...

dehrmann|5 months ago

Shameless plug, but a while back, I implemented a "branchless" (I think it actually branches, but not in the usual sense) binary search in C. It was just a POC to see if 1) I could be clever enough with bitwise operators to do it and 2) someday write a SIMD binary search.

https://github.com/ehrmann/branchless-binary-search

senfiaj|5 months ago

Interesting article. I remember solving a LeetCode problem called "Trapping Rain Water": https://leetcode.com/problems/trapping-rain-water/descriptio...

On leetcode the "elegant" and fastest solution was something like this:

  class Solution {
  public:
    int trap(vector<int>& height) {
        int i = 1;
        int n = height.size();
        int j = n - 1;
        int leftmax = height[0];
        int rightmax = height[n - 1];
        int tw = 0;
        while (i <= j) {
            if (leftmax <= rightmax) {
                if (leftmax > height[i]) {
                    tw += leftmax - height[i];
                } else {
                    leftmax = height[i];
                }
                i++;
            } else {
                if (rightmax > height[j]) {
                    tw += rightmax - height[j];
                } else {
                    rightmax = height[j];
                }
                j--;
            }
        }
        return tw;
    }
  };
My solution was this:

  class Solution {
  public:
    int trap(const vector<int>& height) {
        int mx = 0;
        long long sum = 0;

        for (const auto h : height) {
            mx = max(mx, h);
            sum += mx - h;
        }

        int rMx = 0, h, i;

        for (i = height.size() - 1; (h = height[i]) != mx; --i) {
            rMx = max(h, rMx);
            sum += rMx;
        }

        return sum - (height.size() - 1 - i) * mx;
    }
  };
However, I compiled and ran some benchmarks on my local machine, and it turned out that while my solution was slower on average, it beat that solution in the worst case. My solution was less sensitive to the random input and the degradation was far less dramatic, around 2x, while that solution degraded much, much more. I figured this was probably because my solution is more predictable for the CPU branch predictor than that one, despite iterating twice.

ckw|5 months ago

My slow solution:

  class Solution:
    def trap(self, height: List[int]) -> int:
      l,r,s = [*accumulate(height,max)],[*accumulate(height[::-1],max)],len(height)
      return sum(max(0, min(l[i], r[s-i-1])-height[i]) for i in range(s))

QuantumSeed|5 months ago

I was an APL programmer for many years and I insisted on writing as much of my code as possible without branches. Fortunately, it was reasonably easy to do in that language, if sometimes resulting in very cryptic code.

HarHarVeryFunny|5 months ago

The ARM instruction set used to make nearly all instructions conditional to avoid the need for branching, but has since moved away from that, as branch prediction is so good nowadays.

mshockwave|5 months ago

The article is easy to follow but I think the author missed the point: branchless programming (a subset of the better-known constant-time programming) is almost exclusively used in cryptography nowadays. As shown by the benchmarks in the article, modern branch predictors have easily achieved over 95% if not 99% accuracy for a decade now.

vdupras|5 months ago

In the part about "abs", there's an assembly breakdown:

    mov eax, edi
    sar eax, 31
    mov ecx, eax
    add edi, ecx
    xor eax, edi

Has this been generated by a C compiler? If yes, it's a bit puzzling, because can't you remove "mov ecx, eax", replace "add edi, ecx" by "add edi, eax" and have the exact same result?

senfiaj|5 months ago

This is what was generated by clang/gcc for x64 with the O2/O3 flag on godbolt.org:

  abs_branchless(int):
        mov     eax, edi
        neg     eax
        cmovs   eax, edi
        ret

  abs_branch(int):
        mov     eax, edi
        neg     eax
        cmovs   eax, edi
        ret

userbinator|5 months ago

If you look at compiler output, you will always see plenty of small stupidities.

klas_segeljakt|5 months ago

Would it be possible to create a programming language that allows you to write branching programs, but is guaranteed to compile it into branchless code?

rao-v|5 months ago

I’m amused to see a for loop in a function that is purportedly branchless

(not a critique)

bluGill|5 months ago

Loops are very predictable in general so the branch isn't a problem. Of course this assumes your branch condition is standard - you can do weird things in the end condition that would fool the CPU.

Animats|5 months ago

This is just cutesy on CPUs, but is a big part of GPU programming.

itemize123|5 months ago

off topic, what are good resources to dive into gpu programming (for someone mostly in the cpu world)

nixpulvis|5 months ago

Branchless programming tactics also avoid security issues around timing side channels leaking predicate bits, which is somewhat interesting.

presentation|5 months ago

In the partition example isn’t the for loop condition a branch technically?

gtirloni|5 months ago

I think I'm getting too sensitive to AI-assisted articles. So many whimsical terms to make something "fun".

mwkaufma|5 months ago

Is cmov branchless, or just branching by another name?

senfiaj|5 months ago

From my understanding, branches are conditional jump instructions. Here are some of them:

  JZ, JE - Jump if Zero, Jump if Equal
  JNZ, JNE - Jump if Not Zero, Jump if Not Equal
  JC - Jump if Carry
  JNC - Jump if No Carry
  JO - Jump if Overflow
  JNO - Jump if No Overflow
  JS - Jump if Signed (Negative)
  JNS - Jump if Not Signed (Positive or Zero)
  JP, JPE - Jump if Parity, Jump if Parity is Even
  JNP, JPO - Jump if Not Parity, Jump if Parity is Odd
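
cmov, by contrast, consumes the same flags without ever redirecting the instruction stream. A schematic x86-64 max(a, b) both ways (my sketch, not compiler output):

```asm
    ; branchy: jge is a conditional jump, i.e. a branch
    max_branch:
        mov   eax, edi
        cmp   eax, esi
        jge   .done
        mov   eax, esi
    .done:
        ret

    ; branchless: cmov picks a value, control flow never forks
    max_cmov:
        mov   eax, edi
        cmp   edi, esi
        cmovl eax, esi
        ret
```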

stevefan1999|5 months ago

That depends on how you define branch.

Say in Rust:

let foo = if bar { 1 } else { 2 };

And

let mut foo; if bar { foo = 1; } else { foo = 2; }

Despite looking the same and functioning effectively the same, the first one is the conditional move, and the second one would initially be a jump (until further compiler optimizations kick in).

You will notice that with a conditional move you "get" a predictable expression for the result, but with a branched jump it's like you "get" a bunch of arbitrary statements that write to the variable. They may end up folding so that both are essentially compiled to cmov, but the representation of the assignment is different. You can be certain with conditional instructions, but you can't be certain with a branched jump; otherwise we wouldn't need branch prediction.

In fact, the way conditional instructions work is due to Church encoding, that you created a lambda function that calls the left or right function depending on the input evaluation, which can be seen as implicitly embedding the branch.

teo_zero|5 months ago

CMOV is branchless. During all steps of its execution, the CPU doesn't have to invalidate any stage of its pipeline (well, excluding errors like accessing forbidden memory addresses, etc.).

Scene_Cast2|5 months ago

Another interesting take is how GPUs handle branching (latest NVidia GPUs in particular - see the register interconnect and how they swap threads without saving registers).