
Bitwise Division

167 points | stek29 | 3 years ago | h14s.p5r.org

74 comments

[+] roger10-4|3 years ago|reply
Horner’s method [0] is also commonly used on small microcontrollers without divide (or multiply) instructions.

[0] https://www.ti.com/lit/an/slaa329a/slaa329a.pdf
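The app note's shift-add style is easy to sketch in C. A hypothetical example (the function name and constant are mine, not from the TI note): multiplying by 10 on a core with no multiply instruction, using the decomposition 10x = 2*(4x + x).

```c
#include <stdint.h>

/* Multiply by the constant 10 without a multiply instruction, via the
 * shift-add decomposition 10x = 2 * (4x + x) -- the kind of sequence a
 * multiply-less MCU would use. Illustrative sketch only. */
static uint16_t mul10(uint16_t x) {
    uint16_t t = (uint16_t)((x << 2) + x);  /* t = 5x */
    return (uint16_t)(t << 1);              /* 10x = 2 * 5x */
}
```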

[+] kjs3|3 years ago|reply
Look out. Any time someone brings up microcontroller multiply, there's someone who is compelled to chime in with "Well, I've never needed to multiply on a microcontroller, so why do you need to?!".
[+] dev_tty01|3 years ago|reply
Or when the controller has a divide but it is painfully slow.
[+] mgaunard|3 years ago|reply
The compiler is already reducing integer division by a constant into these things.

Those algorithms become more important when you're dividing by a value known at runtime but which remains the same during parts of the program. That's where libdivide comes in.
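The underlying idea can be sketched in a few lines (a sketch of the principle, not libdivide's actual code): precompute a fixed-point reciprocal of the runtime divisor once, then replace each division with a multiply and a shift. One well-known scheme for 32-bit operands, assuming a GCC/Clang-style `__uint128_t`; the names here are mine:

```c
#include <stdint.h>

/* Sketch of the idea behind libdivide, not its implementation:
 * for 32-bit n and d (with d > 1), M = floor(2^64 / d) + 1 makes
 * (n * M) >> 64 an exact quotient, so the per-division cost is one
 * widening multiply plus a shift. */
typedef struct { uint64_t M; } fastdiv_t;

static fastdiv_t fastdiv_prepare(uint32_t d) {   /* requires d > 1 */
    fastdiv_t f = { UINT64_MAX / d + 1 };        /* ceil(2^64 / d) */
    return f;
}

static uint32_t fastdiv(uint32_t n, fastdiv_t f) {
    return (uint32_t)(((__uint128_t)f.M * n) >> 64);
}
```

Amortized over many divisions by the same d, the one-time `fastdiv_prepare` cost disappears, which is exactly the use case described above.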

[+] mncharity|3 years ago|reply
libdivide tl;dr:

> libdivide allows you to replace expensive integer divides with comparatively cheap multiplication and bitshifts. Compilers usually do this, but only when the divisor is known at compile time. libdivide allows you to take advantage of it at runtime. The result is that integer division can become faster - a lot faster. [...] divide SIMD vectors by runtime constants, [1]

> libdivide.h is a header-only C/C++ library for optimizing integer division. Integer division is one of the slowest instructions on most CPUs e.g. on current x64 CPUs a 64-bit integer division has a latency of up to 90 clock cycles whereas a multiplication has a latency of only 3 clock cycles. libdivide allows you to replace expensive integer division instructions by a sequence of shift, add and multiply instructions that will calculate the integer division much faster.

> On current CPUs you can get a speedup of up to 10x for 64-bit integer division and a speedup of up to 5x for 32-bit integer division when using libdivide. libdivide also supports SSE2, AVX2 and AVX512 vector division which provides an even larger speedup. You can test how much speedup you can achieve on your CPU using the benchmark program.[2]

[1] https://libdivide.com/ [2] https://github.com/ridiculousfish/libdivide

[+] eru|3 years ago|reply
See https://godbolt.org/z/3Yqbceaza for what godbolt says clang produces.

Keep in mind that this doesn't use the fact that we know the input is between 0 and 63.

[+] Waterluvian|3 years ago|reply
I learned about shifting to divide or multiply when writing my game boy emulator. Something I find fascinating is the lack of precision needed for games of that fidelity.

Say you’re animating a jump and the character slows down towards the apex. I’ve seen an approach where it’s just “every X frames halve the speed until it’s zero.” This is done with the `SLA` (shift left, LSB filled with 0) and `SRL` (shift right, MSB filled with 0) operations, which are very fast (8 cycles!). The bonus is that these opcodes set the Zero flag if the result is zero, so you know when your jump animation (up) is done if there is no ceiling collision, without having to check the value of the byte.

(I’ve also seen a set of hard coded speeds to animate through, and cases where they don’t change the speed at all, making the animation feel very rigid. It’s amazing what little details are important)
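In C terms the trick amounts to something like this sketch (the real thing would be a couple of shift instructions in Game Boy assembly; the names here are made up):

```c
#include <stdint.h>

/* "Every X frames halve the speed until it's zero": one SRL per decay
 * step. A returned 0 signals the upward phase of the jump is over,
 * with no separate comparison needed on hardware that sets the Zero
 * flag as part of the shift. */
static uint8_t decay_speed(uint8_t speed, uint8_t frame, uint8_t x) {
    if (frame % x == 0)
        speed >>= 1;   /* SRL: shift right, MSB filled with 0 */
    return speed;
}
```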

[+] grishka|3 years ago|reply
I also wrote a game boy emulator recently with the purpose of educating myself. I haven't looked that deep into what games do wrt math, but I was really impressed with all the creative ways they abuse the PPU by changing its various registers mid-frame. "Batman return of the Joker" for example uses this interesting effect of "compressing" the upper portion of the screen when you enter the options menu. It does this by changing the scroll Y to specific values on specific lines.

Though I do have some math-related problems in there. I spent several hours on it but I couldn't figure out how the SBC instruction should affect the carry flags. No matter what I do I can't pass the test ROMs.

But even with that, many of my childhood games are 100% playable. Writing a video game console emulator is a rewarding experience!
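For what it's worth, here is my understanding of SBC's flag behaviour, sketched in C (worth double-checking against the test ROMs; `cpu_t` and `sbc` are made-up names): the operation is A - n - carry_in, C is set on a full borrow, and H on a borrow out of the low nibble.

```c
#include <stdint.h>

/* Sketch of SBC A,n flag semantics on the Game Boy CPU as I
 * understand them, not a verified reference implementation. */
typedef struct { uint8_t a; int z, n, h, c; } cpu_t;

static void sbc(cpu_t *cpu, uint8_t val) {
    int cin = cpu->c ? 1 : 0;
    int res = cpu->a - val - cin;
    cpu->h = ((cpu->a & 0xF) - (val & 0xF) - cin) < 0;  /* half-borrow */
    cpu->c = res < 0;                                   /* full borrow */
    cpu->a = (uint8_t)res;
    cpu->z = cpu->a == 0;
    cpu->n = 1;                                         /* subtract flag */
}
```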

[+] vanderZwan|3 years ago|reply
> And it happens to be the case that 7 × 9 = 63, which is almost 64, a power of 2

While the article goes down a different rabbit hole, this also happens to be a neat example of (n-1)(n+1) = n² - 1, where n = 8.

Which is one of those things I haven't managed to find an actual use for, but it feels like it should have one, since you can use it with 0xFF (15 × 17), 0xFFFF (255 × 257), and so on.

[+] gus_massa|3 years ago|reply
It looks nicer in hex: 0xFF = 0xF * 0x11 and 0xFFFF = 0xFF * 0x101
[+] marginalia_nu|3 years ago|reply
You can derive a relationship between division and shift-add-multiply in a cute way by noting that

   1       1        1   1 
  ---  -  ---  =   --- --- (a-b)
   b       a        b   a

Move 1/a to the right hand side

   1      1     1   1 
  ---  = --- + --- --- (a-b)
   b      a     b   a  

               [X]
Substitute the whole right-hand side in for the 1/b marked with X, and you get

   1     1    (a-b)       1
  --- = --- + ----- (1 + --- (a-b))
   b     a     a^2        b

Repeat and eventually you get

      __
  1   \    -(n+1)    n
 --- = >  a     (a-b)
  b   /_  
      n=0...inf
For example with b=3 and a=2, you get

  1     1     1     1     1
 --- = --- - --- + --- - --- + ...
  3     2     4     8     16
Word of warning though: this method tends to produce nasty carry errors.
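Picking a as a power of two turns the whole series into shifts. With a = 4 and b = 3 it reads 1/3 = 1/4 + 1/16 + 1/64 + ..., which gives the classic shift-add divide-by-3. A sketch (the name is mine) with an explicit fixup for the carry errors just mentioned — every truncated shift underestimates, so a small correction loop at the end makes the result exact:

```c
#include <stdint.h>

/* Divide by 3 via the series with a = 4, b = 3:
 * 1/3 = 1/4 + 1/16 + 1/64 + ...
 * Each truncated shift only under-estimates, so q <= n/3 throughout
 * and the final loop absorbs the accumulated carry errors. */
static uint32_t divu3(uint32_t n) {
    uint32_t q = (n >> 2) + (n >> 4);  /* q ~= n * 0.0101 (binary)    */
    q += q >> 4;                       /* fold in the remaining terms */
    q += q >> 8;
    q += q >> 16;
    uint32_t r = n - q * 3;            /* small non-negative residual */
    while (r >= 3) { q++; r -= 3; }    /* correct the truncation      */
    return q;
}
```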
[+] pestatije|3 years ago|reply
(63 - nlz) / 7, where nlz is between 0 and 63

lookup table with 64 entries anyone?
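A sketch of what that would look like (64 bytes, so a single cache line; the names are mine):

```c
#include <stdint.h>

/* 64-entry lookup table for (63 - nlz) / 7 with nlz in 0..63,
 * built once at startup. */
static uint8_t div7_tab[64];

static void div7_tab_init(void) {
    for (int i = 0; i < 64; i++)
        div7_tab[i] = (uint8_t)(i / 7);
}

/* usage: div7_tab[63 - nlz] in place of (63 - nlz) / 7 */
```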

[+] dahart|3 years ago|reply
It might be slower to use a memory access, even if you assume an average L1 cache hit. You can pipeline the bitwise version and maybe achieve an amortized divide in one cycle, where (depending on process and many other things) the lookup table might peak at like 4 cycles per divide even in unrolled/parallel situations. Plus the technique in the article is good for just about any constant divisor and if generalized works on much larger ranges, it’s why many compilers use this trick when they can.
[+] hoseja|3 years ago|reply
A bitshift oneliner IS nicer.
[+] cormacrelf|3 years ago|reply
Don’t modern compilers do this automatically (and aggressively) for almost any division by a constant?
[+] gary_0|3 years ago|reply
Sometimes, but not always.

In cases where you need a floored result, or need it rounded in a certain direction, or know that the divisor is always positive, the compiler will often give you sub-optimal assembly. And sometimes the compiler just randomly fails to inline or constant-fold and outputs a big fat IDIV.

Also, if you have to debug with optimizations disabled, the compiler will give you deliberately unoptimized code, which can make the program you're debugging unusably slow. So you often end up hand-optimizing for that case.

Of course, this depends on where the hot path is, but I've had to do a lot of optimization for code that gets run billions of times per second. I used to think compilers were really smart, but after staring at enough assembly output and learning all their tricks, they don't seem that smart anymore. Especially Microsoft's compiler; I've seen it output redundant division instructions!

[+] vintermann|3 years ago|reply
The author mentions that this is for a variable length (i.e. bigint) format. Sure, I bet mature bigint libraries do stuff like this automatically when sensible, but I still think it's interesting to read about someone figuring out such things on their own.
[+] RagnarD|3 years ago|reply
Wouldn’t a lookup table be a whole lot easier? I’d think a 64 entry table of integers would be fast to access once cached.
[+] fweimer|3 years ago|reply
It's possible to use a lookup table (sort of) in a register. With x86-64-v2 and later, the code is larger, but potentially faster due to the shorter dependency chain:

    int div7(int x) {
        return 9 - __builtin_popcountll(0x4081020408102040ULL >> x);
    }

https://godbolt.org/z/Wj6P6jYGM

I think any monotonic function on 0 .. 63 can be written this way, so it should be possible to fold in the outer 63 - nlz expression, too.

[+] gary_0|3 years ago|reply
I don't use lookup tables that much when optimizing these days, because most math instructions cost far fewer cycles than a cache miss. And even if your lookup table is in L3, it still costs ~40 cycles to retrieve each line.
[+] dahart|3 years ago|reply
Maybe easier, if limited to the specific case of an input in the range [0..63] and a divisor of 7, but not necessarily faster. Keep in mind that the latency of a memory read is typically around 4 cycles, and at least some architectures allow fewer concurrent memory transactions than concurrent integer math instructions. Also keep in mind that the article generalizes beyond the initial problem involving the numbers 63 and 7: if you have more than a couple of different divisors, and if you want to allow the input to go as large as possible, then a lookup table quickly becomes wasteful and impractical.
[+] vintermann|3 years ago|reply
The divisor is from 0 to 63. The dividend is a variable bit length integer, i.e. a bigint.
[+] phkahler|3 years ago|reply
My go-to would have been multiplication by (1/7)*2^16 followed by a right shift. You still need to verify the constant (or add another correction after the multiply) to get it to work in every case. I used this once to extract the digits of a 16-bit integer from left to right.
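A sketch of that idea for dividing by 7 (the constant has to round up: ceil(2^16 / 7) = 9363; a quick error bound shows the result matches x / 7 exactly up to x = 13107 and first slips at x = 13110, so it comfortably covers a 13-bit input — the range restriction is my assumption to keep the example honest):

```c
#include <stdint.h>

/* x / 7 as a fixed-point multiply by ceil(2^16 / 7) = 9363 followed
 * by a right shift. Exact for 0 <= x <= 13107; first wrong answer is
 * at x = 13110, where the accumulated rounding error crosses 1/7. */
static uint16_t div7_fixed(uint16_t x) {   /* assumes x <= 13107 */
    return (uint16_t)(((uint32_t)x * 9363u) >> 16);
}
```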
[+] eru|3 years ago|reply
Your C compiler will do this for you automatically. See eg https://godbolt.org/z/3Yqbceaza

(I don't know what language is used here. I'm just assuming something like C.)

[+] jsd1982|3 years ago|reply
Very nice description of the bitwise maths; however, I'm curious what led to the need for `(63 - nlz) / 7` in the first place. A divisor of 7 is an odd thing to find when dealing with power-of-2-sized integers. Maybe this expression is used in some sort of expected-length calculation to find the number of bytes, for preallocating buffers? In most variable-length encoder loops I've seen, you just do a few bit twiddles to compute your encoded byte values, and your exit condition is a trivial check for remaining non-zero bits.
[+] cornstalks|3 years ago|reply
OP mentions "playing around with variable-length integer encodings."

`(63 - nlz) / 7` presumably tells you how many bytes you need to read for the varint. The length could be signaled by the number of leading zeros in the first byte (nlz), and each subsequent byte could set a high continuation bit (à la UTF-8 and protobuf varints), thus providing 7 bits of information per byte.

I'm totally speculating but this is generally what I'd expect from the expression `(63 - nlz) / 7` in the context of varints. Continuation bits and leading-zero-counters are redundant but this wouldn't be the first time I've seen something encode redundant information.
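The speculation above can be sketched as follows (hypothetical names; assumes GCC/Clang's `__builtin_clzll` and, like the expression itself, a nonzero input):

```c
#include <stdint.h>
#include <stddef.h>

/* If each encoded byte carries 7 payload bits (high bit used as a
 * continuation flag, as in protobuf varints), the byte count for a
 * nonzero v is ceil(bit_length / 7), with bit_length = (63 - nlz) + 1.
 * That is exactly (63 - nlz) / 7 + 1. */
static size_t varint_len(uint64_t v) {      /* requires v != 0 */
    int nlz = __builtin_clzll(v);
    return (size_t)((63 - nlz) / 7 + 1);
}
```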

[+] jameshart|3 years ago|reply
Nice, if a little bit hand wavy. Seems a little bit of a stretch to call an operation that still includes a multiply bitwise, though.
[+] alpaca128|3 years ago|reply
The final non-generalized version uses two additions and two bitwise shifts, no multiplication:

    (v + (v << 3) + 9) >> 6
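
And the identity can be checked exhaustively over the article's whole 0..63 domain in a few lines:

```c
#include <assert.h>

/* Exhaustive check that the shift-add form agrees with a real
 * division for every input the article cares about. */
static void check_div7_identity(void) {
    for (unsigned v = 0; v < 64; v++)
        assert(((v + (v << 3) + 9) >> 6) == v / 7);
}
```
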
[+] dahart|3 years ago|reply
Naw, it’s standard to call such tricks “bitwise” even when including a multiply. Remember you’re multiplying into a specific bit range and then shifting down (dividing by a power of 2) to capture the bits you want, it’s bitwise in a very literal sense. Probably quite fair to call any expression “bitwise” if any single operation in the expression is bitwise, regardless of the other operators & functions, no? What part is hand-wavy? Variations of this technique are in standard widespread usage in the compilers we use.
[+] eru|3 years ago|reply
I wonder whether the author ever ran a benchmark?

Alas, the commenting system on their website seems broken, so can't ask there.

[+] plesner|3 years ago|reply
Author here. I didn't run benchmarks. I'm suspicious of micro-benchmarks, and I don't have a context where I can try it against realistic data. Also, I just enjoy the maths of it, even if it turns out not to make a huge performance difference in practice.
[+] cabirum|3 years ago|reply
Just precompute a list of values?
[+] victor106|3 years ago|reply
I always have a hard time grokking bitwise operations. Any good resources?
[+] PhilipRoman|3 years ago|reply
"Hacker's Delight" book was really useful for me even though I already had a good grasp on bitwise operators. I sat down with a pencil and went through the bit matrix transpose example and it really opened my eyes to how much interesting stuff can be done this way. I think the key to understanding is to visualize the bitwise operations.

Most of it is pattern recognition and learning common idioms, like multiplying by 0x01010101..., extracting substrings of bits ((x >> y) & z), thinking of values as sets of integers and so on.

Also bitwise manipulations lend themselves well to function composition, so try to keep an eye out for that. For example, if you have a fast function that checks if any byte is zero, you can generalize it to any value by applying xor beforehand. Similarly, the fast bit matrix transpose (which absolutely terrified me) is simply applying the least number of "shift a subset of bits" operations such that each bit travels the required distance.
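The zero-byte check and its XOR generalization look like this (the classic idiom from Hacker's Delight / Bit Twiddling Hacks; the function names are mine):

```c
#include <stdint.h>

/* A byte of x is zero iff the corresponding high bit survives this
 * subtract-and-mask: subtracting 1 from a zero byte borrows through
 * it, and the ~x term rules out bytes that already had their high
 * bit set. */
static int has_zero_byte(uint64_t x) {
    return ((x - 0x0101010101010101ULL) & ~x
            & 0x8080808080808080ULL) != 0;
}

/* Composition: XOR with the target byte repeated in every lane turns
 * "any byte equals b?" into "any byte equals zero?". */
static int has_byte(uint64_t x, uint8_t b) {
    return has_zero_byte(x ^ (0x0101010101010101ULL * b));
}
```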

[+] Icathian|3 years ago|reply
Honestly just practice. They cost me an interview I really wanted once, so now anytime I stumble across one in a project I take extra time to read and understand it. I also make a point of using bitmasks as function arguments in places where that's a good/reasonable choice.
[+] earlyam|3 years ago|reply
Start counting in binary on your fingers until you're just about "binary native". Also, perform operations, and think about why they behave the way they do.

It sounds a little woo-woo but going from representing something on paper (or worse, a display) to doing it with my hands really changed it for me. Humans likely evolved intelligence because we stood up and used our hands to manipulate our environment, I suspect we more fully engage our brains in important ways by doing something with our hands when we can.

You might start appreciating the imperial system of measurement though. You've been warned.

[+] Aardwolf|3 years ago|reply
Think about all the tricks you can do in decimal, but in base 2 instead.

E.g. in decimal you can multiply by ten by appending a zero, so in binary you multiply by 2 the same way. And left shifting by 1 is what appends that zero.

Or you can approximate the base-ten log of a decimal number by counting its digits, so in binary you can approximate log2 the same way (counting up to the most significant one-bit). Etc...
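The binary digit-count analogue fits in a few lines of C (`ilog2` is just my name for it):

```c
/* floor(log2(x)) is the position of the most significant one-bit --
 * the binary analogue of counting decimal digits. */
static int ilog2(unsigned x) {   /* assumes x != 0 */
    int n = 0;
    while (x >>= 1)
        n++;
    return n;
}
```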

[+] secondcoming|3 years ago|reply
Very cool and it must have been really satisfying to have figured it out.
[+] edf13|3 years ago|reply
ChatGPT tells me:

Dividing a number by 7 using bitwise operations can be done using a technique called "magic numbers." This involves precomputing a number that, when combined with bitwise operations, will give the correct result.

Here's how it can be done:

    First, we need to find a "magic number" that will help us divide by 7. One way to do this is to use the fact that 7 is a prime number, which means that there is a number k such that 2^k ≡ -1 (mod 7). This number is called the "multiplicative inverse" of 2 modulo 7. We can find k by repeatedly squaring 2 until we get -1 modulo 7:

    2^1 ≡ 2 (mod 7)
    2^2 ≡ 4 (mod 7)
    2^3 ≡ 1 (mod 7)
    2^4 ≡ 2 (mod 7)
    2^5 ≡ 4 (mod 7)
    2^6 ≡ 1 (mod 7)

So k = 3, since 2^3 ≡ 1 (mod 7).

    Next, we need to combine this magic number with bitwise operations. Specifically, we can use the following formula to divide a number x by 7:

    x/7 = (x + (x >> 3)) * k

Here, >> is the bitwise right shift operator, which shifts the bits of x to the right by 3 places, effectively dividing x by 8. The expression (x + (x >> 3)) adds these two values together, which effectively multiplies x by 9. Multiplying by k then gives the correct result of dividing by 7.

Here's some example code in Python that implements this division algorithm:

    def div_by_7(x):
        k = 3
        return (x + (x >> 3)) * k

Note that this algorithm assumes that the input number x is a non-negative integer. It may not work correctly for negative numbers or floating-point numbers.

[+] byteware|3 years ago|reply
so... 7/7 = (7+(7>>3))*3 = 21? gotcha