Look out. Any time someone brings up microcontroller multiply, there's someone who feels compelled to chime in with "Well, I've never needed to multiply on a microcontroller, so why do you need to!"
The compiler already reduces integer division by a compile-time constant into exactly these multiply-and-shift sequences.
Those algorithms become more important when you're dividing by a value known at runtime but which remains the same during parts of the program. That's where libdivide comes in.
> libdivide allows you to replace expensive integer divides with comparatively cheap multiplication and bitshifts. Compilers usually do this, but only when the divisor is known at compile time. libdivide allows you to take advantage of it at runtime. The result is that integer division can become faster - a lot faster. [...] divide SIMD vectors by runtime constants, [1]
> libdivide.h is a header-only C/C++ library for optimizing integer division. Integer division is one of the slowest instructions on most CPUs; e.g. on current x64 CPUs a 64-bit integer division has a latency of up to 90 clock cycles, whereas a multiplication has a latency of only 3 clock cycles. libdivide allows you to replace expensive integer division instructions by a sequence of shift, add and multiply instructions that will calculate the integer division much faster.
> On current CPUs you can get a speedup of up to 10x for 64-bit integer division and a speedup of up to 5x for 32-bit integer division when using libdivide. libdivide also supports SSE2, AVX2 and AVX512 vector division, which provides an even larger speedup. You can test how much speedup you can achieve on your CPU using the benchmark program.[2]
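The multiply-and-shift rewrite those quotes describe can be sketched in a few lines of portable C. This is a simplified sketch, not libdivide's actual implementation (the names are made up, and it is restricted to 16-bit numerators so that a 64-bit intermediate provably suffices):

```c
#include <stdint.h>

/* Precompute once for a runtime divisor d (1 <= d < 2^16).
   m = ceil(2^32 / d); the +1 rounds up, and for power-of-two d
   this yields exactly 2^32 / d. */
static uint64_t magic_for(uint32_t d) {
    return 0xFFFFFFFFull / d + 1;
}

/* Replaces x / d for x < 2^16: one multiply and one shift.
   Exact because the rounding error x*(m*d - 2^32) stays below 2^32
   when both x and d fit in 16 bits. */
static uint32_t fast_div(uint32_t x, uint64_t m) {
    return (uint32_t)((x * m) >> 32);
}
```

Widening the numerator range is what forces real libraries into 128-bit multiplies and per-divisor fixup paths, which is most of libdivide's actual complexity.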
I learned about shifting to divide or multiply when writing my Game Boy emulator. Something I find fascinating is how little precision games of that fidelity actually need.
Say you’re animating a jump and the character slows down towards the apex. I’ve seen an approach where it’s just “every X frames, halve the speed until it’s zero.” This is done with the `SLA` (shift left, LSB set to 0) and `SRL` (shift right, MSB set to 0) operations, which are very fast (8 cycles!). The bonus is that these opcodes set the Zero flag if the result is zero, so you know when your jump animation (up) is done, assuming no ceiling collision, without having to check the value of the byte.
(I’ve also seen a set of hard coded speeds to animate through, and cases where they don’t change the speed at all, making the animation feel very rigid. It’s amazing what little details are important)
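A rough C sketch of that "halve every few frames" idea (hypothetical code, not actual Game Boy assembly — the `speed != 0` test plays the role of the Zero-flag check):

```c
#include <stdint.h>

/* Hypothetical: how many frames until the jump reaches its apex,
   if the upward speed is halved (SRL-style shift) every
   frames_per_halving frames. The loop ends exactly when the shift
   produces zero, mirroring the Zero-flag check on real hardware. */
static int frames_until_apex(uint8_t speed, int frames_per_halving) {
    int frames = 0;
    while (speed != 0) {          /* "Zero flag not set yet" */
        frames += frames_per_halving;
        speed >>= 1;              /* SRL: shift right, MSB filled with 0 */
    }
    return frames;
}
```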
I also wrote a Game Boy emulator recently with the purpose of educating myself. I haven't looked that deeply into what games do with respect to math, but I was really impressed by all the creative ways they abuse the PPU by changing its various registers mid-frame. "Batman: Return of the Joker", for example, uses an interesting effect of "compressing" the upper portion of the screen when you enter the options menu. It does this by changing the scroll Y register to specific values on specific lines.
Though I do have some math-related problems in there. I spent several hours on it but I couldn't figure out how the SBC instruction should affect the carry flags. No matter what I do I can't pass the test ROMs.
But even with that, many of my childhood games are 100% playable. Writing a video game console emulator is a rewarding experience!
> And it happens to be the case that 7 × 9 = 63, which is almost 64, a power of 2
While the article goes down a different rabbit hole, this also happens to be a neat example of (n-1)(n+1) = n² - 1, where n = 8.
Which is one of those things I haven't managed to find an actual use for, but it feels like it should have one, since the same pattern works for 0xFF (15 × 17), 0xFFFF (255 × 257), and so on.
It might be slower to use a memory access, even if you assume an L1 cache hit every time. You can pipeline the bitwise version and maybe achieve an amortized divide in one cycle, whereas (depending on process and many other things) the lookup table might peak at around 4 cycles per divide even in unrolled/parallel situations. Plus, the technique in the article works for just about any constant divisor and, if generalized, on much larger ranges; that's why many compilers use this trick when they can.
Before there were more complex instructions that performed an entire multiplication or division at once, computers handled this with dedicated multiply-step and divide-step shift instructions.
Compare the DEC PDP-1 (1959) and related algorithms here:
In cases where you need a floored result, or need it rounded in a certain direction, or know that the divisor is always positive, the compiler will often give you sub-optimal assembly. And sometimes the compiler just randomly fails to inline or constant-fold and outputs a big fat IDIV.
Also, if you have to debug with optimizations disabled, the compiler will give you deliberately garbage code, which can make the program you're debugging unusably slow. So you often end up hand-optimizing for that case.
Of course, this depends on where the hot path is, but I've had to do a lot of optimization for code that gets run billions of times per second. I used to think compilers were really smart, but after staring at enough assembly output and learning all their tricks, they don't seem that smart anymore. Especially Microsoft's compiler; I've seen it output redundant division instructions!
The author mentions that this is for a variable length (i.e. bigint) format. Sure, I bet mature bigint libraries do stuff like this automatically when sensible, but I still think it's interesting to read about someone figuring out such things on their own.
It's possible to use a lookup table (sort of) in a register. With x86-64-v2 and later, the code is larger, but potentially faster due to the shorter dependency chain:
I don't use lookup tables that much when optimizing these days, because most math instructions cost far fewer cycles than a cache miss. And even if your lookup table is in L3, it still costs ~40 cycles to retrieve each line.
Maybe easier, if limited to the specific case of an input in the range [0..63] and a divisor of 7, but not necessarily faster. Keep in mind that the latency of a memory read is typically around 4 cycles, and at least some architectures allow fewer concurrent memory transactions than concurrent integer math instructions. Also keep in mind that the article is generalizing beyond the initial problem involving the numbers 63 and 7: if you have more than a couple of different divisors, and you want to allow the input to grow as large as possible, a lookup table quickly becomes wasteful and impractical.
My go-to would have been multiplication by (1/7)*2^16 followed by a right shift. You still need to verify the constant (or add another correction after the multiply) to get it to work in every case. I used this once to extract the digits of a 16-bit integer from left to right.
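A sketch of that approach for the divide-by-7 case: the constant below is ceil(2^16 / 7) = 9363, and, as the comment says, it has to be verified against the input range — this particular constant is only exact up to x = 13109:

```c
#include <stdint.h>

/* Divide by 7 via a 16-bit fixed-point reciprocal.
   9363 = ceil(65536 / 7). The rounding error 9363*7 - 65536 = 5
   accumulates with x, so this is exact only for 0 <= x <= 13109;
   verify (or widen) the constant for larger input ranges. */
static uint32_t div7_approx(uint32_t x) {
    return (x * 9363u) >> 16;
}
```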
Very nice description of the bitwise maths; however, I'm curious what led to the need for `(63 - nlz) / 7` in the first place. A divisor of 7 is an odd thing to find when dealing with power-of-2-sized integers. Maybe this expression is used in some sort of expected-length calculation to find the number of bytes, for preallocating buffers? In most variable-length encoder loops I've seen, you just do a few bit twiddles to compute your encoded byte values, and your exit condition is a trivial check for remaining non-zero bits.
OP mentions "playing around with variable-length integer encodings."
`(63 - nlz) / 7` presumably tells you how many bytes you need to read for the varint. The length could be signaled by the number of leading zeros in the first byte (nlz), and each subsequent byte could set a high continuation bit (à la UTF-8 and protobuf varints), thus providing 7 bits of information per byte.
I'm totally speculating but this is generally what I'd expect from the expression `(63 - nlz) / 7` in the context of varints. Continuation bits and leading-zero-counters are redundant but this wouldn't be the first time I've seen something encode redundant information.
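One direction of that (speculative) scheme — computing how many 7-bit payload bytes a value needs before encoding it — can be sketched in C. The function name and the +1 for the final byte are my assumptions, not the author's code:

```c
#include <stdint.h>

/* Hypothetical: number of bytes needed to encode v as a varint
   carrying 7 payload bits per byte. (63 - nlz) is the index of the
   highest set bit; dividing by 7 groups the bits into bytes. */
static int varint_len(uint64_t v) {
    if (v == 0) return 1;              /* zero still takes one byte */
    int nlz = __builtin_clzll(v);      /* GCC/Clang builtin */
    return (63 - nlz) / 7 + 1;
}
```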
Naw, it’s standard to call such tricks “bitwise” even when including a multiply. Remember you’re multiplying into a specific bit range and then shifting down (dividing by a power of 2) to capture the bits you want, it’s bitwise in a very literal sense. Probably quite fair to call any expression “bitwise” if any single operation in the expression is bitwise, regardless of the other operators & functions, no? What part is hand-wavy? Variations of this technique are in standard widespread usage in the compilers we use.
Author here. I didn't run benchmarks. I'm suspicious of micro-benchmarks and I don't have a context where I can try it against realistic data. Also, I just enjoy the maths of it, even if it turns out not to make a huge performance difference in practice.
The book "Hacker's Delight" was really useful to me, even though I already had a good grasp of bitwise operators. I sat down with a pencil and worked through the bit-matrix-transpose example, and it really opened my eyes to how much interesting stuff can be done this way. I think the key to understanding is to visualize the bitwise operations.
Most of it is pattern recognition and learning common idioms, like multiplying by 0x01010101..., extracting substrings of bits ((x >> y) & z), thinking of values as sets of integers and so on.
Also bitwise manipulations lend themselves well to function composition, so try to keep an eye out for that. For example, if you have a fast function that checks if any byte is zero, you can generalize it to any value by applying xor beforehand. Similarly, the fast bit matrix transpose (which absolutely terrified me) is simply applying the least number of "shift a subset of bits" operations such that each bit travels the required distance.
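The zero-byte check and its XOR generalization look like this in C (these are the classic SWAR constants from Hacker's Delight / Bit Twiddling Hacks):

```c
#include <stdint.h>

/* Nonzero iff some byte of v is 0x00 (classic SWAR idiom):
   the subtraction borrows through a zero byte's high bit, and
   & ~v filters out bytes whose high bit was already set. */
static uint64_t has_zero_byte(uint64_t v) {
    return (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;
}

/* Generalized by composition: XOR with the broadcast byte turns
   "any byte == c" back into "any byte == 0". */
static uint64_t has_byte(uint64_t v, uint8_t c) {
    return has_zero_byte(v ^ (0x0101010101010101ULL * c));
}
```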
Honestly just practice. They cost me an interview I really wanted once, so now anytime I stumble across one in a project I take extra time to read and understand it. I also make a point of using bitmasks as function arguments in places where that's a good/reasonable choice.
Start counting in binary on your fingers until you're just about "binary native". Also, perform operations, and think about why they behave the way they do.
It sounds a little woo-woo but going from representing something on paper (or worse, a display) to doing it with my hands really changed it for me. Humans likely evolved intelligence because we stood up and used our hands to manipulate our environment, I suspect we more fully engage our brains in important ways by doing something with our hands when we can.
You might start appreciating the imperial system of measurement though. You've been warned.
Think about all the tricks you can do in decimal, but in base 2 instead.
E.g. in decimal you can multiply by ten by appending a zero, so in binary you multiply by 2 the same way, and left-shifting by 1 is what appends that zero.
Or: you can approximate the base-10 log of a decimal number by counting its digits, so in binary you can approximate log2 the same way (counting up to the most significant one-bit). Etc.
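In C, that "count the binary digits" approximation of log2 is one instruction away (using a GCC/Clang builtin here; most ISAs have a count-leading-zeros instruction behind it):

```c
#include <stdint.h>

/* floor(log2(x)) is the position of the most significant one-bit,
   i.e. "number of binary digits minus one". x must be nonzero,
   since __builtin_clz(0) is undefined. */
static int floor_log2(uint32_t x) {
    return 31 - __builtin_clz(x);
}
```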
Dividing a number by 7 using bitwise operations can be done using a technique called "magic numbers." This involves precomputing a constant that, when combined with a multiply and a shift, gives the correct result.
Here's how it can be done:
First, we need to find a "magic number" that approximates 1/7 in fixed point. Pick a shift amount s and let m = ceil(2^s / 7); then x / 7 = (x * m) >> s, provided the rounding error never pushes the product across an integer boundary — something that must be verified for the intended input range. With s = 35, the magic number m = ceil(2^35 / 7) = 4908534053 gives an exact result for every unsigned 32-bit x.
Here, >> is the bitwise right shift operator: shifting the wide product right by 35 places divides it by 2^35, and since m / 2^35 is only slightly larger than 1/7, truncation lands exactly on floor(x / 7).
Here's some example code in Python that implements this division algorithm:
    def div_by_7(x):
        # m = ceil(2**35 / 7) = 4908534053; exact for 0 <= x < 2**32
        return (x * 4908534053) >> 35
Note that this algorithm assumes that the input number x is a non-negative integer below 2^32. It will not work correctly for negative numbers or floating-point numbers.
[0] https://www.ti.com/lit/an/slaa329a/slaa329a.pdf
[1] https://libdivide.com/ [2] https://github.com/ridiculousfish/libdivide
Keep in mind that this doesn't use the fact that we know the input is between 0 and 63.
[0] https://dspace.mit.edu/handle/1721.1/6086
[1] https://en.m.wikipedia.org/wiki/Hacker%27s_Delight
lookup table with 64 entries anyone?
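For the original constraints (input in [0, 63], divisor 7) the table is indeed tiny; a sketch:

```c
#include <stdint.h>

/* 64-entry lookup table for x / 7, x in [0, 63]:
   nine runs of seven equal quotients, then the final entry 63/7 = 9. */
static const uint8_t DIV7[64] = {
    0,0,0,0,0,0,0, 1,1,1,1,1,1,1, 2,2,2,2,2,2,2, 3,3,3,3,3,3,3,
    4,4,4,4,4,4,4, 5,5,5,5,5,5,5, 6,6,6,6,6,6,6, 7,7,7,7,7,7,7,
    8,8,8,8,8,8,8, 9
};
```

As the sibling comments note, whether this beats the arithmetic version depends entirely on whether the table stays hot in L1.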
[0] https://www.masswerk.at/spacewar/inside/insidespacewar-pt6-g...
int div7 (int x) { return 9 - __builtin_popcountll (0x4081020408102040ULL >> x); }
https://godbolt.org/z/Wj6P6jYGM
I think any monotonic function on 0 .. 63 can be written this way, so it should be possible to fold in the outer 63 - nlz expression, too.
(I don't know what language is used here. I'm just assuming something like C.)
Alas, the commenting system on their website seems broken, so can't ask there.