In hardware this would be called a carry save adder [1].
One interesting fact about carry save adders is that the carry part of the register can be extended in order to avoid any carrying between words for very long periods of time. Instead of using 53 bits to represent 52 bits, use 60 bits, and now you can perform 256 sequential additions with no carrying between words before the carry part saturates and needs to be handled.
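A software sketch of the same idea (assuming 52-bit digits stored in 64-bit words, so 12 spare carry bits and up to 2^12 additions before a word could overflow — the limb count and function names here are mine, not from any particular library):

```python
MASK52 = (1 << 52) - 1  # each limb stores a 52-bit digit in a 64-bit word

def to_limbs(n, count=4):
    return [(n >> (52 * i)) & MASK52 for i in range(count)]

def from_limbs(limbs):
    return sum(l << (52 * i) for i, l in enumerate(limbs))

def add_delayed(acc, limbs):
    # Limb-wise add with no carry propagation between words; the carries
    # just pile up in the 12 spare bits of each word. Safe for up to
    # 2**12 additions before a word would exceed 64 bits.
    return [a + l for a, l in zip(acc, limbs)]

def propagate_carries(acc):
    # Handle all the saved-up carries in one pass at the end.
    out, carry = [], 0
    for a in acc:
        v = a + carry
        out.append(v & MASK52)
        carry = v >> 52
    return out, carry
```

Each intermediate addition is pure limb-wise parallelism; only the final pass is sequential.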
Somewhat surprisingly, it's even possible to use this construction in the context of a quantum computation [2][3]. The reason it's surprising is that tacking on additional registers like this can act as an information leakage mechanism (which is an error in a quantum computation). By analogy, if you computed a public key using this trick, you would not be comfortable handing an attacker the registers storing your public key before you removed the carry padding that made the computation faster. But it turns out that if you just initialize the carry padding randomly, subtracting it out of the next word at the start to ensure everything sums to the correct result at the end, then you can show that the information leakage and the chance of a bad overflow are both exponentially suppressed in the length of the carry padding. Currently the most efficient known way to implement Shor's algorithm uses this technique [4].
> Aside: Why 13 bits instead of 12? For our purposes, we’re going to ignore the carries in the most significant limb, allowing numbers to wrap when they overflow past 2^256 - 1 (just like how unsigned addition works in C with normal size integer types). As a result, we can assign 52 bits to the most significant limb and ignore the fact that it will run out of room for carries before the other limbs do.
If you're going to wrap, you could go all the way and assign 64 bits to the most significant limb; that way you save 12 bits which you can spread to the other four limbs. You can go from {52, 51, 51, 51, 51} to {64, 48, 48, 48, 48}, so you have a spare 16 bits instead of 13 bits.
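A sketch of that wrap-around layout (Python, least-significant limb first, my own function names): four 48-bit limbs plus a 64-bit top limb that simply wraps, giving addition mod 2^256 overall.

```python
M48, M64 = (1 << 48) - 1, (1 << 64) - 1

def split_48(n):
    # {48, 48, 48, 48, 64} layout, least-significant limb first
    return [(n >> (48 * i)) & M48 for i in range(4)] + [(n >> 192) & M64]

def combine_48(limbs):
    return (sum(l << (48 * i) for i, l in enumerate(limbs[:4]))
            + (limbs[4] << 192)) & ((1 << 256) - 1)

def normalize_48(limbs):
    out, carry = [], 0
    for l in limbs[:4]:
        v = l + carry
        out.append(v & M48)
        carry = v >> 48
    out.append((limbs[4] + carry) & M64)  # top limb wraps, like C unsigned overflow
    return out
```

The low limbs have 16 spare bits each; the top limb never needs any because overflow past 2^256 is discarded by design.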
Indeed, if you use only 48 bits, you could also parallelize using the FP hardware: the double-precision mantissa is 52 bits, so with 48-bit limbs you get 16 rounds before a carry is needed. That's far fewer than the 2^16 (or even 2^13) rounds the integer scheme allows, but on processors that have distinct FP and integer adders, and that can issue them in parallel, you might get a speed boost.
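A quick check of the arithmetic behind the FP variant (a sketch — real code would use SIMD doubles; the 16-round bound follows from 16 * (2^48 - 1) < 2^53, the range where doubles represent integers exactly):

```python
def fp_accumulate(rows):
    # Accumulate 48-bit limbs in doubles; the column sums stay exactly
    # representable for up to 16 additions, after which carries would
    # have to be propagated before continuing.
    acc = [0.0] * len(rows[0])
    for row in rows:
        acc = [a + float(l) for a, l in zip(acc, row)]
    return [int(a) for a in acc]
```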
The design of Poly1305 was directly informed by this very technique, with the addition (heh) of performing the arithmetic on the x87 FPU operating in 80-bit mode. It even has some zero padding bits placed strategically to aid the technique, hence the security bound of 2^106 instead of 2^128 or 2^130.
I have an arbitrary precision arithmetic library in my mind that represents numbers as streams of digits starting from the most significant digit. A redundant numeric representation could avoid unbounded backtracking (think flipping between 0.9999... and 1.0000... in a regular representation) and could guarantee forward progress for the generated digits.
I love that the same trick works for human brains as well as CPUs. Your brain already pays the price to represent numbers as high level abstract concepts rather than tightly packed fields. Just let the digits overflow, then clean them up later.
For my undergrad thesis I studied “Fibinary”, or Fibonacci coding. You can think of it like “base phi”, using Fibonacci numbers as the place values, and only 0,1.
The “normalized” representation, called Zeckendorf representation, perfectly packs error correction into the binary, as it ensures no number will ever have consecutive ones.
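The greedy algorithm produces the Zeckendorf form directly — take the largest Fibonacci number that fits, repeat — and the no-consecutive-ones property falls out because the remainder after each step is always smaller than the next Fibonacci number down. A sketch (my own naming):

```python
def zeckendorf(n):
    # Most-significant digit first; place values are 1, 2, 3, 5, 8, ...
    fibs = [1, 2]
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    bits = []
    for f in reversed(fibs[:-1]):
        if f <= n:
            bits.append(1)
            n -= f
        else:
            bits.append(0)
    return bits
```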
#include <stdint.h>

uint32_t bytewise_add(uint32_t a, uint32_t b) {
    uint32_t s = a ^ b;           // sum bits
    uint32_t r = a & 0x7f7f7f7f;  // clear msbs
    uint32_t t = b & 0x7f7f7f7f;  // clear msbs
    s = s & 0x80808080;           // msb sum bits
    r = r + t;                    // add without msbs; carries stay inside each byte
    return r ^ s;                 // fold in the msb sum bits, with no carry-out between bytes
}
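A quick sanity check of the trick above, ported to Python (masking to 32 bits to mimic the register; the reference adds each byte independently, discarding its carry-out):

```python
def bytewise_add(a, b):
    s = (a ^ b) & 0x80808080                 # sum bits at each byte's msb
    r = (a & 0x7f7f7f7f) + (b & 0x7f7f7f7f)  # 7-bit adds; carries stay inside bytes
    return (r ^ s) & 0xffffffff              # fold the msb sum bits back in

def bytewise_add_reference(a, b):
    # add byte by byte, discarding each byte's carry-out
    return sum(((((a >> k) & 0xff) + ((b >> k) & 0xff)) & 0xff) << k
               for k in (0, 8, 16, 24))
```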
Very interesting. It may seem that the bad performance of the ADC instruction is due to bad design of the ALU, but I think this is more an example of how replacing a single sequential task with several parallel tasks, whose results are combined at the end, lets the CPU finish faster.
He’s testing on Haswell, where (at least according to Agner Fog) ADC has an extra uop compared to ADD. On Broadwell and later, non-memory ADC is a single uop with a single cycle of latency. He would probably see a big speedup on a newer CPU.
Carries were a major problem in mechanical calculators as well. For example, the upper section of the Curta had a special bell & gearing to allow carries. The Friden STW - a mechanical behemoth - had a complex set of gears within the moving carriage to allow carries & borrows. Same with Monroe & Marchant mechanical calculators. For electric motor driven calculators, the additional weight of the moving carriage - as well as the gearing - set limits on how fast the machine could add and subtract.
It's amazing that the 17 instructions at the end to tidy up all the carries, which looks like it has to be done serially, is still faster. But I guess each register's carry bits are independent from the ones that get carries added from the last register, so it could be ...
mov W, E
mov V, D
mov U, C
mov T, B
shr W, 51
shr V, 51
shr U, 51
shr T, 51
add D, W
add C, V
add B, U
add A, T
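In Python terms, with the masking of each limb's low 51 bits made explicit (the shift-and-add round alone isn't sufficient on its own: adding a carry can push an almost-full limb past 2^51, so a complete normalization has to loop):

```python
MASK51 = (1 << 51) - 1

def propagate_once(limbs):
    # One parallel round: mask off each limb's spilled carry bits
    # and add them into the next limb up. Least-significant limb first.
    out = [l & MASK51 for l in limbs[:-1]] + [limbs[-1]]
    for i in range(1, len(limbs)):
        out[i] += limbs[i - 1] >> 51
    return out

def propagate_all(limbs):
    while any(l >> 51 for l in limbs[:-1]):
        limbs = propagate_once(limbs)
    return limbs

def limb_value(limbs):
    return sum(l << (51 * i) for i, l in enumerate(limbs))
```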
Cool trick. I wonder how it would compare in performance to a carry-lookahead adder [1], which uses the fact that you can compute any carry digit directly from a combination of the original inputs (which is increasingly long the deeper you go).
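For reference, the lookahead formula in bit-list form: each carry c[i+1] is expanded into generate/propagate terms over the original inputs only (g = a AND b, p = a OR b), so no carry depends on the previous carry. A toy O(n^3) sketch, not the log-depth circuit form:

```python
def lookahead_carries(a_bits, b_bits):
    # bits are least-significant first
    g = [x & y for x, y in zip(a_bits, b_bits)]   # position generates a carry
    p = [x | y for x, y in zip(a_bits, b_bits)]   # position propagates a carry
    carries = [0]
    for i in range(len(g)):
        # expanded: c[i+1] = g[i] | p[i]g[i-1] | p[i]p[i-1]g[i-2] | ...
        c = 0
        for j in range(i, -1, -1):
            term = g[j]
            for k in range(j + 1, i + 1):
                term &= p[k]
            c |= term
        carries.append(c)
    return carries
```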
This reminds me of "zeroless number systems" or "bijective number systems" (https://en.wikipedia.org/wiki/Bijective_numeration#The_bijec...) - in the language of the article, the allowed digits in base 10 are 123456789A. Note that 0 is not an allowed digit.
For instance here are the first few natural numbers in this system: 1,2,3,4,5,6,7,8,9,A,11,12...
I first read about them in Chris Okasaki's purely functional data structures book, they have some applications in terms of designing data structures (e.g. combining perfect-powers-of-two sized heaps to build heaps of arbitrary size).
They're conceptually interesting too, being a positional number system with no redundancy (exactly one way to represent each number) and without 0. Since the concept of a 0 digit is not required, one could argue that it's conceptually easier for e.g. a classical Roman mathematician to learn this than the positional system we do use today.
Also, Excel's system for numbering columns uses this - A, B, C, ... X, Y, Z, AA, AB ...
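Excel's scheme is exactly bijective base 26, and the conversion is a one-line tweak to ordinary base conversion — subtract 1 before each divmod so the digits run 1..26 instead of 0..25 (function name is mine):

```python
import string

def column_name(n):
    # bijective base 26: digits A..Z stand for 1..26; there is no zero digit
    name = ""
    while n > 0:
        n, r = divmod(n - 1, 26)
        name = string.ascii_uppercase[r] + name
    return name
```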
A question here is whether this trick (which I acknowledge is very interesting) is worth doing on Broadwell. Broadwell has a pair of new instructions (ADCX and ADOX) that don't behave like a normal ADC - they only read and set a single bit (carry bit and overflow bit). As such you can do two of these at once and at least on recent Intel processors there are two execution ports (0 and 6) that can handle these instructions.
You'd still need some cleverness if your goal is to add just one giant bignum (as opposed to doing lots of independent add-with-carry sums).
>"The key insight here is that we can use this technique to delay carry propagation until the end. We can’t avoid carry propagation altogether, but we can avoid it temporarily. If we save up the carries that occur during the intermediate additions, we can propagate them all in one go at the end."
Brilliant and tremendously interesting!
Useful for if/when I write a fast super-long integer math library in the future (yes, I know GNU already has one, but it's always interesting to know/understand tricks like this for >64-bit, AKA "too-long-to-fit-in-a-single-register" integers...)
What I'm wondering about now is: does that 'adc' opcode have any other useful application except for adding numbers bigger than the width of a single register?
Because it does look a little bit as if it was engineered for that particular situation, and if that was the case, with this situation now in practice being solved much more efficiently without a single 'adc' instruction, the opcode might suddenly be completely useless bloat - which of course still had to be maintained forever, even though nobody uses it in practice, because it's part of the spec.
It's not useless bloat though. It's probably hard to beat in terms of code size, which is often even relevant for performance (due to cache size considerations).
Furthermore, the article is talking about adding many numbers, which is just one of the possible use cases. As far as I can tell, adc is still useful for adding a pair of 128-bit numbers on a 64-bit CPU for example.
I believe adc is still commonly used as a way of reading the carry bit. For example, assuming eax has already been zeroed, adc eax, eax would be the idiomatic way of reading the carry bit into eax.
The carry flag gets set by all sorts of instructions other than addition and subtraction (comparisons, bit shifts, multiplication, etc) so this is more useful than you might think.
x86 has a lot of "useless bloat" with legacy high-level instructions. Modern compilers avoid those, and newer CPUs aren't engineered with performance in mind for those instructions since nobody sane uses them anymore.
That might be another reason why it is slow -- it is an old opcode that has to be supported for compatibility, but isn't prioritized in any of the pipelines in newer chip designs.
The post did not mention potential cost of splitting and masking these into 5 limbs instead of 4. Perhaps I am missing that in this proposed system, 256-bit ints are always stored in 5 registers, then you only normalize when the result is needed? What are the costs of splitting/masking into 5-limbs and could the scaling impact the gains of parallel adds/subtracts meaningfully?
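The split and recombine steps are indeed a handful of shifts and masks each way; the usual answer is to keep values in limb form across many operations so the conversion cost is paid only at the boundaries. A sketch of the {52, 51, 51, 51, 51} split (least-significant limb first; names are mine):

```python
SIZES = [51, 51, 51, 51, 52]  # least-significant first; the top limb gets 52 bits

def split_limbs(n):
    limbs, shift = [], 0
    for size in SIZES:
        limbs.append((n >> shift) & ((1 << size) - 1))
        shift += size
    return limbs

def combine_limbs(limbs):
    n, shift = 0, 0
    for l, size in zip(limbs, SIZES):
        n += l << shift
        shift += size
    return n & ((1 << 256) - 1)
```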
Since good bignum libraries already use tricks like this, and many others besides (like using FFT to handle very large bignum multiplication), it doesn't seem that the hardware needs to change.
There's a whole subfield of adder design focused on carrying, since it's a big dependency chain even within one register. I believe this is a software version of what some adder circuits already do (although I'll need some reading to check).
[1] https://en.wikipedia.org/wiki/Carry-save_adder
[2] https://arxiv.org/abs/1905.08488
[3] https://youtu.be/upTipX9yXNg?t=777
[4] https://arxiv.org/abs/1905.09749
See also the Poly1305 paper: https://cr.yp.to/mac/poly1305-20050113.pdf
More links on redundant number representations:
https://en.wikipedia.org/wiki/Redundant_binary_representatio...
https://en.wikipedia.org/wiki/Carry-save_adder
http://lux.dmcs.pl/csII/ca2_RedundantNS.pdf
Fibonacci coding also has applications in data compression:
https://en.m.wikipedia.org/wiki/Fibonacci_coding
See also: https://github.com/constructor-igor/cudafy/blob/11cdd4def4e7...
(In reply to the suggestion above that each register's carry bits are independent:) No they're not. Let's say B's non-carry bits are all 1. If you carry anything from C, that will affect B's carry bits.
[1] https://en.wikipedia.org/wiki/Carry-lookahead_adder
https://link.springer.com/article/10.1007/s11432-015-5411-x
https://arxiv.org/pdf/1411.5949.pdf