In hardware this would be called a carry save adder [1].
One interesting fact about carry save adders is that the carry part of the register can be extended in order to avoid any carrying between words for very long periods of time. Instead of using 53 bits to represent 52 bits, use 60 bits, and now you can perform 256 sequential additions with no carrying between words before the carry part saturates and needs to be handled.
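A software sketch of the same idea (assuming 52-bit digits stored in 64-bit words, so 12 spare carry bits and up to 2^12 additions before a word could overflow — the limb count and function names here are mine, not from any particular library):

```python
MASK52 = (1 << 52) - 1  # each limb stores a 52-bit digit in a 64-bit word

def to_limbs(n, count=4):
    return [(n >> (52 * i)) & MASK52 for i in range(count)]

def from_limbs(limbs):
    return sum(l << (52 * i) for i, l in enumerate(limbs))

def add_delayed(acc, limbs):
    # Limb-wise add with no carry propagation between words; the carries
    # just pile up in the 12 spare bits of each word. Safe for up to
    # 2**12 additions before a word would exceed 64 bits.
    return [a + l for a, l in zip(acc, limbs)]

def propagate_carries(acc):
    # Handle all the saved-up carries in one pass at the end.
    out, carry = [], 0
    for a in acc:
        v = a + carry
        out.append(v & MASK52)
        carry = v >> 52
    return out, carry
```

Each intermediate addition is pure limb-wise parallelism; only the final pass is sequential.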
Somewhat surprisingly, it's even possible to use this construction in the context of a quantum computation [2][3]. The reason it's surprising is that tacking on additional registers like this can act as an information leakage mechanism (which is an error in a quantum computation). By analogy, if you computed a public key using this trick, you would not be comfortable handing an attacker the registers storing your public key before you removed the carry padding that made the computation faster. But it turns out that if you just initialize the carry padding randomly, subtracting it out of the next word at the start to ensure everything sums to the correct result at the end, then you can show that the information leakage and the chance of a bad overflow are both exponentially suppressed in the length of the carry padding. Currently the most efficient known way to implement Shor's algorithm uses this technique [4].
> Aside: Why 13 bits instead of 12? For our purposes, we’re going to ignore the carries in the most significant limb, allowing numbers to wrap when they overflow past 2^256 - 1 (just like how unsigned addition works in C with normal size integer types). As a result, we can assign 52 bits to the most significant limb and ignore the fact that it will run out of room for carries before the other limbs do.
If you're going to wrap, you could go all the way and assign 64 bits to the most significant limb; that way you save 12 bits which you can spread to the other four limbs. You can go from {52, 51, 51, 51, 51} to {64, 48, 48, 48, 48}, so you have a spare 16 bits instead of 13 bits.
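A sketch of that wrap-around layout (Python, least-significant limb first, my own function names): four 48-bit limbs plus a 64-bit top limb that simply wraps, giving addition mod 2^256 overall.

```python
M48, M64 = (1 << 48) - 1, (1 << 64) - 1

def split_48(n):
    # {48, 48, 48, 48, 64} layout, least-significant limb first
    return [(n >> (48 * i)) & M48 for i in range(4)] + [(n >> 192) & M64]

def combine_48(limbs):
    return (sum(l << (48 * i) for i, l in enumerate(limbs[:4]))
            + (limbs[4] << 192)) & ((1 << 256) - 1)

def normalize_48(limbs):
    out, carry = [], 0
    for l in limbs[:4]:
        v = l + carry
        out.append(v & M48)
        carry = v >> 48
    out.append((limbs[4] + carry) & M64)  # top limb wraps, like C unsigned overflow
    return out
```

The low limbs have 16 spare bits each; the top limb never needs any because overflow past 2^256 is discarded by design.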
Indeed, if you use only 48 bits, you could also parallelize using the FP hardware: the double-precision mantissa is 52 bits, so with 48-bit limbs you get 16 rounds before a carry is needed. That's far fewer than the 2^16 (or even 2^13) rounds the integer scheme allows, but on processors that have distinct FP and integer adders, and that can issue them in parallel, you might get a speed boost.
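A quick check of the arithmetic behind the FP variant (a sketch — real code would use SIMD doubles; the 16-round bound follows from 16 * (2^48 - 1) < 2^53, the range where doubles represent integers exactly):

```python
def fp_accumulate(rows):
    # Accumulate 48-bit limbs in doubles; the column sums stay exactly
    # representable for up to 16 additions, after which carries would
    # have to be propagated before continuing.
    acc = [0.0] * len(rows[0])
    for row in rows:
        acc = [a + float(l) for a, l in zip(acc, row)]
    return [int(a) for a in acc]
```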
The design of Poly1305 was directly informed by this very technique, with the addition (heh) of performing the arithmetic on the x87 FPU operating in 80-bit mode. It even has some zero padding bits placed strategically to aid the technique, hence the security bound of 2^106 instead of 2^128 or 2^130.
I have an arbitrary precision arithmetic library in my mind that represents numbers as streams of digits starting from the most significant digit. A redundant numeric representation could avoid unbounded backtracking (think flipping between 0.9999... and 1.0000... in a regular representation) and could guarantee forward progress for the generated digits.
I love that the same trick works for human brains as well as CPUs. Your brain already pays the price to represent numbers as high level abstract concepts rather than tightly packed fields. Just let the digits overflow, then clean them up later.
For my undergrad thesis I studied “Fibinary”, or Fibonacci coding. You can think of it like “base phi”, using Fibonacci numbers as the place values, and only 0,1.
The “normalized” representation, called Zeckendorf representation, perfectly packs error correction into the binary, as it ensures no number will ever have consecutive ones.
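The greedy algorithm produces the Zeckendorf form directly — take the largest Fibonacci number that fits, repeat — and the no-consecutive-ones property falls out because the remainder after each step is always smaller than the next Fibonacci number down. A sketch (my own naming):

```python
def zeckendorf(n):
    # Most-significant digit first; place values are 1, 2, 3, 5, 8, ...
    fibs = [1, 2]
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    bits = []
    for f in reversed(fibs[:-1]):
        if f <= n:
            bits.append(1)
            n -= f
        else:
            bits.append(0)
    return bits
```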
#include <stdint.h>

uint32_t bytewise_add(uint32_t a, uint32_t b) {
    uint32_t s = a ^ b;           // sum bits
    uint32_t r = a & 0x7f7f7f7f;  // clear msbs
    uint32_t t = b & 0x7f7f7f7f;  // clear msbs
    s = s & 0x80808080;           // msb sum bits
    r = r + t;                    // add without msbs; carries stay inside each byte
    return r ^ s;                 // fold in the msb sum bits, with no carry-out between bytes
}
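A quick sanity check of the trick above, ported to Python (masking to 32 bits to mimic the register; the reference adds each byte independently, discarding its carry-out):

```python
def bytewise_add(a, b):
    s = (a ^ b) & 0x80808080                 # sum bits at each byte's msb
    r = (a & 0x7f7f7f7f) + (b & 0x7f7f7f7f)  # 7-bit adds; carries stay inside bytes
    return (r ^ s) & 0xffffffff              # fold the msb sum bits back in

def bytewise_add_reference(a, b):
    # add byte by byte, discarding each byte's carry-out
    return sum(((((a >> k) & 0xff) + ((b >> k) & 0xff)) & 0xff) << k
               for k in (0, 8, 16, 24))
```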
Very interesting. It may seem that the bad performance of the ADC instruction is due to bad design of the ALU, but I think this is more an example of how replacing a single sequential task with several parallel tasks, whose results are combined at the end, lets the CPU finish faster.
He’s testing on Haswell, where (at least according to Agner Fog) ADC has an extra uop compared to ADD. On Broadwell and later, non-memory ADC is a single uop with a single cycle of latency. He would probably see a big speedup on a newer CPU.
Carries were a major problem in mechanical calculators as well. For example, the upper section of the Curta had a special bell & gearing to allow carries. The Friden STW - a mechanical behemoth - had a complex set of gears within the moving carriage to allow carries & borrows. Same with Monroe & Marchant mechanical calculators. For electric motor driven calculators, the additional weight of the moving carriage - as well as the gearing - set limits on how fast the machine could add and subtract.
It's amazing that the 17 instructions at the end to tidy up all the carries, which looks like it has to be done serially, is still faster. But I guess each register's carry bits are independent from the ones that get carries added from the last register, so it could be ...
mov W, E
mov V, D
mov U, C
mov T, B
shr W, 51
shr V, 51
shr U, 51
shr T, 51
add D, W
add C, V
add B, U
add A, T
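In Python terms, with the masking of each limb's low 51 bits made explicit (the shift-and-add round alone isn't sufficient on its own: adding a carry can push an almost-full limb past 2^51, so a complete normalization has to loop):

```python
MASK51 = (1 << 51) - 1

def propagate_once(limbs):
    # One parallel round: mask off each limb's spilled carry bits
    # and add them into the next limb up. Least-significant limb first.
    out = [l & MASK51 for l in limbs[:-1]] + [limbs[-1]]
    for i in range(1, len(limbs)):
        out[i] += limbs[i - 1] >> 51
    return out

def propagate_all(limbs):
    while any(l >> 51 for l in limbs[:-1]):
        limbs = propagate_once(limbs)
    return limbs

def limb_value(limbs):
    return sum(l << (51 * i) for i, l in enumerate(limbs))
```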
Cool trick. I wonder how it would compare in performance to a carry-lookahead adder [1], which uses the fact that you can compute any carry digit directly from a combination of the original inputs (which is increasingly long the deeper you go).
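For reference, the lookahead formula in bit-list form: each carry c[i+1] is expanded into generate/propagate terms over the original inputs only (g = a AND b, p = a OR b), so no carry depends on the previous carry. A toy O(n^3) sketch, not the log-depth circuit form:

```python
def lookahead_carries(a_bits, b_bits):
    # bits are least-significant first
    g = [x & y for x, y in zip(a_bits, b_bits)]   # position generates a carry
    p = [x | y for x, y in zip(a_bits, b_bits)]   # position propagates a carry
    carries = [0]
    for i in range(len(g)):
        # expanded: c[i+1] = g[i] | p[i]g[i-1] | p[i]p[i-1]g[i-2] | ...
        c = 0
        for j in range(i, -1, -1):
            term = g[j]
            for k in range(j + 1, i + 1):
                term &= p[k]
            c |= term
        carries.append(c)
    return carries
```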
This reminds me of "zeroless number systems" or "bijective number systems" (https://en.wikipedia.org/wiki/Bijective_numeration#The_bijec...) - in the language of the article, the allowed digits in base 10 are 123456789A. Note that 0 is not an allowed digit.
For instance here are the first few natural numbers in this system: 1,2,3,4,5,6,7,8,9,A,11,12...
I first read about them in Chris Okasaki's purely functional data structures book, they have some applications in terms of designing data structures (e.g. combining perfect-powers-of-two sized heaps to build heaps of arbitrary size).
They're conceptually interesting too, being a positional number system with no redundancy (exactly one way to represent each number) and without 0. Since the concept of a 0 digit is not required, one could argue that it's conceptually easier for e.g. a classical Roman mathematician to learn this than the positional system we do use today.
Also, Excel's system for numbering columns uses this - A, B, C, ... X, Y, Z, AA, AB ...
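Excel's scheme is exactly bijective base 26, and the conversion is a one-line tweak to ordinary base conversion — subtract 1 before each divmod so the digits run 1..26 instead of 0..25 (function name is mine):

```python
import string

def column_name(n):
    # bijective base 26: digits A..Z stand for 1..26; there is no zero digit
    name = ""
    while n > 0:
        n, r = divmod(n - 1, 26)
        name = string.ascii_uppercase[r] + name
    return name
```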
A question here is whether this trick (which I acknowledge is very interesting) is worth doing on Broadwell. Broadwell has a pair of new instructions (ADCX and ADOX) that don't behave like a normal ADC - they only read and set a single bit (carry bit and overflow bit). As such you can do two of these at once and at least on recent Intel processors there are two execution ports (0 and 6) that can handle these instructions.
You'd still need some cleverness if your goal is to add just one giant bignum (as opposed to doing lots of independent add-with-carry sums).
>"The key insight here is that we can use this technique to delay carry propagation until the end. We can’t avoid carry propagation altogether, but we can avoid it temporarily. If we save up the carries that occur during the intermediate additions, we can propagate them all in one go at the end."
Brilliant and tremendously interesting!
Useful for if/when I write a fast super-long integer math library in the future (yes, I know GNU already has one, but it's always interesting to know/understand tricks like this for >64-bit, AKA "too-long-to-fit-in-a-single-register" integers...)
What I'm wondering about now is: does that 'adc' opcode have any other useful application except for adding numbers bigger than the width of a single register?
Because it does look a little bit as if it was engineered for that particular situation, and if that was the case, with this situation now in practice being solved much more efficiently without a single 'adc' instruction, the opcode might suddenly be completely useless bloat - which of course still had to be maintained forever, even though nobody uses it in practice, because it's part of the spec.
It's not useless bloat though. It's probably hard to beat in terms of code size, which is often even relevant for performance (due to cache size considerations).
Furthermore, the article is talking about adding many numbers, which is just one of the possible use cases. As far as I can tell, adc is still useful for adding a pair of 128-bit numbers on a 64-bit CPU for example.
I believe adc is still commonly used as a way of reading the carry bit. For example, assuming eax has already been zeroed, adc eax, eax would be the idiomatic way of reading the carry bit into eax.
The carry flag gets set by all sorts of instructions other than addition and subtraction (comparisons, bit shifts, multiplication, etc) so this is more useful than you might think.
x86 has a lot of "useless bloat" with legacy high-level instructions. Modern compilers avoid those, and newer CPUs aren't engineered with performance in mind for those instructions since nobody sane uses them anymore.
That might be another reason why it is slow -- it is an old opcode that has to be supported for compatibility, but isn't prioritized in any of the pipelines in newer chip designs.
The post did not mention potential cost of splitting and masking these into 5 limbs instead of 4. Perhaps I am missing that in this proposed system, 256-bit ints are always stored in 5 registers, then you only normalize when the result is needed? What are the costs of splitting/masking into 5-limbs and could the scaling impact the gains of parallel adds/subtracts meaningfully?
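The split and recombine steps are indeed a handful of shifts and masks each way; the usual answer is to keep values in limb form across many operations so the conversion cost is paid only at the boundaries. A sketch of the {52, 51, 51, 51, 51} split (least-significant limb first; names are mine):

```python
SIZES = [51, 51, 51, 51, 52]  # least-significant first; the top limb gets 52 bits

def split_limbs(n):
    limbs, shift = [], 0
    for size in SIZES:
        limbs.append((n >> shift) & ((1 << size) - 1))
        shift += size
    return limbs

def combine_limbs(limbs):
    n, shift = 0, 0
    for l, size in zip(limbs, SIZES):
        n += l << shift
        shift += size
    return n & ((1 << 256) - 1)
```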
Since good bignum libraries already use tricks like this, and many others besides (like using FFT to handle very large bignum multiplication), it doesn't seem that the hardware needs to change.
There's a whole subfield of adder design focused on carrying, since it's a big dependency chain even within one register. I believe this is a software version of what some adder circuits already do (although I'll need some reading to check).
[1] https://en.wikipedia.org/wiki/Carry-save_adder
[2] https://arxiv.org/abs/1905.08488
[3] https://youtu.be/upTipX9yXNg?t=777
[4] https://arxiv.org/abs/1905.09749
See also the Poly1305 paper: https://cr.yp.to/mac/poly1305-20050113.pdf
More links on redundant number representations:
https://en.wikipedia.org/wiki/Redundant_binary_representatio...
https://en.wikipedia.org/wiki/Carry-save_adder
http://lux.dmcs.pl/csII/ca2_RedundantNS.pdf
Fibonacci coding also has applications in data compression:
https://en.m.wikipedia.org/wiki/Fibonacci_coding
See also: https://github.com/constructor-igor/cudafy/blob/11cdd4def4e7...
(In reply to the suggestion above that each register's carry bits are independent:) No they're not. Let's say B's non-carry bits are all 1. If you carry anything from C, that will affect B's carry bits.
[1] https://en.wikipedia.org/wiki/Carry-lookahead_adder
https://link.springer.com/article/10.1007/s11432-015-5411-x
https://arxiv.org/pdf/1411.5949.pdf