top | item 22970179

(no title)

fyp | 5 years ago

Isn't that the wrong direction for the optimization? I would assume you would want to compile adding two numbers into shifting by one, not the other way around.

(I know nothing about hardware, it just intuitively seems like moving a bunch of bits over by 1 should be faster than dealing with xor and carries)

discuss

order

jcranmer|5 years ago

In hardware terms, adders are simpler than shifters. You can usually do both in a single cycle, but it's going to be lower power to do the add instead of the shift.

To put this in more concrete terms: an N-bit adder involves N 1-bit stages to add each bit, and then a 1-bit carry network on top of that, which has N stages in it. So overall, it's O(N) in terms of hardware. An N-bit shift unit is going to use lg N N-bit muxes--or O(N lg N) in terms of hardware. Total gate delay in both cases is O(lg N), but adders have O(N) hardware (and thus energy consumption) while shifters have O(N lg N).

A secondary consequence of being larger area is that a superscalar architecture may choose to have one execution unit that has an adder and a shifter and a second that only has the adder. So an addition may schedule better than a shift, since there are more things it can execute on.

Tuna-Fish|5 years ago

> To put this in more concrete terms: an N-bit adder involves N 1-bit stages to add each bit, and then a 1-bit carry network on top of that, which has N stages in it. So overall, it's O(N) in terms of hardware.

O(N) adders cannot meet the latency demands of modern high-frequency CPUs. The actual complexity of adders in real CPUs is usually O(N²).

simcop2387|5 years ago

There's some hardware that surprisingly doesn't have a real shift but instead has rotate operations. These will take the bits that get dropped off and put them on the other side, in those cases the addition can be a much better choice than doing a bitmask and then rotate operation. These types of hardware are usually embedded devices that also have high cost multiplication instructions too so unrolling to a smaller number of fixed additions can actually perform better sometimes.

magicalhippo|5 years ago

> it just intuitively seems like moving a bunch of bits over by 1 should be faster than dealing with xor and carries

Yes, a fixed shift-by-one unit would be much simpler than an adder. But many (most?) CPUs that supports shifting have generic shift units, where the number of bits to shift varies, and that makes them much more complex.

kevin_thibedeau|5 years ago

Barrel shifters are still frequently omitted from microcontrollers. Primarily because of their size.

mhh__|5 years ago

Realistically when you get into real world questions about performance the only way to be sure is to measure it.

In this case I imagine you're right. Although also worth pointing out that due to the way modern CPUs are basically frontends to generate uOps that it could actually perform the optimization by itself anyway. Time to break out PAPI (very cool tool for anyone unaware, you can get instruction level profiling in your program with basically 4 function calls and a header file).