fyp | 5 years ago
(I know nothing about hardware, it just intuitively seems like moving a bunch of bits over by 1 should be faster than dealing with xor and carries)
jcranmer|5 years ago
To put this in more concrete terms: an N-bit adder involves N 1-bit stages to add each bit, and then a 1-bit carry network on top of that, which has N stages in it. So overall, it's O(N) in terms of hardware. An N-bit shift unit is going to use lg N N-bit muxes--or O(N lg N) in terms of hardware. Total gate delay in both cases is O(lg N), but adders have O(N) hardware (and thus energy consumption) while shifters have O(N lg N).
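The gate-count comparison above can be sketched with some back-of-the-envelope arithmetic. This is a toy model, not real synthesis data: the constants (5 gates per full adder, 3 gates per 2:1 mux) are illustrative assumptions, but the asymptotic shapes, O(N) for a ripple adder versus O(N lg N) for a barrel shifter, match the comment.

```python
import math

def ripple_adder_gates(n):
    # N one-bit full-adder stages; each stage is a small constant
    # number of gates (assume 5: two XORs, two ANDs, one OR).
    return 5 * n

def barrel_shifter_gates(n):
    # lg N stages, each an N-bit row of 2:1 muxes
    # (assume 3 gates per mux) -- O(N lg N) overall.
    return 3 * n * math.ceil(math.log2(n))

for n in (8, 16, 32, 64):
    print(f"{n}-bit: adder ~{ripple_adder_gates(n)} gates, "
          f"shifter ~{barrel_shifter_gates(n)} gates")
```

At 64 bits the toy shifter comes out several times larger than the toy adder, which is the area/energy gap the comment is pointing at.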
A secondary consequence of being larger area is that a superscalar architecture may choose to have one execution unit that has an adder and a shifter and a second that only has the adder. So an addition may schedule better than a shift, since there are more things it can execute on.
Tuna-Fish|5 years ago
O(N) adders cannot meet the latency demands of modern high-frequency CPUs. The actual complexity of adders in real CPUs is usually O(N²).
magicalhippo|5 years ago
Yes, a fixed shift-by-one unit would be much simpler than an adder. But many (most?) CPUs that support shifting have generic shift units, where the number of bits to shift varies, and that makes them much more complex.
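The generic shift unit described here is typically built as a barrel shifter: lg(width) stages of muxes, where each stage either passes the value through or shifts it by a fixed power of two, selected by one bit of the shift amount. A minimal software sketch of that structure (illustrative only, not a hardware description):

```python
def barrel_shift_left(x, amount, width=32):
    """Shift x left by `amount` the way a barrel shifter does it:
    lg(width) fixed-shift stages, each enabled by one bit of `amount`."""
    mask = (1 << width) - 1
    stage = 0
    while (1 << stage) < width:
        if (amount >> stage) & 1:           # this stage's control bit
            x = (x << (1 << stage)) & mask  # fixed shift by 2**stage
        stage += 1
    return x
```

A shift-by-one unit is just the first of these stages hardwired on; the other lg(width) - 1 mux rows are the extra complexity the comment refers to.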
mhh__|5 years ago
In this case I imagine you're right. Although it's also worth pointing out that, since modern CPUs are basically frontends that decode instructions into uOps, the CPU could actually perform the optimization by itself anyway. Time to break out PAPI (a very cool tool for anyone unaware: you can get instruction-level profiling in your program with basically 4 function calls and a header file).
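The optimization in question is strength reduction: `x * 2`, `x << 1`, and `x + x` are all equivalent, so a compiler (or a uOp-cracking frontend) is free to emit whichever form schedules best on the available execution units. A trivial sketch of the equivalence being relied on:

```python
def times_two_mul(x):
    return x * 2      # what the source says

def times_two_shift(x):
    return x << 1     # what a compiler might emit

def times_two_add(x):
    return x + x      # what may schedule best (more units have adders)

# The three forms agree on all inputs, which is what licenses
# the compiler/frontend to substitute one for another.
assert all(times_two_mul(x) == times_two_shift(x) == times_two_add(x)
           for x in range(-1000, 1000))
```

Which form actually wins on a given microarchitecture is exactly the kind of question instruction-level profiling with PAPI can answer empirically.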