In the post, multiplications and 2 additions are not faster than 2 additions.
The post compares (1) loop code that can be vectorized, as loop rounds are independent and do not depend on the result from the previous round, and (2) an "optimization" that makes calculations shorter, but also makes each loop round depend on the result of the previous round, so this cannot be vectorized.
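(For reference, a minimal sketch of the two loops under discussion. The names data, LEN, A, B, C and the quadratic data[i] = A*i*i + B*i + C are reconstructed from the discussion, not copied from the post.)

/* Version 1: every iteration is independent of all the others, so the
   compiler and the CPU are free to vectorize and pipeline it. */
void compute_independent(double *data, int LEN, double A, double B, double C) {
    for (int i = 0; i < LEN; i++)
        data[i] = A * i * i + B * i + C;   /* 3 muls + 2 adds per element */
}

/* Version 2: fewer operations per element, but each iteration depends on
   the previous one, so the whole loop is serialized. */
void compute_dependent(double *data, int LEN, double A, double B, double C) {
    double Y = C;         /* f(0) */
    double Z = A + B;     /* f(1) - f(0), the first difference */
    for (int i = 0; i < LEN; i++) {
        data[i] = Y;      /* 2 adds per element... */
        Y += Z;           /* ...but this Y depends on the previous Y */
        Z += 2 * A;       /* second difference of a quadratic is 2*A */
    }
}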
More generally, “stop storing state, unless it makes the program less complicated.”
The first version is simple. The second version is more complicated. The simplicity is because the first version can be represented as a formula; imagine trying to write out the second version as a formula, and the complexity becomes obvious.
(The addition loop would have to be written as a recurrence relation, which of course means every step depends on the previous step.)
Complexity isn’t always obvious. It’s common in C programs to write a loop like dst[index++] = src[i], particularly in deeply nested for-loops. In my experience it’s almost always worth rewriting it so that the indices are computable entirely from the loop iteration variables, with no state. It helps you build a mental map of the operations, because you can think of it in terms of geometry (memcpy = copying rectangles onto other larger rectangles), whereas when the index is stateful it becomes quite a lot harder to visualize in higher dimensions. At least for me.
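(To make the index point concrete, here is a hypothetical copy written both ways; the names dst, src, W, H, and srcPitch are made up purely for illustration.)

/* Stateful: the destination index is threaded through the loop nest. */
void copy_tile_stateful(double *dst, const double *src, int W, int H, int srcPitch) {
    int index = 0;
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            dst[index++] = src[y * srcPitch + x];
}

/* Stateless: the index is a pure function of the loop variables, so every
   write stands on its own -- a W-by-H rectangle copied out of a wider one. */
void copy_tile_stateless(double *dst, const double *src, int W, int H, int srcPitch) {
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            dst[y * W + x] = src[y * srcPitch + x];
}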
We’ve been building a compiler at Groq for our custom ML hardware. (Roughly, “turn pytorch into our ISA with no extra programmer effort.”) I used to think of posts like this as “Well, modern compilers are so fast, who really cares about autovectorization?” — it turns out you care when you need to write your own compiler. :)
MLIR is pretty cool. It makes a lot of transformations like this pretty easy. The MLIR “krnl” dialect can also automatically transform nested loops into tiled iteration. Graphics devs will know what I mean — no need to loop over 8x8 blocks of pixels manually, just write “for x in range(width): for y in range(height): …” and set the block size to 8.
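(For readers who haven’t seen tiling written out: this is roughly the transformation such a dialect performs, sketched by hand in C. process() is a hypothetical per-pixel function, just to have something to call.)

static void process(int x, int y) { (void)x; (void)y; }  /* hypothetical per-pixel op */

/* The plain loop nest you write... */
static void plain(int width, int height) {
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            process(x, y);
}

/* ...and the tiled nest a compiler can derive from it automatically,
   visiting 8x8 blocks so each block stays hot in cache. */
static void tiled(int width, int height) {
    for (int by = 0; by < height; by += 8)
        for (int bx = 0; bx < width; bx += 8)
            for (int y = by; y < by + 8 && y < height; y++)
                for (int x = bx; x < bx + 8 && x < width; x++)
                    process(x, y);
}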
> In the post, multiplications and 2 additions are not faster than 2 additions.
Reminded me of when I tried to optimize some code using assembly on x86, maybe 8-10 years ago. The compiler spat out a DIV [EDI] instruction inside a hot loop. I saw how I could easily rewrite it to use a register instead of the indirect memory access.
So I did, and it was significantly slower, like 10-15%... I even took care to align the loop target address and so on. If I changed my code to use the memory indirection it was measurably faster.
And with that I decided hand-optimizing x86 assembly was definitely a skill to leave behind.
The other issue is instruction-level parallelism, as another poster in TFA pointed out. Even within a single loop iteration the "unoptimized" code is more likely to exploit multiple ALUs if they exist, regardless of vectorization instructions.
To be precise, they're both "vectorized" in the sense that both versions are using SSE vector instructions (and in fact, Clang will even generate AVX instructions for the first version if you use -mavx2). The difference is really the data dependency, which has a massive effect on the CPU's ability to pipeline the operation.
Tldr: the autovectorizer transforms the first code into SIMD instructions, but the second into scalar 64-bit instructions.
SIMD is very powerful, and modern compilers can sometimes simd-ify your code automatically.
-------
The second code can probably become SIMD as well, but it's beyond GCC's ability to autovectorize it in that form. I kinda want to give it a go myself but don't have time today...
Autovectorizers are mysterious, often harder to predict and use than explicit SIMD code (like OpenCL or CUDA).
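(For comparison, a sketch of what explicit SIMD looks like for the independent version, hand-written with SSE2 intrinsics. This shows the idea, not the compiler's actual output.)

#include <emmintrin.h>

void compute_sse2(double *data, int LEN, double A, double B, double C) {
    __m128d vA = _mm_set1_pd(A);
    __m128d vB = _mm_set1_pd(B);
    __m128d vC = _mm_set1_pd(C);
    int i = 0;
    for (; i + 1 < LEN; i += 2) {
        /* lanes hold {i, i+1}; note _mm_set_pd takes (high, low) */
        __m128d vi = _mm_set_pd((double)(i + 1), (double)i);
        /* Horner form: (A*i + B)*i + C */
        __m128d r = _mm_add_pd(_mm_mul_pd(_mm_add_pd(_mm_mul_pd(vA, vi), vB), vi), vC);
        _mm_storeu_pd(&data[i], r);
    }
    for (; i < LEN; i++)                      /* scalar tail for odd LEN */
        data[i] = (A * i + B) * i + C;
}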
> The second code can probably become SIMD as well, but it's beyond GCC's ability to autovectorize it in that form.
Usually, for floating point operations the compiler simply has no chance to do anything clever. NaNs, infinities and signed zero mean that even the most "obvious" identities don't actually hold. For example, x + 0 == x (where the == is bitwise) does not hold for x = -0.
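(A quick way to see the signed-zero case concretely; the bits are compared with memcmp because == deliberately treats +0.0 and -0.0 as equal.)

#include <stdio.h>
#include <string.h>

int main(void) {
    double x = -0.0;
    double y = x + 0.0;   /* IEEE 754: (-0.0) + (+0.0) rounds to +0.0 */
    printf("x == y: %d\n", x == y);                               /* prints 1 */
    printf("bitwise equal: %d\n", memcmp(&x, &y, sizeof x) == 0); /* prints 0 */
    return 0;
}

This is why compilers only perform such rewrites under flags like -fno-signed-zeros (implied by -ffast-math).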
Both versions use SSE and are pipelined. The problem with the second one is the data dependency: only two adds, but the second add directly depends on the first one's result = stall.
> Tldr: the autovectorizer transforms the first code into SIMD instructions, but the second into scalar 64-bit instructions.
That's not the tldr. It's actually more wrong than right.
The actual TLDR is: loop dependencies prohibit the compiler _and_ your CPU from parallelizing. The SIMD part is the compiler's doing; the more important part here, though, is the CPU.
Long explanation:
Let's start with the compiler. Yes, the first version can be vectorized, and you can see that in the included screenshots of the assembly output: the use of addpd [1] (add 2 pairs of 64-bit doubles simultaneously) instead of addsd [2] (add 1 pair of 64-bit doubles). Both operations have identical throughput/latency on any architecture that isn't too ancient (e.g. check the corresponding intrinsic here [3]; unfortunately Agner's instruction tables don't include ADDSD for most architectures). So while you can crunch 2 operations using SIMD at the same time, you are still doing 3 multiplications and 2 additions instead of 2 additions. Multiplication and addition have the same throughput, at least on Skylake. So we can use simple math, and 5/2 is still more than 2!
Now to the CPU part. The loop dependency prevents the CPU from properly utilizing instruction pipelining [4], and now differences in throughput vs latency come into play. In a pipelined scenario the CPU can work on multiple steps of the instruction cycle in parallel (from fetching the instruction, to decoding it, to executing the operation, to eventually writing the result back to the register) and thus achieve higher throughput. However, if your next instruction depends on the instruction before it, the CPU has to finish all the steps of the pipeline from start to finish before the next instruction can begin. This is the latency of an instruction. For example, mulsd/mulpd have 8 times higher latency than throughput on Intel Skylake.
So while SIMD comes in at a factor of 2, the loop dependency stops the CPU from realizing a potential factor of 8.
Peter Cordes also correctly points this out in his comments and answer to the question on Stack Overflow.

[1] https://www.felixcloutier.com/x86/addpd

[2] https://www.felixcloutier.com/x86/addsd

[3] https://www.intel.com/content/www/us/en/docs/intrinsics-guid...

[4] https://en.wikipedia.org/wiki/Instruction_pipelining
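(The latency-vs-throughput effect is easy to reproduce without any SIMD at all, using the classic multiple-accumulator trick. Note that reassociating floating-point sums changes the rounding, which is exactly why the compiler won't do this on its own without -ffast-math; treat this as an illustration.)

#include <stddef.h>

/* One accumulator: each add must wait the full add latency for the previous one. */
double sum_one_chain(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent chains: the pipelined FP adder can overlap them. */
double sum_four_chains(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)   /* tail */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}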
I think it's worth pointing out that the reason these two examples execute at different speeds is due to how the compiler translated the code AND because the CPU was able to parallelize the work. Compilers take knowledge about the target platform (e.g. instruction set) and translate the input code into executable code. A compiler CAN (but doesn't have to) rewrite code, but only if the rewrite ALWAYS produces the same result as the input code.
I feel like for the last 10-15 years (the majority of) people have stopped thinking about the specific CPU and only think about the ISA. That works for a lot of workloads, but in recent years I have observed more and more interest in how a specific CPU can execute code as efficiently as possible.
If you're interested in the kind of optimizations performed in the example you should check out polyhedral compilation (https://polyhedral.info/) and halide (https://halide-lang.org/). Both can be used to speed up certain workloads significantly.
Seems like you can have your cake and eat it too, by manually parallelizing the code to something like this:
double A4 = A + A + A + A;      // 4*A
double Z  = 4*A + 2*B;          // f(2) - f(0): step for the even-index stream
double Y1 = C;                  // f(0), assuming data[i] = A*i*i + B*i + C as in the post
double Y2 = A + B + C;          // f(1)
int i;
for (i = 0; i + 1 < LEN; i += 2) {
    data[i]     = Y1;
    data[i + 1] = Y2;
    Y1 += Z;                    // even-index stream
    Y2 += Z + A4;               // the odd-index stream steps 4*A further ahead
    Z  += A4 + A4;              // second difference of a 2-strided quadratic is 8*A
}
if (i < LEN)                    // handle the last element when LEN is odd
    data[i] = Y1;
The step constants take some care to get right, but you get the idea: unroll the loop so that the data-dependent paths can each be advanced in parallel. For the machine being considered, a 4-step unroll should achieve maximum performance, but of course you get all the fun things that come with hard-coding the architecture in your software.
The part about data dependencies across loop iterations is fascinating to me, because it's mostly invisible even when you look at the generated assembly. There's a related optimization that comes up in implementations of ChaCha/BLAKE, where we permute columns around in a kind of weird order, because it breaks a data dependency for an operation that's about to happen: https://github.com/sneves/blake2-avx2/pull/4#issuecomment-50...
The pipelining issue is interesting because my reaction becomes “shouldn’t the CPU just come with a larger vector size and then operate on chunks within the vector to optimize pipelining?” but then I realize I’m just describing a GPU.
I doubt repeatedly adding floating point numbers is a good idea. With every addition the sum moves further away from the addend, and with their growing relative difference the problems grow as well.
Just because the algebra works doesn't guarantee that the code does, too.
Your points are factually correct, but in practice not a big concern if your floating point type is much more precise than necessary for the numbers used.
E.g. if you use doubles for the range of 32-bit integers, even adding 1.0 two billion times to 2 billion still ends up at 4 billion. Even adding 1.0 twenty billion times to 20 billion ends up at 40 billion. Now, adding 0.1 twenty billion times to 2 billion ends up 1908 short on my CPU, i.e. about 19080 absorptions/rounding errors occurred. You need some serious magnitude differences and operation counts to actually trigger errors.
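(A sketch reproducing that last experiment, in case you want to check your own CPU; fair warning, 20 billion dependent adds takes a while to run.)

#include <stdio.h>

int main(void) {
    double x = 2e9;                                /* start at 2 billion */
    for (long long i = 0; i < 20000000000LL; i++)
        x += 0.1;                                  /* exact answer would be 4e9 */
    printf("result: %.1f, short by: %.1f\n", x, 4e9 - x);
    return 0;
}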
It’s almost like managing the state yourself on a modern CPU is a complicated task. Automatic vectorisation, reordering, etc. can all be done more easily on a pure declarative program. Imperatively managing the state and the control flow with an ancient language like C really no longer reflects the underlying hardware, which is significantly more advanced.
The general rule to follow for power consumption on CPUs is to do your work quickly and then get to sleep. Propagating the clock is going to eat the bulk of your power. The mild difference between multiply and add in actual usage is inside the noise (orders of magnitude smaller). The bigger penalty in this case is the inter-iteration dependency, which, vectorized or not, runs the risk of holding up the whole show due to pipelining.
As a performance rule on modern processors: avoid using the result of a calculation for as long as you reasonably can (in tight loops, that is... you don't want to be out of cache). Have fun threading the needle!
The faster version usually consumes less power, as it allows the core more time to sleep. This is known as race-to-sleep or race-to-idle.
The problem here is that the additions depend on values computed in the previous iteration of the loop. The version with multiplications is faster because there are no dependencies on the previous iteration, so the CPU has more freedom in scheduling the operations.
In this case since the slower one is spending most of its time stalled out, the faster one will probably be more power efficient. This isn't an AVX512 power-gobbling monstrosity issue.
Besides the autovectorization, the second version also has two additional assignments. Depending on where the storage for these ends up, whether in registers, L1/L2, or the stack, there might be a performance hit.
First: we need to finally stop the harmful myth that floating point multiplication is slower than addition. This has not been true for a long while.
Second: why are so many people insisting that the loop is auto-vectorised? Is there any evidence to that? Data dependencies alone explain the observed performance delta. Auto-vectorization would have resulted in a higher speedup.
Vectorisation is not free. There is one other dimension to optimise for: power.
The suggested "slower" optimisation does fundamentally use fewer instructions. Chucking more hardware at parallelisable problems makes them run faster but does not necessarily reduce the power requirements much, because there are fundamentally the same number of instructions; it's just gobbling up the same power over a shorter period of time. The "slower" serial algorithm uses fewer instructions and, in theory, less power in total. Disclaimer: I mean power vs speed fundamentally, in theory; in practice there may be no measurable difference depending on the particular CPU and code.
Do you have a reference for that? I googled and I couldn't find anything good.
While it might sound intuitive that SIMD instructions consume more power, I don't think that's necessarily true to a relevant degree in practice. My understanding is that CPU power consumption is mostly tied to inefficiencies that cause energy loss via heat, while the actual computation doesn't consume any energy per se. So electrons traveling a more complex path probably cause somewhat more energy loss, as there is more wire/transistors to pass. But most of the total loss doesn't actually occur in the ALU. Empirically, from what you can see operating systems do, the most effective way of consuming less power is actually running at a slower clock, and the most effective way to achieve that is getting work done faster, which is not tied to the number of instructions.
The Stack Overflow question here [1] seems to suggest that SIMD vs no SIMD has negligible overhead compared to entering a lower power state sooner.

[1] https://stackoverflow.com/questions/19722950/do-sse-instruct...
But you’re not counting static power of the CPU and the whole system. It is possible for the vectorized code to consume less power overall because it can complete faster, using less static power.
We mean energy, right? The surprising fact is that the energy cost of executing an instruction (scheduling etc.) is much higher than the actual operation. Thus SIMD amortizes that per-instruction overhead over multiple elements, leading to 5-10x gains (https://pdfs.semanticscholar.org/f464/74f6ae2dde68d6ccaf9f53...).
Even in this example, which apparently has 4x vector instructions vs 1x scalar, AVX-512 (and probably even AVX2) would reduce energy usage because they can do 8 (or 4 for AVX2) 64-bit elements per cycle.
The slower algo will use more instructions, as it will have to run for more iterations. The faster algo will use wider ALUs that are more power hungry, so it appears a wash. But because fewer instructions are in flight, less power needs to be spent keeping track of them or renaming registers, caches need to be powered for a smaller amount of time, and so on.
Unless we're talking extremes (AVX512...), optimizing for power in modern CPUs is almost always optimizing the race to idle.
I.e. the CPU doesn't use much less power if most ALUs are idle, so you might as well use all of them to compute redundant results -> and then turn ~everything off sooner.
I'm curious about how an optimiser determines that a block of code can be vectorised. It's trivial to see this in the initial version of compute(), but I'm not sure how an optimiser does this. Is it as simple as checking that A, B, and C are const?
And how far would an optimiser typically take its analysis? For example, if B was defined inside the loop as (non-const) A * 2? Or as A * f(), where f() always returns 2, maybe using a constant or maybe a calculation, etc.? Seems like a very hard problem.
Oh man, I have been smoked by data dependencies like this so many times. I've gotten a lot better over the years at "seeing" data dependencies in godbolt or whatever, but it still slips through in my code and that of my colleagues way more often than I'd like.
Is anyone aware of good tooling for automatically catching this sort of thing even some of the time?
I'm not sure if there are existing rules for it, but you could write a CodeQL query looking for data dependencies in loops. Obviously dependencies are sometimes required, but it could at least tell you they were there.
Don’t use state in a loop unless you have to? Avoiding such use of state also makes code a lot easier to read and debug - the performance benefits are really just a bonus.
I'm running into an interesting result -- on my machine, the fast version runs at the same speed as the slow one at LEN=1000000 (the default in the sample), but starts running 1.5 times faster at LEN=500000 and ends up twice as fast at LEN=100000 and lower. This is with gcc -O3. Why could this be?
For fun, I tried this in Go on an M1 Mac (basically because benchmarking in Go is so easy).

And... the two versions run at exactly the same speed, on the M1 Mac, with Go.
edit: of course, with go, you can manually parallelize the faster option with goroutines... but that does something else, doesn't it. (and it's 500x faster.)
Funny that this question gained 180 upvotes in 10 days, when it could just as well have received the reverse for showing so little effort by the author to figure out the (rather obvious) data dependency on his own.
Is the implication here that tail call optimizations don't work anymore? They might seem to do the proper thing on the language level but the CPU just can't think that way.
Tail calls will work the same as a loop. So if you have data dependencies between loop iterations, or between tail-call-eliminated stack frames, then it will be slower than if you do not have those dependencies.
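(A sketch of that equivalence, reusing the assumed quadratic from above: after tail-call elimination, which mainstream C compilers perform at higher optimization levels, this is the same machine loop with the same loop-carried dependency.)

/* Tail-recursive version of the incremental loop; compilers typically turn
   this into a plain loop at -O2, carrying Y and Z from "frame" to "frame". */
static void fill_from(double *data, int i, int LEN, double Y, double Z, double A) {
    if (i >= LEN)
        return;
    data[i] = Y;
    fill_from(data, i + 1, LEN, Y + Z, Z + 2 * A, A);  /* next call needs this Y and Z */
}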
The same considerations apply, though not always. JavaScript can be JIT compiled, though the JIT might not make the same optimisations as a C compiler. It's running on the same CPU though.
If we ignore the details, then this is a perfect example of how simpler code should be preferred _by default_ over complex code. Both for performance and reasoning. In this case and often elsewhere, these are related things.
I'm taking the definition of simple and complex (or complected) from this talk[0]. In short: Simple means individual pieces are standing on their own, not necessarily easy or minimal. Complex means individual pieces are braided together into a whole.
The first code is simpler than the second. Just have a look at the two solutions and you'll see what I mean: the individual instructions in the first stand on their own and clearly declare what they mean. The second example is more complex, because each line/subexpression cannot be understood in isolation; there is an implicit coupling that requires the programmer _and_ the machine code interpreter to understand the whole thing for it to make sense. The CPU apparently fails at that and cannot optimize the machine instructions into more efficient microcode.
The example illustrates performance benefits of simpler code, but there are others too:
For example a type checker might be able to infer things from simpler code, but not from complex code. Again, complexity is about coupling, which can for example arise from making assumptions about things that are out of bounds or logic that happens in some other part of your program, which _this_ part relies on.
Things like these can either be outright rejected by a type checker or simply ignored and sometimes we can provide additional ceremony to make it happy. But there is always an underlying question when these issues arise: Can I express this in a simpler way? It's sometimes possible to describe rich semantics with simpler, more primitive types and explicit coordination.
Another thing that comes to mind are rich ORMs. They can be very convenient for common cases. But they have footguns for the general case, because they _complect_ so many things, such as validation, storage, domain rules, caching etc. And they are leaky abstractions because we have to do certain things in certain ways to avoid bloated SQL, N+1 etc.
They are not simple (and certainly not declarative) so we program against a black box. There are simpler "ORMs" that I very much like, but they typically only provide a minimal set of orthogonal features, such as query builders, and mapping results into plain data structures. It is simpler to use these kind of orthogonal tools.
Last but not least: Simple code can be pulled apart and changed more easily without having to worry about introducing problems. The substitution rule or referential transparency enables this by guaranteeing that each expression can be replaced by the value it will evaluate to. This also implies that other expressions that evaluate to the same value can be substituted freely.

[0] Simple Made Easy - Rich Hickey at Strangeloop 2011: https://www.youtube.com/watch?v=LKtk3HCgTa8
You can't ignore the details if you want the fastest performance. Unfortunately sometimes the solution that's the fastest is also more complex. Simplicity tends to win and also is less likely to be buggy, easier to maintain etc but sometimes the fastest code deliberately introduces coupling, hardware specific cache shenanigans or unreadable hand written asm and/or inlined SIMD. This is often counter to your hypothesis.
In the 2 additions version, computation of the next iteration depends on the results of the preceding iteration. In the multiplication version, the computations are independent for each iteration, enabling parallel execution (by SIMD and/or pipelined/superscalar execution).
dang|3 years ago
(Submitted title was "Multiplications and 2 additions are faster than 2 additions")
eklitzke|3 years ago
For the first version with AVX I get:

$ perf stat ./a.out
[-] Took: 225634 ns.

For the second version with SSE and the data dependency I get:

$ perf stat ./a.out
[-] Took: 955104 ns.

As you can see the first version gets nearly 3x better IPC (2.21 vs 0.84) and spends half as much time being backend bound.
varajelle|3 years ago
The power consumption is a good question.
mgaunard|3 years ago
You'd need to parallelize it explicitly (which can be done by just unrolling the loop).
AtNightWeCode|3 years ago
The general advice when I worked with CG was to use deltas in iterations.