top | item 39011458

billyzs | 2 years ago

> a2, c = a2+c, c+2
>
> is faster than
>
> a2 += c
>
> c += 2
>
> My guess is that in the first case the two evaluations and assignments can happen in parallel, and so may happen on different cores

Not sure I follow; isn't Python single-threaded by default? Changes to the GIL are coming, but do they change how the interpreter uses the CPU?
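For anyone who wants to reproduce the original claim, here is a minimal `timeit` sketch (iteration count and starting values are arbitrary choices, and absolute numbers will vary by CPython version and machine):

    # Micro-benchmark sketch for the two update styles under discussion.
    # Timings are interpreter-dependent; this only shows how to measure.
    import timeit

    tuple_style = timeit.timeit("a2, c = a2 + c, c + 2",
                                setup="a2 = 0; c = 1", number=100_000)
    two_line = timeit.timeit("a2 += c\nc += 2",
                             setup="a2 = 0; c = 1", number=100_000)
    print(f"tuple assignment: {tuple_style:.4f}s")
    print(f"two statements:   {two_line:.4f}s")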


kristofferc | 2 years ago

Yeah, this isn't benchmarking anything related to the CPU etc. It is benchmarking the quirks of the Python interpreter.

danybittel | 2 years ago

There is instruction-level parallelism in modern CPUs. They have multiple "calculation units" that, for example, do addition. If one instruction doesn't depend on the other, they get executed at the same time.

gpderetta | 2 years ago

But there is no dependency between the expressions in either variant, so in principle there is no reason why the first variant should be faster than the second (of course, Python internals will get in the way and make it hard to reason about performance at all).

epcoa | 2 years ago

LOL. The amount of machinery going on under the hood in evaluating those expressions in CPython is staggering. A microscopic detail like a single instruction data dependency has nothing to do with it. (How many CPU add instructions are executed just for those statements? Probably hundreds.)

This is much more likely a quirk of the interpreter (or possibly a fucked up test). CPU details are like 10000 feet down.

ufo | 2 years ago

It wouldn't run on separate cores, but single-threaded code can also get some measure of instruction-level parallelism.

A CPU can do more than one thing at once by computing the next instruction while it's still writing the result of the previous one. However, the CPU can only do that if it's 100% sure that the next instruction does not depend on the previous instruction. This optimization sometimes can't trigger in an interpreter, because of highly mutable variables such as the program counter or the top of the interpreter stack. Fun illustration: https://www.youtube.com/watch?v=cMMAGIefZuM&t=288s

saagarjha | 2 years ago

Running that across two cores would be a slowdown, not a speedup. You cannot parallelize work like this, because it's too small to be worth it.

kunley | 2 years ago

The evaluations don't magically/implicitly happen on many cores.

I guess the first thing worth doing when analyzing this would be looking at the differences in the bytecode, then looking at the C code implementing the differing bytecode ops. But there are also other factors, like the new adaptive interpreter specializing the code.
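The bytecode comparison is easy to do yourself with the `dis` module from the standard library (the function names below are made up for illustration):

    # Sketch: dump the bytecode of both variants with dis.
    # Output differs across CPython versions.
    import dis

    def tuple_style(a2, c):
        a2, c = a2 + c, c + 2

    def two_line(a2, c):
        a2 += c
        c += 2

    dis.dis(tuple_style)
    dis.dis(two_line)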

zimpenfish | 2 years ago

For Python 3.12, Godbolt gave almost identical bytecode for both (albeit in a different order). I'm guessing wildly, but might this be because `BINARY_OP(+=)` stores the result (because it's in-place) and then you also do `STORE_FAST(x)`, which gives you two stores for the same value compared with one store in the single-line version?

Single-line dual assignment:

    2         2 LOAD_FAST                0 (a2)
              4 LOAD_FAST                1 (c)
              6 BINARY_OP                0 (+)
             10 LOAD_FAST                1 (c)
             12 LOAD_CONST               1 (2)
             14 BINARY_OP                0 (+)
             18 STORE_FAST               1 (c)
             20 STORE_FAST               0 (a2)
vs the two-line version:

    2         2 LOAD_FAST                0 (a2)
              4 LOAD_FAST                1 (c)
              6 BINARY_OP               13 (+=)
             10 STORE_FAST               0 (a2)

    7        12 LOAD_FAST                1 (c)
             14 LOAD_CONST               1 (2)
             16 BINARY_OP               13 (+=)
             20 STORE_FAST               1 (c)
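Whatever the interpreter does with the stores, the two forms are semantically equivalent; a quick sanity check:

    # Both variants compute the same final values.
    a2, c = 0, 1
    a2, c = a2 + c, c + 2      # tuple assignment version

    b2, d = 0, 1
    b2 += d                    # two-statement version
    d += 2

    print((a2, c) == (b2, d))  # True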