top | item 43593810

(no title)

exyi | 11 months ago

It is! Although my test case is probably an unrealistically bad scenario:

It's the classic, why is processing sorted array faster than unsorted one

    def f(arr):
        r = True
        for x in arr:
           if x > 128: r = not r
        return r

    arr = [ random.randint(0, 255) for i in range(0, 1000_000) ]
    arr_sorted = list(sorted(arr))

    %timeit f(arr)
    %timeit f(arr_sorted)

Results are (on my machine): 17.5 ms for unsorted, and 13.5 ms for sorted. For comparison, in a compiled language, the difference is 4x

discuss

order

ammar2|11 months ago

Edit: Analyzed the wrong thing earlier.

This depends on the Python version, but if it has the specializing interpreter changes, the `COMPARE_OP` comparing the integers there is probably hitting a specialized `_COMPARE_OP_INT` [1].

This specialization has a ternary that does `res = (sign_ish & oparg) ? PyStackRef_True : PyStackRef_False;`. This might be the branch that ends up getting predicted correctly?

Older versions of Python go through a bunch of dynamic dispatch first and then end up with a similar sort of int comparison in `long_richcompare`. [2]

[1] https://github.com/python/cpython/blob/561965fa5c8314dee5b86...

[2] https://github.com/python/cpython/blob/561965fa5c8314dee5b86...

dzaima|11 months ago

This isn't actually timing the sorting, but just the (dumb) function f.

LPisGood|11 months ago

This is a really good example. It is more like branch prediction than standard data/instruction caching.

I wonder if you could do Spectre type vulnerabilities in python. You would need some way to leak micro-architectural state, so without being particularly clever maybe python code could only be used as a gadget or something.

cma|11 months ago

Python speed up is probably from small integer caching, a sorted array will have runs of pointers to the same integers adjacent. The compiled language one is probably branch prediction right?

exyi|11 months ago

I intentionally stayed in the small integer range to avoid benchmarking the cache. 256 distinct values should fit into L1 just fine in both cases.

I'm now thinking that the difference might be even larger if we instead avoid small integers and let the CPU get stuck chasing pointers. The idea is that it gets stuck on a memory access, which forces it to speculate much further, which in turn makes it backtrack a longer path if a branch was mispredicted. I'm obviously no expert on this, feel free to correct me

The results for 1B range instead of 255 are 17.6 ms for unsorted / 68.2 ms for sorted! We are back to what the original article observed and it's a way stronger effect than what branch prediction can offer. So don't sort your arrays, keep them in the order the boxed values were allocated ;)

bgwalter|11 months ago

That seems very likely. The benchmark should probably use a range that is guaranteed to be outside of the cached smallint range.