top | item 26860925

Test for lists in Cython

124 points| sapo | 5 years ago |github.com | reply

136 comments

[+] jakobnissen|5 years ago|reply
I'm a massive Julia fanboy, but I would not extend Python with Julia if I could choose not to. Julia has a _massive_ runtime with a hello-world script consuming 150 MB of RAM, not to mention the dreaded startup-time.

It's better and easier to use Julia as your main top-level "glue" language and call Python/Rust/C from Julia. Julia is in many ways a better glue language than Python - better multithreading, easier calling into C/Rust, etc. Then, over time, to the extent it is practical, you can replace foreign code with Julia code - because Julia is fast enough that it actually makes sense to do so.

If you already have a large Python code base and can't switch the top-level language to Julia, I would just not use Julia for that project.

[+] O_H_E|5 years ago|reply
> _massive_ runtime ... 150MB of RAM, not to mention the dreaded startup-time.

I understand how this can be inconvenient for you and me while scripting. But it is no problem at all for Julia's current main target market, which runs multi-GB simulations.

Also, it is not as if they will be spawning a Julia instance in a hot loop (THAT would be horrible). You write some Julia code and import it using something like pyjulia at the beginning of the file.

[+] galangalalgol|5 years ago|reply
You seem to understand Julia well. Is there a reason there is no Nexus plugin or way to mirror the Julia package repository for dev networks that don't have unrestricted access to the internet, or are even airgapped? I see multiple people asking this online, so it is a common enough problem, but people seem to say the Julia approach makes it hard to support.
[+] dr_zoidberg|5 years ago|reply
The cython code is a bit messy. Changing from:

    cpdef float iterate_list(a_list):

        cdef double count = 0
        cdef int i, j
        for i in range(len(a_list)):
            internal_list = a_list[i]
            for j in range(len(internal_list)):
                count += internal_list[j]
        print(count)
        return count
To:

    cpdef float iterate_list(list a_list):

        cdef double count = 0
        cdef double val = 0
        cdef list ilist
        for ilist in a_list:
            for val in ilist:
                count += val
        print(count)
        return count
Speeds up the iterate_list function by an order of magnitude. On my PC:

    In [9]: %timeit list_cy.iterate_list(a_list)
    1000000.0007792843
    1000000.0007792843
    1000000.0007792843
    1000000.0007792843
    1000000.0007792843
    1000000.0007792843
    1000000.0007792843
    1000000.0007792843
    385 ms ± 6.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    In [10]: %timeit list_cyo.iterate_list(a_list)
    1000000.0007792843
    1000000.0007792843
    1000000.0007792843
    1000000.0007792843
    1000000.0007792843
    1000000.0007792843
    1000000.0007792843
    1000000.0007792843
    2.71 s ± 182 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
(yeah, I kept the prints that the code has)

Where list_cy is the fixed code and list_cyo is the original code. Even then, iterating over a list of lists is _definitely not_ the optimal way you'd face a problem of this kind. Numpy arrays and memoryviews would be the correct tool to use.
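
To illustrate that last point with a sketch (assuming NumPy is available; the 1000x1000 array of ones is illustrative, not the benchmark's actual data), converting once to an array and doing a single vectorized reduction avoids per-element Python object handling entirely:

```python
import numpy as np

# Hypothetical stand-in for the benchmark's list of lists:
# a rectangular 1000 x 1000 structure of floats.
a_list = [[1.0] * 1000 for _ in range(1000)]

# One conversion up front, then a single reduction that runs in C.
arr = np.asarray(a_list)   # shape (1000, 1000), dtype float64
total = arr.sum()          # no Python-level loop at all

print(total)  # 1000000.0
```

A typed memoryview over such an array would then give Cython GIL-free element access with no list handling at all.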

[+] optimalsolver|5 years ago|reply
Thanks for this.

Will Julia advocates ever use honest benchmarks to make their language look good? I doubt it.

[+] ZuLuuuuuu|5 years ago|reply
I feel like the #1 downside of Python for the last few years is that you cannot easily take advantage of multiple CPU cores, especially considering how heavily it is used in data analysis. We use Python for data analysis as well, and for 95% of the operations we do, numpy is fast enough that we have no complaints. But sometimes we do wish we could use all the cores in our CPUs, especially now that an 8-core CPU is easily available at a reasonable price.

There is the multiprocessing module, but sharing memory between processes is not straightforward. I guess the best options are either writing a C extension or using Numba. Writing C extensions requires either distributing binary packages or having a C compiler present on the target computer, which is not always ideal. So is Numba the best solution currently? I tried it a bit in the past, but the errors I got were a bit hard to interpret compared to regular Python errors.

I wish the threading module had support for truly parallel native threads. Are there any PEPs trying to bring easy multi-core support to Python?
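
On the shared-memory point specifically: since Python 3.8 the standard library's multiprocessing.shared_memory does let processes share a raw buffer without copying. A minimal sketch (sizes are illustrative; the second attachment is done in-process here for brevity, but a child process would attach by the same name):

```python
from multiprocessing import shared_memory
import numpy as np

# Create a shared block big enough for 1,000,000 float64 values.
n = 1_000_000
shm = shared_memory.SharedMemory(create=True, size=n * 8)
arr = np.ndarray((n,), dtype=np.float64, buffer=shm.buf)
arr[:] = 1.0

# A worker process would call shared_memory.SharedMemory(name=shm.name);
# shown in-process here, but the mechanism is identical.
peer = shared_memory.SharedMemory(name=shm.name)
view = np.ndarray((n,), dtype=np.float64, buffer=peer.buf)
view *= 2.0                    # mutates the same memory, zero copies

total = float(arr.sum())
print(total)  # 2000000.0

peer.close()
shm.close()
shm.unlink()
```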

[+] jeremiecoullon|5 years ago|reply
I've been using JAX (https://jax.readthedocs.io/en/latest/) for scientific computing in general (in particular MCMC algorithms), as it's really fast. Even on a CPU you get massive speedups compared to numpy (can be up to 2 or 3 orders of magnitude faster in some cases).

The main selling points of the library are automatic differentiation and compilation to XLA, but I've been using it even when I don't need gradients, as it's really fast (due to compilation). I also really like the random number generator, as it's very good for reproducibility.

I've played around with Julia in the past and really liked it, but in terms of speed JAX has pretty much solved that problem for me.

[+] pletnes|5 years ago|reply
Numpy operations release the GIL (usually, at least), so you can use a thread pool and, indeed, share memory. Just try it and you may be pleasantly surprised.
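
A sketch of that pattern (assuming NumPy; the array size and chunk count are arbitrary): because NumPy's reductions drop the GIL inside their C loops, threads in a pool can genuinely run in parallel over views of one shared array:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

data = np.ones(4_000_000)         # one shared array; workers get views, not copies

def partial_sum(chunk):
    # np.sum releases the GIL while it crunches, so threads overlap for real.
    return chunk.sum()

chunks = np.array_split(data, 8)  # 8 zero-copy views into `data`
with ThreadPoolExecutor(max_workers=8) as pool:
    total = sum(pool.map(partial_sum, chunks))

print(total)  # 4000000.0
```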

Dask is great if you’re processing large amounts of data, and it recommends and supports threads for this reason.

[+] yummypaint|5 years ago|reply
What I do is write single-threaded programs that each process only a section of the data, and manage them with HTCondor. It's some extra overhead and complexity, but this way you can easily scale to hundreds of machines without changing anything. You also get mature queuing and job-management tools that work independently of your program. If your application is easy to parallelize and you might need more than one machine, I highly recommend this route.
[+] CogitoCogito|5 years ago|reply
> We use Python for data analysis as well, and for 95% of operations we are doing, numpy is fast enough that we don't have any complaints. But sometimes, we do wish to be able to take advantage of all the cores in our CPUs, especially now that we can easily get an 8 core CPU for a reasonable price.

I understand the sentiment, but in this case I wonder if it is less a python issue and more an issue with numpy. I don't see any reason why numpy couldn't execute the last line of the following multi-threaded:

    >>> import numpy as np
    >>> a = np.arange(1000000000)
    >>> b = 2 * a
The same thing applies to matrix multiplication. I'm guessing numpy has chosen not to for good reason though.

edit: Actually it seems numpy is multi-threaded under certain circumstances:

https://stackoverflow.com/questions/16617973/why-isnt-numpy-...

[+] N1H1L|5 years ago|reply
Try dask

Distribute your data, run everything as dask.delayed, and then compute only at the end.

Also check out legate.numpy from Nvidia, which promises to be a drop-in numpy replacement that will use all your CPU cores without any tweaks on your part.

https://github.com/nv-legate/legate.numpy

[+] zozbot234|5 years ago|reply
> I feel like the #1 downside of Python for the last few years is that you cannot take advantage of multiple cores of a CPU easily.

Also a big downside with JavaScript. Of course both Python and JS are high-level interpreted languages where high-performance use cases aren't the foremost priority.

[+] syntonym2|5 years ago|reply
I use Python in a scientific context but have so far not written many extensions for Python in any of the languages tested. I'm interested in some guidance on which language (a) is easy to integrate with Python and (b) has good performance, but this benchmark lacks the details to come to any conclusion.

I tried to run the benchmark on my own computer, but the setup documentation was not enough for me to get the julia integration running. I haven't used julia before, so it might just be something very simple.

Similarly, I haven't used poetry much before, and the given documentation failed to install the necessary setuptools-rust for me. I could fix it on my own, but it doesn't make me feel confident about the outcome of the benchmark.

The Rust benchmark did not reproduce for me: "Rust (Pyo3) parallel after" shows a 1.56x speedup for me but a 2.6x slowdown for the author. Also, I don't understand what the difference between "after" and "before" is; the code just calls the same function twice. It might be a JIT/cache thing, but it's unclear to me. One sentence on what before/after refers to would be very helpful.

Generally, all measurements are only done once. Measuring at least three times gives one a chance to detect outliers and allows for some statistics, e.g. is the difference between the different Cython annotations even meaningful?

The "C Cython (pure-python mode)" is reported as faster than "C Cython (.pyx)". The Cython project itself says that using .pyx files should be faster, so something strange is going on.

"Cython is fast, but none of these methods are able to release the GIL." (A) This is not true; (B) the benchmark seems to be mostly about single-threaded performance, so why is that meaningful?

"Rust is not that fast because it needs to copy data; using Pyo3 objects would probably lead to similar results as cython, but with an added library." The Rust code already uses Pyo3, so an "added library" is not necessary, as far as I understand.

I'd guess the performance differences stem more from conversions between different types than from anything else. Maybe Julia (and the python-julia bridge) is particularly smart about this and is thus super easy to use, while pyo3 (and Cython) need more work to interface with Python. Even if that is true, I couldn't tell from the presented data.

With these caveats resolved I'd be interested in the benchmark, but as it stands I can't really conclude anything from it.

[+] jakobnissen|5 years ago|reply
About two years ago (before I switched from Python to Julia), I was in the same boat as you. What I concluded was:

1) Calling into an actual static language like C or Rust is the best option. You get maximal performance and all the benefits of the static language. The downside is that you need to learn another language, and manage both languages in your project, including setup and compilation of the static language etc.

2) Cython is easiest for small-scale projects, since it integrates very well with Python, and you can learn it incrementally. But I found it annoying to work with - it felt like half a language that fell between Python and a proper static language. I ended up using Cython in the end, but I wasn't happy with it.

3) Numba looks interesting and promising. At least as of two years ago, it was too brittle and had too many situations where it didn't work or didn't give noticeable speedups. I'm sure they've improved it since then. I would definitely take a look.

You can always just learn Julia of course and have this entire problem of "my high-level language is too slow" completely disappear ;)

[+] boothby|5 years ago|reply
Their .pyx implementation leaves much to be desired. Among other problems, the hot loop uses an untyped Python list. Also, they're indexing into the list instead of iterating over it, and when creating the list they use .append() instead of a list comprehension.

Fixing those minor issues cuts the runtime in half (see #3). Going for actual high-performance Cython (on my 12-core workstation) cuts the runtime by 20x (see #2).

https://github.com/00sapo/cython_list_test/pull/2 https://github.com/00sapo/cython_list_test/pull/3
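
For reference, the two micro-fixes look like this in plain Python (a sketch only; the actual pull requests above are in Cython):

```python
# Building the list: a comprehension instead of repeated .append() calls.
def build_rows_appending(n, m):
    rows = []
    for _ in range(n):
        row = []
        for _ in range(m):
            row.append(1.0)
        rows.append(row)
    return rows

def build_rows_comprehension(n, m):
    return [[1.0] * m for _ in range(n)]

# Summing: iterate over the lists directly instead of indexing into them.
def total_indexing(rows):
    count = 0.0
    for i in range(len(rows)):
        for j in range(len(rows[i])):
            count += rows[i][j]
    return count

def total_iterating(rows):
    return sum(sum(row) for row in rows)

rows = build_rows_comprehension(1000, 1000)
print(total_iterating(rows))  # 1000000.0
```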

[+] short_sells_poo|5 years ago|reply
I believe the toolchain of Rust is nicer in that you get a relatively small sized and self-contained rust library that can be easily distributed with a python package. Julia can't be easily bundled this way because you need to ship the entire runtime, with all the gubbins this entails.

On the other hand: having used pyo3 to integrate rust with python in the past, the biggest pain is simply to reconcile the dynamism of python with AOT compiled rust code. There's a lot of noise at the interface from the large amount of type checking to unpack the specific types of numpy arrays coming across. Do you want to have your rust code work with all sorts of integer bit lengths? That'll be a code path for each. If you have a number of input types, have fun coercing them all.

This can be alleviated with macros to a degree, but that just hides the problem really.

Julia is JIT-compiled, which means there is no problem with types being determined only at run time. This alone makes integration at the code level much nicer with Julia.

So all in all it's a tradeoff really. I also found the tooling for Rust to be much more robust and stable (1 year ago admittedly).

[+] Gravityloss|5 years ago|reply
I was a huge fan of Matlab way back. I wrote a hundred small Matlab programs for use in the research department of the company I worked at. Doing data operations in Matlab was way more elegant than in, say, Numpy, which I tried later. Development was fast and the ergonomics were good.

After using Ruby for years, returning to Matlab-style code in Julia felt somewhat awkward. Instead of my_array.length you have length(my_array), and to my personal preference the method call is just a nicer way of doing the same thing. The single-instruction-multiple-data dot notation sometimes worked and sometimes didn't, so you had to resort to loops anyway.

    # Julia
    a=[1 2 3 4]
    a.^2
    log.(a)

    # Ruby
    a=[1,2,3,4]
    a.map{|element| element.pow(2)}
    a.map{|element| Math.log(element)}
Ruby with map or each has its verbosity, but overall it feels like a more robust "hammer" for general programming tasks. On the other hand, Julia can be really dense and still easy to understand.

Other languages of course take some of these things even further. Maybe some day I will find an ergonomics-first successor to both.

Plotting with two Y axes or generating histograms in Julia was also way harder than I remember it being in Matlab. And having to manually load the file before every run to see the changes in action added a lot of overhead to the workflow.

The workflow I was used to in Matlab involved very frequent changes to the code, then running it, which usually produced some plot. That is one command in Matlab, and plotting was fast already in 2001, on Windows NT 4. In Julia, you first have to load the modified file, then run it, and plotting takes a long time. One just can't get nearly as productive with it in 2021 as with Matlab of 2001 vintage.

What language would I pick if I had to do some quick analysis from some tables downloaded from the internet? Probably Julia still. If I had free access to Matlab, I would probably use it though.

[+] vchuravy|5 years ago|reply
Broadcasting and map are two different operations. If all the inputs have the same shape, broadcast is equivalent to map, but in Julia you can also just use map:

`map(el->el^2, a)`

and inspired by Ruby

    map(a) do element
        log(element)
    end

The latter is syntactic sugar for the former.

[+] jhgb|5 years ago|reply
> Instead of my_array.length you have length(my_array). In my personal preference the method call is just a nicer way of doing the same thing.

Well...they're both method calls, aren't they? (So they're both the nicer way?)

[+] leephillips|5 years ago|reply
The standard way to make a histogram in Julia is histogram(data)

Using the latest version (1.6, although 1.6.1 just came out), the time to first plot is just a few seconds. After that, plotting in the REPL is instantaneous.

I probably don’t understand what you’re getting at when you speak of making frequent changes to code. REPL-based development in Julia is excellent, and there are Pluto notebooks as well.

[+] urschrei|5 years ago|reply
Rust doesn’t need to copy the data. It’s trivial to pass e.g. Numpy arrays to Rust as slices via Cython (let alone originating in Cython!), modify them, and return them, or use them as input for a new returned struct.

https://github.com/urschrei/simplification

https://github.com/urschrei/lonlat_bng

https://github.com/urschrei/pypolyline

Each of those repos has links to the corresponding Rust “shim” libraries that provide FFIs for dealing with the incoming data, constructing Rust data structures from it, and then transforming it back on the way out.

As a more general comment, using a GC language as the FFI target from a GC language is begging for difficult-if-not-impossible-to-debug crashes down the line.

[+] edenhyacinth|5 years ago|reply
Given that it's via pyO3, you could even pass the numpy arrays using https://github.com/PyO3/rust-numpy and get ndarrays at the other side.

Same no-copy, slightly more user-friendly approach.

A further criticism of the actual approach: even if we didn't do zero-copy, there's no preallocation for the vector despite the size being known upfront, and nested vectors are very slow by default.

So you could speed up the entire thing by passing it to ndarray, and then running a single call to sum over the 2D array you'd find at the other end. (https://docs.rs/ndarray/0.15.1/ndarray/struct.ArrayBase.html...)

[+] bjourne|5 years ago|reply
> As a more general comment, using a GC language as the FFI target from a GC language is begging for difficult-if-not-impossible-to-debug crashes down the line.

Not true!

What you do is keep a registry of objects passed from the host VM to the foreign VM, in which you register each object thus transferred, and use a similar mechanism for objects passed in the other direction. In CPython, you simply increment the refcount to prevent Python from collecting them prematurely.

This is how Java does it (via JNI) and how many GC'ed runtimes interact with other GC'ed runtimes. It is also how you do it in Rust, since Rust can't tell how long an object passed to a foreign VM is supposed to live.
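
A toy sketch of such a registry in pure Python (illustrative only; a real binding layer would do the equivalent with Py_INCREF/Py_DECREF at the C level):

```python
# Hold a strong reference to every object handed across the FFI boundary,
# keyed by a stable integer handle, so the host GC cannot collect it while
# the foreign runtime still holds the handle.
class ExportRegistry:
    def __init__(self):
        self._live = {}   # handle -> object (strong reference pins the object)
        self._next = 1

    def export(self, obj):
        handle = self._next
        self._next += 1
        self._live[handle] = obj
        return handle     # only this integer crosses the boundary

    def resolve(self, handle):
        return self._live[handle]

    def release(self, handle):
        # Called when the foreign side is done; the object becomes
        # collectable again once nothing else references it.
        del self._live[handle]

registry = ExportRegistry()
h = registry.export([1, 2, 3])
print(registry.resolve(h))  # [1, 2, 3]
registry.release(h)
```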

[+] mhh__|5 years ago|reply
> As a more general comment, using a GC language as the FFI target from a GC language is begging for difficult-if-not-impossible-to-debug crashes down the line.

When I was interfacing D code with a part of Unreal Engine that I think is garbage-collected, I actually just took the L and copied everything into buffers allocated with the engine's malloc when handing things off. It wasn't particularly hot code, so the memcpys were worth the peace of mind, ugly as it was.

[+] volta83|5 years ago|reply
To add, it is trivial to guarantee that Rust code is "zero-copy", so if this is something you care about, Rust allows your program to fail to compile if it tries to make a copy.
[+] whateveracct|5 years ago|reply
> As a more general comment, using a GC language as the FFI target from a GC language is begging for difficult-if-not-impossible-to-debug crashes down the line.

Safe interop between two GC'd languages (Haskell and Java) was one motivation of Haskell's new -XLinearTypes extension

https://www.tweag.io/blog/2020-02-06-safe-inline-java/

Generally, the new linear types are exciting but still nascent. One extension to them (linear constraints [1]) seems to allow embedding the equivalent of Rust's ownership in Haskell using more primitive features of the type system (to my understanding).

[1] https://arxiv.org/pdf/2103.06127.pdf


[+] duckerude|5 years ago|reply
I'm also getting a 30% speedup simply from manipulating &PyList references instead of Vecs (without parallelism).
[+] chalst|5 years ago|reply
There is no problem at all interfacing Julia's C API with Rust. It's a shame that C++ and Rust are bad fits together, but this actually strengthens the argument for Julia as the extension language rather than Rust: Julia interfaces easily not just with C but also with the C++ code in which much of the world's best numerical algorithms are written, and it has an unrivalled FFI to Python, while Rust's object system is a poor fit for either Python or C++.

> As a more general comment, using a GC language as the FFI target from a GC language is begging for difficult-if-not-impossible-to-debug crashes down the line.

<s>As for the FUD about interfacing JIT code to C/C++, this is a problem Julia was designed from the outset to tackle. Incidentally, Julia built on the excellent experience LuaJIT has had. A challenge: can you name a particular extension that would interface better from Rust than from Julia?</s>

Oops, I misunderstood the criticism you were hinting at but failing to justify, sorry. OK, Julia already interfaces with Python, and this interface sees widespread use. If your suspicion is right, then where are the horror stories from people who were bitten when deploying code built on the interface?

[+] Nimitz14|5 years ago|reply
one word: pybind11

I honestly think people severely underestimate the massive impact this is currently having on the productivity of Python devs who know a little C++.

[+] otabdeveloper4|5 years ago|reply
I concur, pybind11 is the smoothest Python extension story I've experienced so far.
[+] syzygyhack|5 years ago|reply
Why not Nim?
[+] jasfi|5 years ago|reply
Nim should be more popular, and I believe it will be in time.
[+] BiteCode_dev|5 years ago|reply
I never considered Julia to extend python but it makes sense.

Although, as mentioned in the benchmark, with PyO3 Rust will be a superb contender, especially considering:

- the toolchain is the easiest to setup

- you can embed asm if you really need this extra juice

[+] olliemath|5 years ago|reply
Pure Python under PyPy is, on my system, faster than all of the above :D
[+] N1H1L|5 years ago|reply
Until Cython can handle real scientific stuff like FFTs, or curve fitting natively, it's useless and good only for toy problems.
[+] ur-whale|5 years ago|reply
From the benchmarks:

> Rust is not that fast because it needs to copy data;

I'm surprised.

Don't know much about Rust, but isn't it hailed as being competitive with C/C++ ?

[+] yakubin|5 years ago|reply
You left out the explanation from your quote. The full quote is:

> Rust is not that fast because it needs to copy data; using Pyo3 objects would probably lead to similar results as cython, but with an added library.

"It needs to copy data" because it's converting Python objects into Rust objects and back again. As the full quote states, it could be written differently, although that would make the code look really weird.

In this benchmark there is also no normal C++ involved. There is only Cython using some C++ ints, but operating on Python lists. And that's not handled by writing C++ code that operates on Python structures, but writing Python code in Cython in a "C++ mode".

Apples to oranges.

[+] mhh__|5 years ago|reply
You mean C++.

C usually comes out a bit worse because it gives even less information to the optimizer.

And thanks to the GNU/LLVM monoculture, most languages should perform roughly the same if they target those backends. E.g. I have found that D makes it ridiculously easy to write highly specialized code that is both readable and visible to the optimizer (I will be blogging about that part later), but I'm sure I could force the same asm out of C++ or Rust, etc.

[+] PartiallyTyped|5 years ago|reply
The culprit here is `a_list: Vec<Vec<f64>>`, which means the list of lists is copied as a whole; hence the Rust version is slower.

In general, rust is at least as fast as C or C++, but this seems to be a special case.

[+] iExploder|5 years ago|reply
Rust has the concept of references; maybe the quote was referring to this particular implementation of the problem.
[+] sapo|5 years ago|reply
When considering python standard types and ease of use, such as lists and dicts