Making Python faster with Rust

Good for you! You did everything right: measure always, fix the bottleneck if possible, rewrite if necessary.

A little tip, you don't have to compare actual distances, you can compare squared distances just as well. Then in `norm < max_dist`, you don't have to do a `sqrt()` for every `norm`. Saves a few CPU ticks as well.

I once rewrote a GDI+ point transformation routine in pure C# and got 200x speedup just because the routine was riddled with needless virtual constructors, copying type conversions, and something called CreateInstanceSlow. Ten years after, I gathered a few of these anecdotes and wrote the Geometry for Programmers book (https://www.manning.com/books/geometry-for-programmers) with its main message: when you know geometry behind your tools, you can either use them efficiently, or rewrite them completely.

masklinn|2 years ago

The author did talk about it on reddit, but explained that for the purpose of the blog post they wanted to focus on the big stuff and profiling-guided optimisation of the process: https://reddit.com/r/rust/comments/125pbq0/blog_post_making_...

ZeroCool2u|2 years ago

The ignore the square root while computing/comparing distances trick is a great one. That's how I got to the top of the performance leaderboard in my first algorithms class.

appeldorian|2 years ago

I think a big mistake in the article, in a context where performance is the main objective, is that the author uses an array of structs (AoS), rather than a struct of arrays (SoA). An SoA makes it so that the data is ordered contiguously, which is easy to read for the CPU, while an AoS structure interleaves different data (namely the x and y in this case), which is very annoying for the CPU. A CPU likes to read chunks of data (for example 128 bits of data/read) and to process these with SIMD instructions, executing a multiple of calculations with one CPU cycle. This is completely broken when using an array of structs.

He uses the same data structure in both the Python and Rust code, so I imagine that he can get an extra 4x speedup at least if he rewrites his code with memory layout in mind.

tremon|2 years ago

Apache Arrow (https://arrow.apache.org/overview/) is built exactly around this idea: it's a library for managing the in-memory representation of large datasets.

ohr|2 years ago

Author here: I agree, that's a great perf advice (esp. when you can restructure your code).

I couldn't get into a this in the article (would be too long), but this is a great point and the original library does this in a lot of places.

One problem in our use case is that the actual structs members are pretty big & that we need to group/regroup them a lot.

The fastest approach for us was to do something like in the article for the initial filtering, then build a hashmap of SoAs with the needed data, and do the heavier math on that.

Yoric|2 years ago

Having used this approach in a few languages, I agree that it's (sometimes) (much) better for performance, but it tends to wreak havoc on readability.

prirun|2 years ago

Modern CPU caches are usually loaded in 64-byte units - much larger than 128 bits. I just ran some tests with a C program on an Intel I5 with both AoS and SoA using a list of 1B points with 32-bit X and Y components. Looping through the list of points and totaling all X and Y components was the same speed with either AoS or SoA.

It's easy to make intuitive guesses about how things are working that seem completely reasonable. But you have to benchmark because modern CPUs are so complex that reasoning and intuition mostly don't work.

Programs used for testing are below. I ran everything twice because my system wasn't always idle, so take the lower of the 2 runs.

  [jim@mbp ~]$ sh -x x
  + cat x1.c
  #include <stdio.h>

  #define NUM 1000000000

  struct {
    int x;
    int y;
  } p[NUM];

  int main() {
    int i,s;

    for (i=0; i<NUM; i++) {
      p[i].x = i;
      p[i].y = i;
    }

    s=0;
    for (i=0; i<NUM; i++) {
      s += p[i].x + p[i].y;
    }
    printf("s=%d\n", s);
  }
  + cc -o x1 x1.c
  + ./x1
  s=1808348672

  real 0m12.078s
  user 0m7.319s
  sys 0m4.363s
  + ./x1
  s=1808348672

  real 0m9.415s
  user 0m6.677s
  sys 0m2.685s
  + cat x2.c
  #include <stdio.h>

  #define NUM 1000000000

  int x[NUM];
  int y[NUM];

  int main() {
    int i,s;

    for (i=0; i<NUM; i++) {
      x[i] = i;
      y[i] = i;
    }

    s=0;
    for (i=0; i<NUM; i++) {
      s += x[i] + y[i];
    }
    printf("s=%d\n", s);
  }
  + cc -o x2 x2.c
  + ./x2
  s=1808348672

  real 0m9.753s
  user 0m6.713s
  sys 0m2.967s
  + ./x2
  s=1808348672

  real 0m9.642s
  user 0m6.674s
  sys 0m2.902s
  + cat x3.c
  #include <stdio.h>

  #define NUM 1000000000

  struct {
    int x;
    int y;
  } p[NUM];

  int main() {
    int i,s;

    for (i=0; i<NUM; i++) {
      p[i].x = i;
    }
    for (i=0; i<NUM; i++) {
      p[i].y = i;
    }

    s=0;
    for (i=0; i<NUM; i++) {
      s += p[i].x;
    }
    for (i=0; i<NUM; i++) {
      s += p[i].y;
    }
    printf("s=%d\n", s);
  }
  + cc -o x3 x3.c
  + ./x3
  s=1808348672

  real 0m13.844s
  user 0m11.095s
  sys 0m2.700s
  + ./x3
  s=1808348672

  real 0m13.686s
  user 0m11.038s
  sys 0m2.611s
  + cat x4.c
  #include <stdio.h>

  #define NUM 1000000000

  int x[NUM];
  int y[NUM];

  int main() {
    int i,s;

    for (i=0; i<NUM; i++)
      x[i] = i;
    for (i=0; i<NUM; i++)
      y[i] = i;

    s=0;
    for (i=0; i<NUM; i++)
      s += x[i];
    for (i=0; i<NUM; i++)
      s += y[i];

    printf("s=%d\n", s);
  }
  + cc -o x4 x4.c
  + ./x4
  s=1808348672

  real 0m13.530s
  user 0m10.851s
  sys 0m2.633s
  + ./x4
  s=1808348672

  real 0m13.489s
  user 0m10.856s
  sys 0m2.603s

b0b10101|2 years ago

This is a great article but there's still a core problem there - why should developers have to choose between accessibility and performance?

So much scientific computing code suffers between core packages being split away from their core language - at what point do we stop and abandon python for languages which actually make sense? Obviously julia is the big example here, but its interest, development and ecosystem doesn't seem to be growing at a serious pace. Given that the syntax is moderately similar and the performance benefits are often 10x what's stopping people from switching???

fbdab103|2 years ago

Today, there is a Python package for everything. The ecosystem is possibly best in class for having a library available that will do X. You cannot separate the language from the ecosystem. Being better, faster, and stronger means little if I have to write all of my own supporting libraries.

Also, few scientific programmers have any notion of what C or Fortran is under the hood. Most are happy to stand on the shoulders of giants and do work with their specialized datasets. Which for the vast majority of researchers are not big data. If the one-time calculation takes 12 seconds instead of 0.1 seconds is not a problem worth optimizing.

bakuninsbart|2 years ago

Because professional software developers with a background in CS are a minority of people who program today. The learning curve of pointers, memory-allocation, binary operations, programming paradigms, O-Notation and other things you need to understand to efficiently code in something like C is a lot to ask of someone who is for example primarily a sociologist or biologist.

The use case btw. is often also very different. In most of academia, writing code is basically just a fancy mode of documentation for what is basically a glorified calculator. Readability trumps efficiency by a large margin every time.

brahbrah|2 years ago

They would have gotten the same performance in python with numpy if they did it like this instead of calling norm for every polygon

centers = np.array([p.center for p in ps]) norm(centers - point, axis=1)

They were just using numpy wrong. You can be slow in any language if you use the tools wrong

ThouYS|2 years ago

everything. why are there still cobol programmers? why is c++ still the defacto native language (also in research)?

but also I don't see any problem there, I think the python + c++/rust idiom is actually pretty nice. I have a billion libs to choose from on either side. Great usability on the py side, and unbeatable performance on the c++ side

ubj|2 years ago

One of Julia's Achilles heels is standalone, ahead-of-time compilation. Technically this is already possible [1], [2], but there are quite a few limitations when doing this (e.g. "Hello world" is 150 MB [6]) and it's not an easy or natural process.

The immature AoT capabilities are a huge pain to deal with when writing large code packages or even when trying to make command line applications. Things have to be recompiled each time the Julia runtime is shut down. The current strategy in the community to get around this seems to be "keep the REPL alive as long as possible" [3][4][5], but this isn't a viable option for all use cases.

Until Julia has better AoT compilation support, it's going to be very difficult to develop large scale programs with it. Version 1.9 has better support for caching compiled code, but I really wish there were better options for AoT compiling small, static, standalone executables and libraries.

[1]: https://julialang.github.io/PackageCompiler.jl/dev/

[2]: https://github.com/tshort/StaticCompiler.jl

[3]: https://discourse.julialang.org/t/ann-the-ion-command-line-f...

[4]: https://discourse.julialang.org/t/extremely-slow-execution-t...

[5]: https://discourse.julialang.org/t/extremely-slow-execution-t...

[6]: https://www.reddit.com/r/Julia/comments/ytegfk/size_of_a_hel...

shakow|2 years ago

IME, for having used Julia quite extensively in Academia:

- the development experience is hampered by the slow start time;

- the ecosystem is quite brittle;

- the promised performances are quite hard to actually reach, profiling only gets you so far;

- the ecosystem is pretty young, and it shows (lack of docs, small community, ...)

> what's stopping people from switching???

All of the mentioned above, inertia, perfect is the enemy of good enough, the alternatives are far away from python ecosystem & community, performances are not often a show blocker.

ActorNightly|2 years ago

I don't know whether this sentiment is just a byproduct of CS education, but for some reason people equate a programming language with the compute that goes on under the hood. Like if you write in Python, you are locked into the specific non optimized way of computing that Python does.

Its all machine code under the hood. Everything else on top is essentially description of more and more complex patterns of that code. So its a no brainer that a language that lets you describe those complex but repeating patterns in the most direct way is the most popular. When you use python, you are effectively using a framework on top of C to describe what you need, and then if you want to do something specialized for performance, you go back to the core fundamentals and write it in C.

visarga|2 years ago

Julia doesn't get the latest models first, or have as big of a community.

korijn|2 years ago

A vectorized implementation of find_close_polygons wouldn't be very complex or hard to maintain at all, but the authors would also have to ditch their OOP class based design, and that's the real issue here. The object model doesn't lend itself to performant, vectorized numpy code.

pbowyer|2 years ago

What's a good guide to learn how to make (and see) vectorized code? It's a mindshift and not one I find easy.

rwalle|2 years ago

Exactly, that is the real issue, vectorization might be good enough in terms of performance. It doesn't seem to be mentioned in the article at all.

FreeHugs|2 years ago

The most important part of the article seems to be that this Python code is taking "an avg of 293.41ms per iteration":

        def find_close_polygons(
            polygon_subset: List[Polygon], point: np.array, max_dist: float
        ) -> List[Polygon]:
            close_polygons = []
            for poly in polygon_subset:
                if np.linalg.norm(poly.center - point) < max_dist:
                    close_polygons.append(poly)

            return close_polygons

And after replacing it with this Rust code, it is taking "an avg of 23.44ms per iteration":

        use pyo3::prelude::*;
        use ndarray_linalg::Norm;
        use numpy::PyReadonlyArray1;

        #[pyfunction]
        fn find_close_polygons(
            py: Python<'_>,
            polygons: Vec<PyObject>,
            point: PyReadonlyArray1<f64>,
            max_dist: f64,
        ) -> PyResult<Vec<PyObject>> {
            let mut close_polygons = vec![];
            let point = point.as_array();
            for poly in polygons {
                let center = poly
                    .getattr(py, "center")?
                    .extract::<PyReadonlyArray1<f64>>(py)?
                    .as_array()
                    .to_owned();

                if (center - point).norm() < max_dist {
                    close_polygons.push(poly)
                }
            }

            Ok(close_polygons)
        }

Why is the Rust version 13x faster than the Python version?

spi|2 years ago

Yeah but the Python code is so bad that it's easy to get a 10x speedup using only numpy, as well. The current code essentially does:

    import numpy as np

    n_sides = 30
    n_polygons = 10000

    class Polygon:
        def __init__(self, x, y):
            self.x = x
            self.y = y
            self.center = np.array([self.x, self.y]).mean(axis=1)


    def find_close_polygons(
        polygon_subset: List[Polygon], point: np.array, max_dist: float
    ) -> List[Polygon]:
        close_polygons = []
        for poly in polygon_subset:
            if np.linalg.norm(poly.center - point) < max_dist:
                close_polygons.append(poly)

        return close_polygons

    polygons = [Polygon(*np.random.rand(2, n_sides)) for _ in range(n_polygons)]
    point = np.array([0, 0])
    max_dist = 0.5

    %timeit find_close_polygons(polygons, point, max_dist)

(I've made up number of sides and number of polygons to get to the same order of magnitude of runtime; also I've pre-computed centers, as they are cached anyway in their code), which on my machine takes about 40ms to run. If we just change the function to:

    def find_close_polygons(
        polygon_subset: List[Polygon], point: np.array, max_dist: float
    ) -> List[Polygon]:
        centers = np.array([polygon.center for polygon in polygon_subset])
        mask = np.linalg.norm(centers - point[None], axis=1) < max_dist
        return [
            polygon
            for polygon, is_pass in zip(polygon_subset, mask)
            if is_pass
        ]

then the same computation takes 4ms on my machine.

Doing a Python loop of numpy operations is a _bad_ idea... The new code hardly even takes more space than the original one.

(as someone else mentioned in the comments, you can also directly use the sum of the squares rather than `np.linalg.norm` to avoid taking square roots and save a few microseconds more, but well, we're not in that level of optimization here)

winrid|2 years ago

Python's for loop implementation is slow, also. You can use built in utils like map() which are "native" and can be a lot faster than a for loop with a push:

https://levelup.gitconnected.com/python-performance-showdown...

hannofcart|2 years ago

I was surprised that the Rust version is _only_ 13x as fast as the Python version.

lenkite|2 years ago

Shouldn't `close_polygons` be presized in both Python and Rust to avoid repeated allocation and copy ?

nickstinemates|2 years ago

One carries the entire feature set of the python runtime, the other is compiled.

hoseja|2 years ago

The final code takes just 2.90ms per iteration.

ssivark|2 years ago

Making Python (near infinitely) faster by using it as a glue language, and running all the computation outside Python :-P

INTPenis|2 years ago

Yeah what's wrong with that? I think this sounds amazing. It gives you all the fast prototyping and simplicity of Python, but once you hit that bottleneck all you have to do is bring in a ringer to replace key components with a faster language.

No need to use Golang or Rust from the start, no need for those resources until you absolutely need the speed improvement. Sounds like a dream to a lot of people who find it much easier to develop in Python.

baq|2 years ago

This is basically what Python was first designed for and as evidenced by the article still excels at

flohofwoe|2 years ago

Well, that's exactly where Python works well. As scripting glue sitting between or above native code which does the heavy-lifting.

est|2 years ago

> and running all the computation outside Python

You don't have to move all the computation, just `for` loops will help alot.

majoe|2 years ago

I had a similar problem, when I was working as a PhD student a few years ago, where I needed to match the voxel representation of a 3D printer with the tetrahedral mesh of our rendering application.

My first attempt in Python was both prohibitively slow and more complicated than necessary, because I tried to use vectorized numpy, where possible.

Since this was only a small standalone script, I rewrote it in Julia in a day. The end result was ca. 100x faster and the code a lot cleaner, because I could just implement the core logic for one tetrahedron and then use Julia's broadcast to apply it to the array of tetrahedrons.

Anyway, Julia's long startup time often prohibits it from being used inside other languages (even though the Python/Julia interoperability is good). On the contrary the Rust/Python interop presented here seems to be pretty great. Another reason I should finally invest the time to learn Rust.

hgomersall|2 years ago

Numba is great if you want to write a naive loop approach in python.

dunefox|2 years ago

Long startup time is relative. I believe it's much lower now than a couple of versions ago. 0.15s or so? Interop between python and rust will also take time.

xgdgsc|2 years ago

Julia 1.9 is fast. And you can use https://github.com/Suzhou-Tongyuan/jnumpy to write python extension in Julia now. So I think after 1.9 release julia would be much more usable.

brahbrah|2 years ago

This was a silly and unnecessary optimization. He’s just using numpy wrong.

Instead of:

for p in ps: norm(p.center - point)

You should do:

centers = np.array([p.center for p in ps]) norm(centers - point, axis=1)

You’ll get your same speed up in 2 lines without introducing a new dependency

_glass|2 years ago

Isn't this the version of refenced on the github repo [0] which speeds up 6x instead of 101x?

  There's also a "v1.5" version which is 6x faster, and uses "vectorizing" (doing more of the work directly in numpy). This version is much harder to optimize further.

[0] https://github.com/ohadravid/poly-match

jerf|2 years ago

This is the major reason I don't really buy into things like JITs solving all performance problems (as long as you carefully write only and exactly the subset of the language they work well with) or NumPy not being affected by Python being slow. There's more code like this in the world than I think people realize.

Having to write in a subset of a language in order for it to perform decently is a big deal. Having no feedback given to the programmer when you deviate from the fast path makes it even harder to learn what the fast path is. The result is not that you get the ease of Python and the speed of C without having to understand much; the result is that you have to be a fairly deep expert in Python and understand the C bindings intimately and learn how to avoid doing what is the natural thing in Python, the Python covered in all the tutorials, or you end up writing your code to run at native Python speeds without even realizing it.

It's a feasible amount of knowledge to have, it's not like it's completely insane, but it's still rather a lot.

My career just brushed this world and I'm glad I bounced off of it. It would drive me insane to walk through this landmine field every day, and then worse, have to try to guide others through it all the while they are pointing at all the "common practices" that are also written by people utterly oblivious to all this.

oblio|2 years ago

This is nice, how would you go about as a performance noob? I can't imagine there's a line in the docs saying "this is slow!".

thanatropism|2 years ago

I feel a lot of the "Python perf" thing is an inferiority complex. cPython is getting faster all the time, and (obviously using libraries like numpy and others that hook into compiled code) I don't think I've ever seen it become a business bottleneck. If it's ever a scaling problem, then you hire lower-level language devs, it's a good problem to have.

Python is much, much easier to learn, and Rust is notoriously difficult. This obviously feeds into the inferiority complex. But -- while I do know there's a number of applications where perf is crucial -- I think it's well worth doing an ego check before moving from what's a lower friction path and gives you access to myriad developers, including ones trained in very hard disciplines.

Also: does it ever make sense to write something like a CRUD+ backend in Rust? Maybe 100X Rustaceans can do it with one hand tied behind their backs; but imagine what these ubermenschen could be achieving in Python?

karamanolev|2 years ago

We're working as an internal dev team for a moderately large company and the challenge is that very few people have to maintain a ton of business automation. We love our Python+Django stack as it gives it a ton of productivity, but as our business and data grows, some things are becoming quite slow.

For now, we're getting away with Python-only optimizations and relying on cPython improvements. Still, some processes take a few minutes and a lot of that is spent just because Python is quite slow. In that sense, yes, it happens for CRUD+ applications (which I consider our CRM+ERP to be).

faitswulff|2 years ago

I think you might be surprised. I would say Rust isn't notoriously difficult per se, it's just harder to please the compiler. But that's still viewing the compiler as an adversary when it's more like an assistant, so the analogy breaks down. You don't have to be a 100x engineer to use Rust. In fact, quite the opposite. Rust gives engineers a lot more guardrails to prevent what would be runtime errors in other languages.

insanitybit|2 years ago

I think for a lot of people, like myself, writing Rust is just as easy as writing Python (actually it's way easier IMO). So when people write Python it's sort of like "uh, why would you choose the slower tool?".

Obviously the answer is "because I'm faster with Python" but then part of my brain is "well get good idk".

pjmlp|2 years ago

If writing Tcl extensions in C during the .com wave 23 years ago taught me something, was that glue languages are great, and that I don't want to use any that doesn't come with either AOT or JIT compiler in the box, other than for OS scripting tasks.

JodieBenitez|2 years ago

> Also: does it ever make sense to write something like a CRUD+ backend in Rust?

Well... given 2 frameworks equally pleasant to work with, why not using the one with the best performance ? (Whatever performance means... less cpu ? less memory ? better i/o ?).

As a Django user, working on problems where Django shines, I have yet to see such a solution in Rust, but that doesn't mean it won't happen one day.

karussell|2 years ago

I wonder why GraalVM is not more often used for these speed critical cases: https://www.graalvm.org/python/ (Same for ruby https://www.graalvm.org/ruby/)

Is the problem the Oracle involvement? Or is it not that fast as advertised or problems with the ecosystem (C libraries)?

thisgoodlife|2 years ago

My concern is that it’s not ready for prod yet.

“At this point, the Python runtime is made available for experimentation and curious end-users. “

https://www.graalvm.org/latest/reference-manual/python/

dunefox|2 years ago

I would not use anything made by oracle if I have the choice. My team sees it similarly.

cornholio|2 years ago

Language design request:

Take from Rust Algebraic types, ahead of time compilation and strong types, functional features, a borrow / escape checker that automatically turns shared data into Rc or Arc, as necessary, instead of tormenting me to rewrite performance irrelevant code;

Take from Python the simple syntax, default pass by reference of all non-numeric types, simplified string handling, unified slice and array syntax - and any other simplifying feature possible.

... and give me a fast, safe and powerful language that gets out of my way while maximizing the power of the compiler to prevent bugs. Golang was a commendable attempt, but they made '70s design decisions that condemned the language: mandatory garbage collection, nullables, (void*) masquerading as interface{} casts, under-powered compiler etc.

machiaweliczny|2 years ago

I think Nim is closest to this wishlist

za3faran|2 years ago

Modern Java and modern C# should be closer to what you described.

On a side note, python is strictly pass by value. For non-primitives, their references are passed by value.

mijoharas|2 years ago

Sounds like swift to me

jcolella|2 years ago

This is possibly one of the best written articles end-to-end I have read. Excellent job telling the story

brahbrah|2 years ago

Agreed it was well written, but kinda pointless though since they could have “solved” the problem using the existing tools in a couple lines of code without any new deps. All that content annd profiling and they missed the fact that they were using numpy wrong.

Dowwie|2 years ago

Elixir is made faster with Rust, too. Rust is a great skillset to have for those measured moments.

saeranv|2 years ago

I wonder if being able to quickly retrieve a numpy array of the polygon centers would make an equivalent difference. Since then you could at least retrieve the centers from the polygon as an array you could just use numpy operations for the closest polygon operation:

``` centers = get_centers(polgons) # M x 3 array close_idx = np.where( np.linalg.norm(centers - point, axis=1) < max_dist)[0] close_polygons = polygons[close_idx,:] ```

That's one reason I prefer for to use arrays for polygons, rather then abstract it into a Python object. Fundamentally geometries are sequences of points, and with some zero-padding to account for irregular point counts, you can still keep them in a nice, efficient array representation.

toxik|2 years ago

Agreed, I speed up Python numpy code with numba quite often and it isn’t at all unreadable to put it in an ndarray subclass.

    poly = Polygon(vertices)

I would bet you can achieve just as much of a speedup with numba or Cython using this form.

swyx|2 years ago

in the domain of Python tooling made faster with Rust, check out https://github.com/charliermarsh/ruff which is 10-100x faster than pylint etc

moreresearchplz|2 years ago

Like to see comparison to when they use a spatial index like strtree or rtree.

crabbone|2 years ago

> Python is a superb API for researchers,

Where does this nonsense come from? No. It isn't. It's just a stupid fashion. Something that should be discouraged, not encouraged by trying to make it work when it's obviously broken.

As someone who does work with researchers who do use Python a lot, I see the everyday painful experiences of people who use it. And this pain doesn't need to be there. It's just masochism. And the only real reason is that they don't know any better. The only other thing they know is Matlab, and that's even worse.

Python is just a bad language. Popular, but awful. Ironically, while researchers are supposed to be on the forerfront of discovery and technology... well, they aren't. Industry outpaced research. So much so that today there are government programs to onboard researchers into more automated and more automatically verified way to do research. And we aren't talking about making an elite force here. These programs are meant for people in research who copy data from Excel sheets one data point at a time into another spreadsheet. It's that kind of bad.

My wife happened to work in such a government center, and that's how I know about what's going on inside these programs. And it's very sad that decisions about the preferred tools for research automation are made by people who, unlike most of their peers, had some exposure to what happens in the industry, but had no deeper understanding of the reasons any particular technology ended up in any particular niche, nor any independent ability to assess the capabilities of any particular tech. It's really sad what's going on there.

ActorNightly|2 years ago

>Python is just a bad language. Popular, but awful.

You couldn't be more wrong. Python is the language going forward. There is a reason why bleeding edge ML stuff is done through Python, as well as it being the backend to several very popular web platforms and is second most use language on github behind JS solely because JS is hard tied to web.

I have a feeling the hate for Python is just comes from paradigms that are taught in extremely poor CS curriculum in schools. If you think that Python is bad because of dynamic typing, you haven't been paying attention to the direction compute is going.

pahbloo|2 years ago

What are the best alternatives to Python then?

xgdgsc|2 years ago

Rust is verbose. I would use https://github.com/Suzhou-Tongyuan/jnumpy to write python extension in Julia and usually get similar performance.

shakow|2 years ago

Does it compile AoT or are you stuck with Julia start time?

wdroz|2 years ago

One of the others understated pros of rewriting some parts in Rust, it's that you can parallelize easily with Rayon[0]

[0] -- https://github.com/rayon-rs/rayon

wcrossbow|2 years ago

Rust is great but isn’t the core problem here using the wrong algorithm? It looks like this is ideally suited for a quad tree instead of a naive for loop. I would expect that to pulverise any current benchmark.

flohofwoe|2 years ago

I guess you also need to take the time into account to create the quad tree from an unsorted 'polygon soup' first, and in terms of coding effort, a brute force conversion from python to a compiled tight loop over unsorted arrays provides a lot of bang for the buck (and a speedup of 100x for relatively little effort might be 'good enough' for quite a while until the input data grows big enough to require the next optimization effort).

(e.g. it doesn't need to be "as fast as possible", just fast enough to no longer be a workflow bottleneck)

wanderingmind|2 years ago

Why do I need to rewrite in Rust, when I can just use Polars[1] that will cover most usecases [1] https://www.pola.rs/

masklinn|2 years ago

What is the relation? Polars replaces pandas, not numpy.

wenc|2 years ago

Polars and DuckDB have replaced Pandas for me for the most part.

yjftsjthsd-h|2 years ago

> The library was already using numpy for a lot of its calculations, so why should we expect Rust to be better?

I literally clicked in to read the article to see if they'd mention this:) But... unless I missed it, there wasn't really an answer? I thought numpy does do the heavy lifting in native code, so why is this faster? Does this version just push more of the logic into native code than numpy did?

alex_smart|2 years ago

Numpy is fast when the code is vectorized. The code they are benchmarking against was not vectorized. They wanted to calculated the distances of n points against a given point and find out which points are closer than a threshold (max_dist). Instead of vectorizing the whole operation, the python code was just calling numpy in a loop to find the distance of two points.

Just that small change already gives 10x faster performance without ever leaving python/numpy land.

fbdab103|2 years ago

They did have a choice quote a bit before the blurb you quoted. The real world problem is sufficiently more complex/different enough that vectorizing it would be a pain.

>It’s worth noting that converting parts of / everything to vectorized numpy might be possible for this toy library, but will be nearly impossible for the real library while making the code much less readable and modifiable...

anewhnaccount2|2 years ago

The article mentioned that there are some gains from using vectorised numpy methods, i.e. spending more time in numpy code. I would be interested in whether the List[Polygon] could be converted into two long arrays of all xs and all ys with indices into the starts (essentially a dense representation of a sparse array) and then the core function rewritten for Numba since it could now not use any Python objects. This would break the interface of course, but may get within striking distance of the Rust implementation.

biomcgary|2 years ago

The slowness comes from the interaction of numpy and a Python object "Polygon", which in not numpy. I suspect that a sufficiently clever coder could have optimized the result without resorting to Rust, but at the cost of a substantial increase in complexity of the codebase. The proposed approach keep the Python code simple (and moves the complexity into having another language to deal with).

lmeyerov|2 years ago

People can write slow code in any language ;-) We've had to fire contractors who "wrote" dataframe code but was not vectorized in practice despite repeat request to do so. Same thing for slow code in CUDA.

From a maintenance view, I much prefer folks write vectorized data frames vs numpy or low level bindings, but that comes from having lived with the alternative for a lot longer. All of our exceptions are pretty much slotted for deletion. (Our fast path is single or multi-GPU dataframes in python.) Here's to hoping that one day we'll have dependent types in mypy!

osmanbaskaya|2 years ago

I am really curious if there is an important reason why not trying this performance improvement with Cython first. Can someone comfortable with Cython explain what are the pros and cons doing this optimization with Cython?

1MachineElf|2 years ago

Python and Rust. Apple has some job openings in distributed systems for people fluent in both.

gjourdvhiokhf|2 years ago

Is Apple still hybrid only, no remote? Any links to some of their careers pages on this?

Piqued my curiosity as my background almost lines up, and I'm interested in that kind of role... But I live near one of their offices only hiring for skill sets I don't really have.

poulpy123|2 years ago

I'm wondering how much would be the speedup by rewriting the critical part in cython

jacquesm|2 years ago

It's not really Python that was sped up though, it was an application written in python augmented with a bit of optimized rust code.

This sort of hybrid is super common, you typically spend 90%+ of your time in computationally intensive problems in a very small subset of your code, typically the innermost loops. Optimizing those will have very good pay-off.

Traditionally we'd do this with high level stuff in one language and then assembly for the performance critical parts, these days it is more likely a combination of a scripting language for the high level part and a compiled language for the low level parts (C, rust, whatever). Java and such as less suitable for such optimization purposes, both because they come with a huge runtime and because they are hard to interface to other languages unless they happen to use the same underlying VM, but then there usually isn't much performance gain.

Another nice way to optimize computationally intensive code is by finding out if the code is suitable for adaptation to the GPU, which can give you many orders of magnitude speed improvement in some cases.

brandonpelfrey|2 years ago

Very cool! I can see myself using this soon actually :) On top of the "code speed up" this is a good problem for 2d data structures for performing this type of "find objects within radius" type of query.

orangepurple|2 years ago

It is an alternative framing of the N+1 problem (mistake) SQL users make

https://news.ycombinator.com/item?id=34207974

sandGorgon|2 years ago

>Also, using any JIT-based tricks (PyPy / numba) results in very small gains (as we will measure, just to make sure).

i wasnt able to see the numba comparison. anyone know how much worse it was ?

mattbillenstein|2 years ago

Kudos - this is a very nice blog post - "real" engineering ;)

kavalg|2 years ago

For this particular code, vectorization and some acceleration library, such as JAX, may be a better path to optimization. Otherwise an excellent article!

Heston|2 years ago

It's much easier and more accurate to time your python scripts with `time python script.py`. Cool write up.

masklinn|2 years ago

`time` is absolutely awful, with the minor exception of bsd’s time maybe.

If you’re going to benchmark scripts or executables, use hyperfine.

firechickenbird|2 years ago

The premise was to not rewrite everything in rust, but you basically ended up rewriting 90% of it in rust

smnrchrds|2 years ago

90% of the bottleneck, not 90% of their whole application. The author says that rewriting everything in Rust would have taken months, so the whole application must be huge.

"It is big and complex and very business critical and highly algorithmic, so that would take ~months of work, ..."

Animats|2 years ago

Using PyPy, which is a real compiler, might help.

That's doing spatial data processing by exaustive search, which is inherently slow. There are better algorithms. If the number of items to be searched is large, the spatial indices of MySQL could help.

why_only_15|2 years ago

They tried PyPy at the beginning and it was 2x slower. Plausibly it would be better with additional optimization, but it's not cut and dry.

laerus|2 years ago

PyPy is a JIT compiler not a "real compiler", it requires warm up time to start optimizing code on runtime.

rwalle|2 years ago

Did you read the article?

akasakahakada|2 years ago

The original code already look crap due to making a new list containing object instead of a mask or something else. Also that can be done using list comprehension. Also it totally can be vectorize or parallelize.

You need more experienced Python engineer.

tus666|2 years ago

> Rust (with the help of pyo3) unlocks true native performance for everyday Python code, with minimal compromises.

Hasn't he heard of ctypes? You can wrap C structs add Python objects since forever.

UncleEntity|2 years ago

There’s probably a way to get at the numpy objects without having to go through the python runtime and do all the computation in pure C.

I assume, haven’t really messed with numpy for anything but I can’t imagine it wouldn’t work that way.

unknown|2 years ago

[deleted]

isaacfrond|2 years ago

Title is terrible. Better: A Python code is 100x faster by rewriting small part in Rust

dang|2 years ago

We've taken the magic numbers out of the title now, as the site guidelines ask - https://news.ycombinator.com/newsguidelines.html

215 comments