Good for you! You did everything right: measure always, fix the bottleneck if possible, rewrite if necessary.
A little tip, you don't have to compare actual distances, you can compare squared distances just as well. Then in `norm < max_dist`, you don't have to do a `sqrt()` for every `norm`. Saves a few CPU ticks as well.
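To make the trick concrete, here is a minimal numpy sketch (the names are made up for illustration): comparing squared distances against a squared threshold yields exactly the same mask as comparing norms, with no `sqrt()` per element.

```python
import numpy as np

def close_mask(centers: np.ndarray, point: np.ndarray, max_dist: float) -> np.ndarray:
    """Compare squared distances, skipping the per-element sqrt."""
    d2 = ((centers - point) ** 2).sum(axis=1)  # squared Euclidean distances
    return d2 < max_dist ** 2  # same result as norm(...) < max_dist for max_dist >= 0

centers = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
point = np.array([0.0, 0.0])
mask = close_mask(centers, point, max_dist=2.0)

# agrees with the sqrt-based version
assert (mask == (np.linalg.norm(centers - point, axis=1) < 2.0)).all()
```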
I once rewrote a GDI+ point transformation routine in pure C# and got 200x speedup just because the routine was riddled with needless virtual constructors, copying type conversions, and something called CreateInstanceSlow. Ten years later, I gathered a few of these anecdotes and wrote the Geometry for Programmers book (https://www.manning.com/books/geometry-for-programmers) with its main message: when you know the geometry behind your tools, you can either use them efficiently or rewrite them completely.
The "ignore the square root while computing/comparing distances" trick is a great one. That's how I got to the top of the performance leaderboard in my first algorithms class.
I think a big mistake in the article, in a context where performance is the main objective, is that the author uses an array of structs (AoS), rather than a struct of arrays (SoA). An SoA makes it so that the data is ordered contiguously, which is easy to read for the CPU, while an AoS structure interleaves different data (namely the x and y in this case), which is very annoying for the CPU. A CPU likes to read chunks of data (for example 128 bits per read) and to process these with SIMD instructions, executing multiple calculations in one CPU cycle. This is completely broken when using an array of structs.
He uses the same data structure in both the Python and Rust code, so I imagine that he can get an extra 4x speedup at least if he rewrites his code with memory layout in mind.
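As a rough illustration of the layout difference in numpy terms (a structured array stands in for AoS here; this sketch doesn't demonstrate SIMD throughput directly, only that the AoS field views are strided rather than contiguous):

```python
import numpy as np

n = 1_000_000
rng = np.random.default_rng(0)

# AoS: one record per point; x and y are interleaved in memory
aos = np.zeros(n, dtype=[("x", np.float64), ("y", np.float64)])
aos["x"], aos["y"] = rng.random(n), rng.random(n)

# SoA: one contiguous array per component
xs, ys = aos["x"].copy(), aos["y"].copy()

point = np.array([0.5, 0.5])

# Distance computation on the SoA layout is a pair of contiguous passes
d2_soa = (xs - point[0]) ** 2 + (ys - point[1]) ** 2

# The same math on the AoS layout works, but each field access strides
# over interleaved 16-byte records (aos["x"] is a non-contiguous view)
d2_aos = (aos["x"] - point[0]) ** 2 + (aos["y"] - point[1]) ** 2

assert np.allclose(d2_soa, d2_aos)
assert xs.flags["C_CONTIGUOUS"] and not aos["x"].flags["C_CONTIGUOUS"]
```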
Apache Arrow (https://arrow.apache.org/overview/) is built exactly around this idea: it's a library for managing the in-memory representation of large datasets.
Author here:
I agree, that's great perf advice (esp. when you can restructure your code).
I couldn't get into this in the article (it would be too long), but this is a great point and the original library does this in a lot of places.
One problem in our use case is that the actual struct members are pretty big & that we need to group/regroup them a lot.
The fastest approach for us was to do something like in the article for the initial filtering, then build a hashmap of SoAs with the needed data, and do the heavier math on that.
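A hypothetical sketch of that filter-then-regroup pattern (the record layout, keys, and the filtering predicate are all invented for illustration — the real library's data is much bigger):

```python
import numpy as np

# Invented records standing in for the "pretty big" struct members:
# (group_key, x, y)
records = [
    ("a", 0.1, 0.2), ("b", 3.0, 4.0), ("a", 0.3, 0.1), ("b", 2.5, 3.5),
]

# Step 1: cheap initial filter (an invented squared-distance check)
candidates = [r for r in records if r[1] ** 2 + r[2] ** 2 < 25.0]

# Step 2: regroup the survivors into one SoA per key
groups = {}
for key, x, y in candidates:
    cols = groups.setdefault(key, {"x": [], "y": []})
    cols["x"].append(x)
    cols["y"].append(y)
soa = {k: {c: np.asarray(v) for c, v in cols.items()} for k, cols in groups.items()}

# Step 3: heavier vectorized math per group (here: centroids)
centroids = {k: (cols["x"].mean(), cols["y"].mean()) for k, cols in soa.items()}
```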
Modern CPU caches are usually loaded in 64-byte units - much larger than 128 bits. I just ran some tests with a C program on an Intel i5 with both AoS and SoA using a list of 1B points with 32-bit X and Y components. Looping through the list of points and totaling all X and Y components was the same speed with either AoS or SoA.
It's easy to make intuitive guesses about how things are working that seem completely reasonable. But you have to benchmark because modern CPUs are so complex that reasoning and intuition mostly don't work.
Programs used for testing are below. I ran everything twice because my system wasn't always idle, so take the lower of the 2 runs.
[jim@mbp ~]$ sh -x x
+ cat x1.c
#include <stdio.h>
#define NUM 1000000000
struct {
    int x;
    int y;
} p[NUM];
int main() {
    int i, s;
    for (i=0; i<NUM; i++) {
        p[i].x = i;
        p[i].y = i;
    }
    s = 0;
    for (i=0; i<NUM; i++) {
        s += p[i].x + p[i].y;
    }
    printf("s=%d\n", s);
}
+ cc -o x1 x1.c
+ ./x1
s=1808348672
real 0m12.078s
user 0m7.319s
sys 0m4.363s
+ ./x1
s=1808348672
real 0m9.415s
user 0m6.677s
sys 0m2.685s
+ cat x2.c
#include <stdio.h>
#define NUM 1000000000
int x[NUM];
int y[NUM];
int main() {
    int i, s;
    for (i=0; i<NUM; i++) {
        x[i] = i;
        y[i] = i;
    }
    s = 0;
    for (i=0; i<NUM; i++) {
        s += x[i] + y[i];
    }
    printf("s=%d\n", s);
}
+ cc -o x2 x2.c
+ ./x2
s=1808348672
real 0m9.753s
user 0m6.713s
sys 0m2.967s
+ ./x2
s=1808348672
real 0m9.642s
user 0m6.674s
sys 0m2.902s
+ cat x3.c
#include <stdio.h>
#define NUM 1000000000
struct {
    int x;
    int y;
} p[NUM];
int main() {
    int i, s;
    for (i=0; i<NUM; i++) {
        p[i].x = i;
    }
    for (i=0; i<NUM; i++) {
        p[i].y = i;
    }
    s = 0;
    for (i=0; i<NUM; i++) {
        s += p[i].x;
    }
    for (i=0; i<NUM; i++) {
        s += p[i].y;
    }
    printf("s=%d\n", s);
}
}
+ cc -o x3 x3.c
+ ./x3
s=1808348672
real 0m13.844s
user 0m11.095s
sys 0m2.700s
+ ./x3
s=1808348672
real 0m13.686s
user 0m11.038s
sys 0m2.611s
+ cat x4.c
#include <stdio.h>
#define NUM 1000000000
int x[NUM];
int y[NUM];
int main() {
    int i, s;
    for (i=0; i<NUM; i++)
        x[i] = i;
    for (i=0; i<NUM; i++)
        y[i] = i;
    s = 0;
    for (i=0; i<NUM; i++)
        s += x[i];
    for (i=0; i<NUM; i++)
        s += y[i];
    printf("s=%d\n", s);
}
+ cc -o x4 x4.c
+ ./x4
s=1808348672
real 0m13.530s
user 0m10.851s
sys 0m2.633s
+ ./x4
s=1808348672
real 0m13.489s
user 0m10.856s
sys 0m2.603s
This is a great article but there's still a core problem there - why should developers have to choose between accessibility and performance?
So much scientific computing code suffers from core packages being split away from the core language - at what point do we stop and abandon Python for languages which actually make sense? Obviously Julia is the big example here, but its interest, development and ecosystem don't seem to be growing at a serious pace. Given that the syntax is moderately similar and the performance benefits are often 10x, what's stopping people from switching???
Today, there is a Python package for everything. The ecosystem is possibly best in class for having a library available that will do X. You cannot separate the language from the ecosystem. Being better, faster, and stronger means little if I have to write all of my own supporting libraries.
Also, few scientific programmers have any notion of what C or Fortran is under the hood. Most are happy to stand on the shoulders of giants and do work with their specialized datasets. Which for the vast majority of researchers are not big data. If the one-time calculation takes 12 seconds instead of 0.1 seconds, that is not a problem worth optimizing.
Because professional software developers with a background in CS are a minority of people who program today. The learning curve of pointers, memory-allocation, binary operations, programming paradigms, O-Notation and other things you need to understand to efficiently code in something like C is a lot to ask of someone who is for example primarily a sociologist or biologist.
The use case btw. is often also very different. In most of academia, writing code is basically just a fancy mode of documentation for what is basically a glorified calculator. Readability trumps efficiency by a large margin every time.
Everything. Why are there still COBOL programmers? Why is C++ still the de facto native language (also in research)?
but also I don't see any problem there, I think the python + c++/rust idiom is actually pretty nice. I have a billion libs to choose from on either side. Great usability on the py side, and unbeatable performance on the c++ side
One of Julia's Achilles heels is standalone, ahead-of-time compilation. Technically this is already possible [1], [2], but there are quite a few limitations when doing this (e.g. "Hello world" is 150 MB [6]) and it's not an easy or natural process.
The immature AoT capabilities are a huge pain to deal with when writing large code packages or even when trying to make command line applications. Things have to be recompiled each time the Julia runtime is shut down. The current strategy in the community to get around this seems to be "keep the REPL alive as long as possible" [3][4][5], but this isn't a viable option for all use cases.
Until Julia has better AoT compilation support, it's going to be very difficult to develop large scale programs with it. Version 1.9 has better support for caching compiled code, but I really wish there were better options for AoT compiling small, static, standalone executables and libraries.
[1]: https://julialang.github.io/PackageCompiler.jl/dev/
[2]: https://github.com/tshort/StaticCompiler.jl
[3]: https://discourse.julialang.org/t/ann-the-ion-command-line-f...
[4]: https://discourse.julialang.org/t/extremely-slow-execution-t...
[5]: https://discourse.julialang.org/t/extremely-slow-execution-t...
[6]: https://www.reddit.com/r/Julia/comments/ytegfk/size_of_a_hel...
IME, having used Julia quite extensively in academia:
- the development experience is hampered by the slow start time;
- the ecosystem is quite brittle;
- the promised performance is quite hard to actually reach, and profiling only gets you so far;
- the ecosystem is pretty young, and it shows (lack of docs, small community, ...)
> what's stopping people from switching???
All of the above: inertia, perfect is the enemy of good enough, the alternatives are far from the Python ecosystem & community, and performance is often not a showstopper.
I don't know whether this sentiment is just a byproduct of CS education, but for some reason people equate a programming language with the compute that goes on under the hood. Like if you write in Python, you are locked into the specific non optimized way of computing that Python does.
It's all machine code under the hood. Everything else on top is essentially a description of more and more complex patterns of that code. So it's a no-brainer that a language that lets you describe those complex but repeating patterns in the most direct way is the most popular. When you use Python, you are effectively using a framework on top of C to describe what you need, and then if you want to do something specialized for performance, you go back to the core fundamentals and write it in C.
A vectorized implementation of find_close_polygons wouldn't be very complex or hard to maintain at all, but the authors would also have to ditch their OOP class based design, and that's the real issue here. The object model doesn't lend itself to performant, vectorized numpy code.
Yeah but the Python code is so bad that it's easy to get a 10x speedup using only numpy, as well. The current code essentially does:
import numpy as np
from typing import List

n_sides = 30
n_polygons = 10000

class Polygon:
    def __init__(self, x, y):
        self.x = x
        self.y = y
        self.center = np.array([self.x, self.y]).mean(axis=1)

def find_close_polygons(
    polygon_subset: List[Polygon], point: np.array, max_dist: float
) -> List[Polygon]:
    close_polygons = []
    for poly in polygon_subset:
        if np.linalg.norm(poly.center - point) < max_dist:
            close_polygons.append(poly)
    return close_polygons

polygons = [Polygon(*np.random.rand(2, n_sides)) for _ in range(n_polygons)]
point = np.array([0, 0])
max_dist = 0.5

%timeit find_close_polygons(polygons, point, max_dist)
(I've made up number of sides and number of polygons to get to the same order of magnitude of runtime; also I've pre-computed centers, as they are cached anyway in their code), which on my machine takes about 40ms to run. If we just change the function to:
def find_close_polygons(
    polygon_subset: List[Polygon], point: np.array, max_dist: float
) -> List[Polygon]:
    centers = np.array([polygon.center for polygon in polygon_subset])
    mask = np.linalg.norm(centers - point[None], axis=1) < max_dist
    return [
        polygon
        for polygon, is_pass in zip(polygon_subset, mask)
        if is_pass
    ]
then the same computation takes 4ms on my machine.
Doing a Python loop of numpy operations is a _bad_ idea... The new code hardly even takes more space than the original one.
(as someone else mentioned in the comments, you can also directly use the sum of the squares rather than `np.linalg.norm` to avoid taking square roots and save a few microseconds more, but well, we're not in that level of optimization here)
Python's for loop implementation is also slow. You can use built-in utils like map(), which are "native" and can be a lot faster than a for loop with a push:
https://levelup.gitconnected.com/python-performance-showdown...
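As a quick, hypothetical micro-benchmark sketch of that claim (results vary a lot by CPython version and workload, so measure your own code before trusting any of these numbers — `map` tends to win mainly when the callable itself is native):

```python
from timeit import timeit

data = list(range(10_000))

def with_loop():
    out = []
    for v in data:
        out.append(v * 2)
    return out

def with_map():
    # no Python-level loop body; the callable is a native method
    return list(map((2).__mul__, data))

def with_comprehension():
    return [v * 2 for v in data]

# all three produce the same result
assert with_loop() == with_map() == with_comprehension()

# timings vary by interpreter/version — measure, don't assume
for f in (with_loop, with_map, with_comprehension):
    print(f.__name__, timeit(f, number=200))
```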
Yeah what's wrong with that? I think this sounds amazing. It gives you all the fast prototyping and simplicity of Python, but once you hit that bottleneck all you have to do is bring in a ringer to replace key components with a faster language.
No need to use Golang or Rust from the start, no need for those resources until you absolutely need the speed improvement. Sounds like a dream to a lot of people who find it much easier to develop in Python.
I had a similar problem, when I was working as a PhD student a few years ago, where I needed to match the voxel representation of a 3D printer with the tetrahedral mesh of our rendering application.
My first attempt in Python was both prohibitively slow and more complicated than necessary, because I tried to use vectorized numpy, where possible.
Since this was only a small standalone script, I rewrote it in Julia in a day. The end result was ca. 100x faster and the code a lot cleaner, because I could just implement the core logic for one tetrahedron and then use Julia's broadcast to apply it to the array of tetrahedrons.
Anyway, Julia's long startup time often prohibits it from being used inside other languages (even though the Python/Julia interoperability is good). By contrast, the Rust/Python interop presented here seems to be pretty great. Another reason I should finally invest the time to learn Rust.
Long startup time is relative. I believe it's much lower now than a couple of versions ago. 0.15s or so? Interop between python and rust will also take time.
Julia 1.9 is fast. And you can use https://github.com/Suzhou-Tongyuan/jnumpy to write Python extensions in Julia now. So I think after the 1.9 release Julia will be much more usable.
Isn't this the version referenced on the GitHub repo [0], which speeds up 6x instead of 101x?
There's also a "v1.5" version which is 6x faster, and uses "vectorizing" (doing more of the work directly in numpy). This version is much harder to optimize further.
This is the major reason I don't really buy into things like JITs solving all performance problems (as long as you carefully write only and exactly the subset of the language they work well with) or NumPy not being affected by Python being slow. There's more code like this in the world than I think people realize.
Having to write in a subset of a language in order for it to perform decently is a big deal. Having no feedback given to the programmer when you deviate from the fast path makes it even harder to learn what the fast path is. The result is not that you get the ease of Python and the speed of C without having to understand much; the result is that you have to be a fairly deep expert in Python and understand the C bindings intimately and learn how to avoid doing what is the natural thing in Python, the Python covered in all the tutorials, or you end up writing your code to run at native Python speeds without even realizing it.
It's a feasible amount of knowledge to have, it's not like it's completely insane, but it's still rather a lot.
My career just brushed this world and I'm glad I bounced off of it. It would drive me insane to walk through this landmine field every day, and then worse, have to try to guide others through it all the while they are pointing at all the "common practices" that are also written by people utterly oblivious to all this.
I feel a lot of the "Python perf" thing is an inferiority complex. CPython is getting faster all the time, and (obviously using libraries like numpy and others that hook into compiled code) I don't think I've ever seen it become a business bottleneck. If it's ever a scaling problem, then you hire lower-level language devs; it's a good problem to have.
Python is much, much easier to learn, and Rust is notoriously difficult. This obviously feeds into the inferiority complex. But -- while I do know there's a number of applications where perf is crucial -- I think it's well worth doing an ego check before moving from what's a lower friction path and gives you access to myriad developers, including ones trained in very hard disciplines.
Also: does it ever make sense to write something like a CRUD+ backend in Rust? Maybe 100X Rustaceans can do it with one hand tied behind their backs; but imagine what these ubermenschen could be achieving in Python?
We're working as an internal dev team for a moderately large company and the challenge is that very few people have to maintain a ton of business automation. We love our Python+Django stack as it gives us a ton of productivity, but as our business and data grows, some things are becoming quite slow.
For now, we're getting away with Python-only optimizations and relying on CPython improvements. Still, some processes take a few minutes and a lot of that is spent just because Python is quite slow. In that sense, yes, it happens for CRUD+ applications (which I consider our CRM+ERP to be).
I think you might be surprised. I would say Rust isn't notoriously difficult per se, it's just harder to please the compiler. But that's still viewing the compiler as an adversary when it's more like an assistant, so the analogy breaks down. You don't have to be a 100x engineer to use Rust. In fact, quite the opposite. Rust gives engineers a lot more guardrails to prevent what would be runtime errors in other languages.
I think for a lot of people, like myself, writing Rust is just as easy as writing Python (actually it's way easier IMO). So when people write Python it's sort of like "uh, why would you choose the slower tool?".
Obviously the answer is "because I'm faster with Python" but then part of my brain is "well get good idk".
If writing Tcl extensions in C during the .com wave 23 years ago taught me anything, it was that glue languages are great, and that I don't want to use any that doesn't come with either an AOT or JIT compiler in the box, other than for OS scripting tasks.
> Also: does it ever make sense to write something like a CRUD+ backend in Rust?
Well... given 2 frameworks equally pleasant to work with, why not use the one with the best performance? (Whatever performance means... less CPU? Less memory? Better I/O?)
As a Django user, working on problems where Django shines, I have yet to see such a solution in Rust, but that doesn't mean it won't happen one day.
Take from Rust algebraic types, ahead-of-time compilation and strong types, functional features, and a borrow/escape checker that automatically turns shared data into Rc or Arc as necessary, instead of tormenting me into rewriting performance-irrelevant code;
Take from Python the simple syntax, default pass by reference of all non-numeric types, simplified string handling, unified slice and array syntax - and any other simplifying feature possible.
... and give me a fast, safe and powerful language that gets out of my way while maximizing the power of the compiler to prevent bugs. Golang was a commendable attempt, but they made '70s design decisions that condemned the language: mandatory garbage collection, nullables, (void*) masquerading as interface{} casts, under-powered compiler etc.
Agreed it was well written, but kinda pointless, since they could have "solved" the problem using the existing tools in a couple lines of code without any new deps. All that content and profiling, and they missed the fact that they were using numpy wrong.
I wonder if being able to quickly retrieve a numpy array of the polygon centers would make an equivalent difference. Then you could retrieve the centers as an array and just use numpy operations for the closest-polygon query:
```
centers = get_centers(polygons)  # M x 3 array
close_idx = np.where(
    np.linalg.norm(centers - point, axis=1) < max_dist)[0]
close_polygons = polygons[close_idx,:]
```
That's one reason I prefer to use arrays for polygons, rather than abstract them into a Python object. Fundamentally geometries are sequences of points, and with some zero-padding to account for irregular point counts, you can still keep them in a nice, efficient array representation.
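A minimal sketch of that zero-padded layout (the shapes and polygons are invented for illustration): the padding contributes zeros to the sums, so per-polygon centers fall out of one vectorized pass as long as you also keep the real vertex counts.

```python
import numpy as np

# Hypothetical ragged input: polygons with different vertex counts
polys = [
    np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]),             # triangle
    np.array([[0.0, 0.0], [2.0, 0.0], [2.0, 2.0], [0.0, 2.0]]), # quad
]

max_pts = max(len(p) for p in polys)
packed = np.zeros((len(polys), max_pts, 2))  # zero-padded block
counts = np.array([len(p) for p in polys])   # real vertex counts
for i, p in enumerate(polys):
    packed[i, : len(p)] = p

# Vectorized centers over all polygons at once; padding adds only zeros
centers = packed.sum(axis=1) / counts[:, None]
```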
Where does this nonsense come from? No. It isn't. It's just a stupid fashion. Something that should be discouraged, not encouraged by trying to make it work when it's obviously broken.
As someone who does work with researchers who do use Python a lot, I see the everyday painful experiences of people who use it. And this pain doesn't need to be there. It's just masochism. And the only real reason is that they don't know any better. The only other thing they know is Matlab, and that's even worse.
Python is just a bad language. Popular, but awful. Ironically, while researchers are supposed to be on the forefront of discovery and technology... well, they aren't. Industry outpaced research. So much so that today there are government programs to onboard researchers into more automated and more automatically verified ways to do research. And we aren't talking about making an elite force here. These programs are meant for people in research who copy data from Excel sheets one data point at a time into another spreadsheet. It's that kind of bad.
My wife happened to work in such a government center, and that's how I know about what's going on inside these programs. And it's very sad that decisions about the preferred tools for research automation are made by people who, unlike most of their peers, had some exposure to what happens in the industry, but had no deeper understanding of the reasons any particular technology ended up in any particular niche, nor any independent ability to assess the capabilities of any particular tech. It's really sad what's going on there.
>Python is just a bad language. Popular, but awful.
You couldn't be more wrong. Python is the language going forward. There is a reason why bleeding edge ML stuff is done through Python, as well as it being the backend to several very popular web platforms; it is the second most used language on GitHub behind JS solely because JS is hard-tied to the web.
I have a feeling the hate for Python just comes from paradigms that are taught in extremely poor CS curricula in schools. If you think that Python is bad because of dynamic typing, you haven't been paying attention to the direction compute is going.
Rust is great but isn’t the core problem here using the wrong algorithm? It looks like this is ideally suited for a quad tree instead of a naive for loop. I would expect that to pulverise any current benchmark.
I guess you also need to take into account the time to create the quad tree from an unsorted 'polygon soup' first, and in terms of coding effort, a brute-force conversion from Python to a compiled tight loop over unsorted arrays provides a lot of bang for the buck (and a speedup of 100x for relatively little effort might be 'good enough' for quite a while until the input data grows big enough to require the next optimization effort).
(e.g. it doesn't need to be "as fast as possible", just fast enough to no longer be a workflow bottleneck)
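To make the "spatial index vs. brute force" idea concrete, here is a minimal uniform-grid index in Python — a simpler cousin of a quad tree with the same payoff of only scanning nearby cells (all names invented; a real implementation would use scipy's kd-tree or an R-tree library):

```python
import numpy as np

class GridIndex:
    """Bucket points by grid cell; queries only scan cells near the query point."""

    def __init__(self, points: np.ndarray, cell: float):
        self.points = points
        self.cell = cell
        self.buckets: dict = {}
        for i, (x, y) in enumerate(points):
            key = (int(x // cell), int(y // cell))
            self.buckets.setdefault(key, []).append(i)

    def query(self, point, max_dist: float) -> list:
        cx, cy = int(point[0] // self.cell), int(point[1] // self.cell)
        r = int(max_dist // self.cell) + 1  # cell radius that covers max_dist
        hits = []
        for gx in range(cx - r, cx + r + 1):
            for gy in range(cy - r, cy + r + 1):
                for i in self.buckets.get((gx, gy), ()):
                    dx, dy = self.points[i] - point
                    if dx * dx + dy * dy < max_dist ** 2:  # squared-distance trick
                        hits.append(i)
        return hits

rng = np.random.default_rng(1)
pts = rng.random((10_000, 2)) * 100
idx = GridIndex(pts, cell=5.0)
point = np.array([50.0, 50.0])
found = sorted(idx.query(point, max_dist=5.0))

# sanity check: the index agrees with exhaustive search
brute = sorted(np.flatnonzero(((pts - point) ** 2).sum(axis=1) < 25.0).tolist())
assert found == brute
```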
> The library was already using numpy for a lot of its calculations, so why should we expect Rust to be better?
I literally clicked in to read the article to see if they'd mention this:) But... unless I missed it, there wasn't really an answer? I thought numpy does do the heavy lifting in native code, so why is this faster? Does this version just push more of the logic into native code than numpy did?
Numpy is fast when the code is vectorized. The code they are benchmarking against was not vectorized. They wanted to calculate the distances of n points against a given point and find out which points are closer than a threshold (max_dist). Instead of vectorizing the whole operation, the python code was just calling numpy in a loop to find the distance of two points.
Just that small change already gives 10x faster performance without ever leaving python/numpy land.
They did have a choice quote a bit before the blurb you quoted. The real world problem is sufficiently more complex/different enough that vectorizing it would be a pain.
>It’s worth noting that converting parts of / everything to vectorized numpy might be possible for this toy library, but will be nearly impossible for the real library while making the code much less readable and modifiable...
The article mentioned that there are some gains from using vectorised numpy methods, i.e. spending more time in numpy code. I would be interested in whether the List[Polygon] could be converted into two long arrays of all xs and all ys with indices into the starts (essentially a dense representation of a sparse array) and then the core function rewritten for Numba since it could now not use any Python objects. This would break the interface of course, but may get within striking distance of the Rust implementation.
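A hypothetical sketch of that dense layout (the flat coordinate arrays and start offsets are invented for illustration; the numba decorator is optional and the code falls back to pure Python if numba isn't installed):

```python
import numpy as np

try:
    from numba import njit  # JIT the kernel if numba is available
except ImportError:          # pure-Python fallback keeps the sketch runnable
    njit = lambda f: f

# Dense representation: all vertices in two flat arrays plus start offsets
xs = np.array([0.0, 1.0, 0.0,  0.0, 2.0, 2.0, 0.0])  # poly 0: 3 pts, poly 1: 4 pts
ys = np.array([0.0, 0.0, 1.0,  0.0, 0.0, 2.0, 2.0])
starts = np.array([0, 3])                  # index where each polygon begins
counts = np.diff(np.append(starts, len(xs)))

# Per-polygon centers without touching any Python objects
cx = np.add.reduceat(xs, starts) / counts
cy = np.add.reduceat(ys, starts) / counts

@njit
def close_mask(cx, cy, px, py, max_dist):
    # plain loops over flat arrays: exactly what numba compiles well
    out = np.empty(len(cx), dtype=np.bool_)
    for i in range(len(cx)):
        dx, dy = cx[i] - px, cy[i] - py
        out[i] = dx * dx + dy * dy < max_dist * max_dist
    return out

mask = close_mask(cx, cy, 0.0, 0.0, 1.2)
```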
The slowness comes from the interaction of numpy and a Python object "Polygon", which is not numpy. I suspect that a sufficiently clever coder could have optimized the result without resorting to Rust, but at the cost of a substantial increase in the complexity of the codebase. The proposed approach keeps the Python code simple (and moves the complexity into having another language to deal with).
People can write slow code in any language ;-) We've had to fire contractors who "wrote" dataframe code that was not vectorized in practice despite repeated requests to do so. Same thing for slow code in CUDA.
From a maintenance view, I much prefer folks write vectorized data frames vs numpy or low level bindings, but that comes from having lived with the alternative for a lot longer. All of our exceptions are pretty much slotted for deletion. (Our fast path is single or multi-GPU dataframes in python.) Here's to hoping that one day we'll have dependent types in mypy!
I am really curious whether there is an important reason for not trying this performance improvement with Cython first. Can someone comfortable with Cython explain the pros and cons of doing this optimization with Cython?
Is Apple still hybrid only, no remote? Any links to some of their careers pages on this?
Piqued my curiosity as my background almost lines up, and I'm interested in that kind of role... But I live near one of their offices only hiring for skill sets I don't really have.
It's not really Python that was sped up though, it was an application written in python augmented with a bit of optimized rust code.
This sort of hybrid is super common, you typically spend 90%+ of your time in computationally intensive problems in a very small subset of your code, typically the innermost loops. Optimizing those will have very good pay-off.
Traditionally we'd do this with high level stuff in one language and then assembly for the performance critical parts; these days it is more likely a combination of a scripting language for the high level part and a compiled language for the low level parts (C, Rust, whatever). Java and such are less suitable for such optimization purposes, both because they come with a huge runtime and because they are hard to interface to other languages unless they happen to use the same underlying VM, but then there usually isn't much performance gain.
Another nice way to optimize computationally intensive code is by finding out if the code is suitable for adaptation to the GPU, which can give you many orders of magnitude speed improvement in some cases.
Very cool! I can see myself using this soon actually :) On top of the code speed-up, this is a good problem for 2D data structures that answer this type of "find objects within radius" query.
For this particular code, vectorization and some acceleration library, such as JAX, may be a better path to optimization. Otherwise an excellent article!
90% of the bottleneck, not 90% of their whole application. The author says that rewriting everything in Rust would have taken months, so the whole application must be huge.
"It is big and complex and very business critical and highly algorithmic, so that would take ~months of work, ..."
That's doing spatial data processing by exhaustive search, which is inherently slow.
There are better algorithms. If the number of items to be searched is large, the spatial indices of MySQL could help.
The original code already looks bad because it builds a new list of objects instead of a mask or something similar. That could also be done with a list comprehension. And it certainly could be vectorized or parallelized.
    centers = np.array([p.center for p in ps])
    norm(centers - point, axis=1)
They were just using numpy wrong. You can be slow in any language if you use the tools wrong.
shakow|2 years ago
- the development experience is hampered by the slow start time;
- the ecosystem is quite brittle;
- the promised performance is quite hard to actually reach; profiling only gets you so far;
- the ecosystem is pretty young, and it shows (lack of docs, small community, ...)
> what's stopping people from switching???
All of the above, plus inertia; perfect is the enemy of good enough; the alternatives are far away from Python's ecosystem & community; and performance is often not a showstopper.
ActorNightly|2 years ago
It's all machine code under the hood. Everything else on top is essentially a description of more and more complex patterns of that code. So it's a no-brainer that a language that lets you describe those complex but repeating patterns in the most direct way is the most popular. When you use Python, you are effectively using a framework on top of C to describe what you need, and then, if you want to do something specialized for performance, you go back to the core fundamentals and write it in C.
spi|2 years ago
Doing a Python loop of numpy operations is a _bad_ idea... The new code hardly even takes more space than the original one.
(as someone else mentioned in the comments, you can also directly use the sum of the squares rather than `np.linalg.norm` to avoid taking square roots and save a few microseconds more, but well, we're not in that level of optimization here)
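The squared-distance trick is easy to show concretely. Here is a minimal sketch (the `centers`, `point`, and `max_dist` values are made-up stand-ins for the article's data): comparing squared distances against the squared radius gives identical results with no `sqrt` at all.

```python
import numpy as np

# Hypothetical stand-ins for the article's polygon centers and query point.
centers = np.array([[0.0, 0.0], [3.0, 4.0], [10.0, 10.0]])
point = np.array([0.0, 0.0])
max_dist = 6.0

# Row-wise squared distances via a dot product of each row with itself;
# comparing against max_dist**2 avoids taking any square roots.
diff = centers - point
sq_dist = np.einsum("ij,ij->i", diff, diff)
close = sq_dist < max_dist ** 2
print(close.tolist())  # → [True, True, False]
```

Since `sqrt` is monotonic, the boolean mask is exactly the same as the one `np.linalg.norm(...) < max_dist` would produce.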
winrid|2 years ago
https://levelup.gitconnected.com/python-performance-showdown...
INTPenis|2 years ago
No need to use Golang or Rust from the start, no need for those resources until you absolutely need the speed improvement. Sounds like a dream to a lot of people who find it much easier to develop in Python.
est|2 years ago
You don't have to move all the computation; just the `for` loops will help a lot.
majoe|2 years ago
My first attempt in Python was both prohibitively slow and more complicated than necessary, because I tried to use vectorized numpy, where possible.
Since this was only a small standalone script, I rewrote it in Julia in a day. The end result was ca. 100x faster and the code a lot cleaner, because I could just implement the core logic for one tetrahedron and then use Julia's broadcast to apply it to the array of tetrahedrons.
Anyway, Julia's long startup time often prohibits it from being used inside other languages (even though the Python/Julia interoperability is good). By contrast, the Rust/Python interop presented here seems to be pretty great. Another reason I should finally invest the time to learn Rust.
brahbrah|2 years ago
Instead of:
for p in ps:
    norm(p.center - point)
You should do:
centers = np.array([p.center for p in ps])
norm(centers - point, axis=1)
You'll get the same speedup in two lines, without introducing a new dependency.
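Put together as a runnable sketch (with `SimpleNamespace` objects standing in for the article's polygon objects, since only the `.center` attribute matters here), the two approaches produce identical results:

```python
import numpy as np
from types import SimpleNamespace

# Mock objects standing in for the article's polygons, each with a .center.
ps = [SimpleNamespace(center=np.array([float(i), 0.0])) for i in range(5)]
point = np.array([0.0, 0.0])

# Slow: one tiny numpy call per Python-level loop iteration.
slow = [np.linalg.norm(p.center - point) for p in ps]

# Fast: gather centers once, then one vectorized norm over the whole array.
centers = np.array([p.center for p in ps])
fast = np.linalg.norm(centers - point, axis=1)

print(np.allclose(slow, fast))  # → True
```

The loop version pays Python interpreter and numpy dispatch overhead per element; the vectorized version pays it once for the whole array.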
jerf|2 years ago
Having to write in a subset of a language in order for it to perform decently is a big deal. Having no feedback given to the programmer when you deviate from the fast path makes it even harder to learn what the fast path is. The result is not that you get the ease of Python and the speed of C without having to understand much; the result is that you have to be a fairly deep expert in Python and understand the C bindings intimately and learn how to avoid doing what is the natural thing in Python, the Python covered in all the tutorials, or you end up writing your code to run at native Python speeds without even realizing it.
It's a feasible amount of knowledge to have, it's not like it's completely insane, but it's still rather a lot.
My career just brushed this world and I'm glad I bounced off of it. It would drive me insane to walk through this landmine field every day, and then worse, have to try to guide others through it all the while they are pointing at all the "common practices" that are also written by people utterly oblivious to all this.
thanatropism|2 years ago
Python is much, much easier to learn, and Rust is notoriously difficult. This obviously feeds into the inferiority complex. But -- while I do know there's a number of applications where perf is crucial -- I think it's well worth doing an ego check before moving from what's a lower friction path and gives you access to myriad developers, including ones trained in very hard disciplines.
Also: does it ever make sense to write something like a CRUD+ backend in Rust? Maybe 100X Rustaceans can do it with one hand tied behind their backs; but imagine what these ubermenschen could be achieving in Python?
karamanolev|2 years ago
For now, we're getting away with Python-only optimizations and relying on cPython improvements. Still, some processes take a few minutes and a lot of that is spent just because Python is quite slow. In that sense, yes, it happens for CRUD+ applications (which I consider our CRM+ERP to be).
insanitybit|2 years ago
Obviously the answer is "because I'm faster with Python" but then part of my brain is "well get good idk".
JodieBenitez|2 years ago
Well... given two frameworks equally pleasant to work with, why not use the one with the best performance? (Whatever performance means... less CPU? Less memory? Better I/O?)
As a Django user, working on problems where Django shines, I have yet to see such a solution in Rust, but that doesn't mean it won't happen one day.
karussell|2 years ago
Is the problem the Oracle involvement? Or is it not that fast as advertised or problems with the ecosystem (C libraries)?
thisgoodlife|2 years ago
“At this point, the Python runtime is made available for experimentation and curious end-users. “
https://www.graalvm.org/latest/reference-manual/python/
cornholio|2 years ago
Take from Rust algebraic types, ahead-of-time compilation, strong types, functional features, and a borrow/escape checker that automatically turns shared data into Rc or Arc as necessary, instead of tormenting me to rewrite performance-irrelevant code;
Take from Python the simple syntax, default pass by reference of all non-numeric types, simplified string handling, unified slice and array syntax - and any other simplifying feature possible.
... and give me a fast, safe and powerful language that gets out of my way while maximizing the power of the compiler to prevent bugs. Golang was a commendable attempt, but they made '70s design decisions that condemned the language: mandatory garbage collection, nullables, (void*) masquerading as interface{} casts, under-powered compiler etc.
za3faran|2 years ago
On a side note, Python is strictly pass-by-value. For non-primitives, their references are passed by value.
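A quick illustration of that calling convention (a generic sketch, nothing library-specific): rebinding the parameter inside a function has no effect on the caller, because the reference itself was copied; mutating the object that the reference points to is visible to the caller.

```python
def rebind(lst):
    # Rebinds the local name only; the caller's reference was passed by value.
    lst = [99]

def mutate(lst):
    # Mutates the shared object that both references point to.
    lst.append(99)

xs = [1, 2, 3]
rebind(xs)
print(xs)  # → [1, 2, 3]
mutate(xs)
print(xs)  # → [1, 2, 3, 99]
```

This is why the semantics are often called "call by object reference": the reference is copied, the object is not.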
saeranv|2 years ago
```
centers = get_centers(polygons)  # M x 3 array
close_idx = np.where(
    np.linalg.norm(centers - point, axis=1) < max_dist)[0]
close_polygons = polygons[close_idx, :]
```
That's one reason I prefer to use arrays for polygons, rather than abstracting them into a Python object. Fundamentally, geometries are sequences of points, and with some zero-padding to account for irregular point counts, you can still keep them in a nice, efficient array representation.
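The zero-padded layout might look like this (a sketch with hypothetical shapes; here two 2D polygons with different vertex counts packed into one `M x K x 2` array, with a separate count array so padding rows don't distort derived quantities):

```python
import numpy as np

# Two polygons with different vertex counts, zero-padded to the same length.
tri  = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
quad = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
max_pts = max(len(tri), len(quad))

polygons = np.zeros((2, max_pts, 2))
polygons[0, :len(tri)] = tri
polygons[1, :len(quad)] = quad
counts = np.array([len(tri), len(quad)])

# Vectorized centroids: the padding rows are zero, so they don't affect
# the sum; dividing by the true vertex count gives the right average.
centers = polygons.sum(axis=1) / counts[:, None]
print(centers.tolist())
```

All polygons now live in one contiguous buffer, so per-polygon quantities like the centers above come out of a single vectorized expression.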
crabbone|2 years ago
Where does this nonsense come from? No. It isn't. It's just a stupid fashion. Something that should be discouraged, not encouraged by trying to make it work when it's obviously broken.
As someone who does work with researchers who do use Python a lot, I see the everyday painful experiences of people who use it. And this pain doesn't need to be there. It's just masochism. And the only real reason is that they don't know any better. The only other thing they know is Matlab, and that's even worse.
Python is just a bad language. Popular, but awful. Ironically, while researchers are supposed to be at the forefront of discovery and technology... well, they aren't. Industry outpaced research. So much so that today there are government programs to onboard researchers into more automated and more automatically verified ways of doing research. And we aren't talking about building an elite force here. These programs are meant for people in research who copy data from Excel sheets one data point at a time into another spreadsheet. It's that kind of bad.
My wife happened to work in such a government center, and that's how I know about what's going on inside these programs. And it's very sad that decisions about the preferred tools for research automation are made by people who, unlike most of their peers, had some exposure to what happens in the industry, but had no deeper understanding of the reasons any particular technology ended up in any particular niche, nor any independent ability to assess the capabilities of any particular tech. It's really sad what's going on there.
ActorNightly|2 years ago
You couldn't be more wrong. Python is the language going forward. There is a reason bleeding-edge ML work is done in Python; it's also the backend of several very popular web platforms, and the second most-used language on GitHub behind JS, which leads solely because JS is hard-tied to the web.
I have a feeling the hate for Python just comes from paradigms taught in extremely poor CS curricula. If you think Python is bad because of dynamic typing, you haven't been paying attention to the direction compute is going.
wdroz|2 years ago
[0] -- https://github.com/rayon-rs/rayon
flohofwoe|2 years ago
(e.g. it doesn't need to be "as fast as possible", just fast enough to no longer be a workflow bottleneck)
yjftsjthsd-h|2 years ago
I literally clicked in to read the article to see if they'd mention this:) But... unless I missed it, there wasn't really an answer? I thought numpy does do the heavy lifting in native code, so why is this faster? Does this version just push more of the logic into native code than numpy did?
alex_smart|2 years ago
Just that small change already gives 10x faster performance without ever leaving python/numpy land.
fbdab103|2 years ago
>It’s worth noting that converting parts of / everything to vectorized numpy might be possible for this toy library, but will be nearly impossible for the real library while making the code much less readable and modifiable...
lmeyerov|2 years ago
From a maintenance view, I much prefer folks write vectorized data frames vs numpy or low level bindings, but that comes from having lived with the alternative for a lot longer. All of our exceptions are pretty much slotted for deletion. (Our fast path is single or multi-GPU dataframes in python.) Here's to hoping that one day we'll have dependent types in mypy!
gjourdvhiokhf|2 years ago
Piqued my curiosity as my background almost lines up, and I'm interested in that kind of role... But I live near one of their offices only hiring for skill sets I don't really have.
jacquesm|2 years ago
This sort of hybrid is super common, you typically spend 90%+ of your time in computationally intensive problems in a very small subset of your code, typically the innermost loops. Optimizing those will have very good pay-off.
Traditionally we'd do this with the high-level stuff in one language and assembly for the performance-critical parts; these days it's more likely a combination of a scripting language for the high-level part and a compiled language for the low-level parts (C, Rust, whatever). Java and the like are less suitable for such optimization, both because they come with a huge runtime and because they're hard to interface with other languages unless those happen to use the same underlying VM; but then there usually isn't much performance gain.
Another nice way to optimize computationally intensive code is by finding out if the code is suitable for adaptation to the GPU, which can give you many orders of magnitude speed improvement in some cases.
orangepurple|2 years ago
https://news.ycombinator.com/item?id=34207974
sandGorgon|2 years ago
I wasn't able to see the numba comparison. Anyone know how much worse it was?
masklinn|2 years ago
If you’re going to benchmark scripts or executables, use hyperfine.
smnrchrds|2 years ago
"It is big and complex and very business critical and highly algorithmic, so that would take ~months of work, ..."
Animats|2 years ago
That's doing spatial data processing by exhaustive search, which is inherently slow. There are better algorithms. If the number of items to be searched is large, the spatial indices of MySQL could help.
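To illustrate the idea without any database (a stdlib-only sketch of one such better algorithm, a uniform-grid spatial hash, not the MySQL index mentioned above): bucket points by grid cell once, then a radius query only inspects the handful of cells overlapping the query circle instead of every point.

```python
from collections import defaultdict

def build_grid(points, cell):
    # Map each point index into the grid cell containing it.
    grid = defaultdict(list)
    for i, (x, y) in enumerate(points):
        grid[(int(x // cell), int(y // cell))].append(i)
    return grid

def query(points, grid, cell, p, radius):
    # Visit only the cells overlapping the query circle's bounding box.
    px, py = p
    r2 = radius * radius  # compare squared distances: no sqrt needed
    cx0, cx1 = int((px - radius) // cell), int((px + radius) // cell)
    cy0, cy1 = int((py - radius) // cell), int((py + radius) // cell)
    hits = []
    for cx in range(cx0, cx1 + 1):
        for cy in range(cy0, cy1 + 1):
            for i in grid.get((cx, cy), ()):
                x, y = points[i]
                if (x - px) ** 2 + (y - py) ** 2 < r2:
                    hits.append(i)
    return sorted(hits)

points = [(0.0, 0.0), (1.0, 1.0), (50.0, 50.0)]
grid = build_grid(points, cell=5.0)
print(query(points, grid, 5.0, (0.5, 0.5), 2.0))  # → [0, 1]
```

For roughly uniform data this turns each query from O(n) into time proportional to the number of points near the query, which is the same win a k-d tree or a database spatial index buys you.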
akasakahakada|2 years ago
You need a more experienced Python engineer.
tus666|2 years ago
Hasn't he heard of ctypes? You've been able to wrap C structs as Python objects since forever.
UncleEntity|2 years ago
I assume so; I haven't really messed with numpy for anything, but I can't imagine it wouldn't work that way.