top | item 36936017


wmwmwm | 2 years ago

Historically I’ve written several services that load up some big datastructure (10s or 100s of GB), then expose an HTTP API on top of it. Every time I’ve done a quick implementation in Python of a service that then became popular (within a firm, so 100s or 1000s of clients), I’ve ended up having to rewrite it in Java so I can throw more threads at servicing the requests (often CPU heavy). I may have missed something, but I couldn’t figure out how to get multi-threaded performance out of Python — of course no-GIL looks interesting for this!



iknownothow|2 years ago

I would consider the following optimizations first, before attempting to rewrite the HTTP API, since you already did the hard part:

1. For multiple processes, use `gunicorn` [1]. It runs your app across multiple processes without you having to touch your code much. It's the same as having n instances of the same backend app, where n is the number of CPU cores you're willing to throw at it. One backend process per core, full isolation.

2. For multiple threads, use `gunicorn` + `gevent` workers [2]. Provides multiprocessing + multithreading functionality out of the box if your workload is IO-intensive. It's not perfect, but it works very well in some situations.

3. Lastly, if CPU is where you have a bottleneck, that means you have some memory to spare (even if it's not much). Throw an LRU cache (`functools.lru_cache` or cachetools [3]) over functions that return the same result or that do expensive I/O.

[1]: https://www.joelsleppy.com/blog/gunicorn-sync-workers/

[2]: https://www.joelsleppy.com/blog/gunicorn-async-workers-with-...

[3]: https://pypi.org/project/cachetools/
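A minimal sketch of points 1 and 3 combined (the app and function names here are made up for illustration): a plain WSGI app with an LRU cache over the expensive function, which you'd run as e.g. `gunicorn -w 4 app:application`, one worker process per core.

```python
from functools import lru_cache

@lru_cache(maxsize=4096)
def expensive_lookup(key):
    # stand-in for a CPU-heavy traversal of the big in-memory structure;
    # repeated keys are served from the per-process cache
    return sum(ord(c) for c in key)

def application(environ, start_response):
    # gunicorn calls this in each worker process independently
    body = str(expensive_lookup(environ.get("PATH_INFO", "/"))).encode()
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [body]
```

Note the cache is per-process: with 4 gunicorn workers you get 4 independent caches.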

danpalmer|2 years ago

These don't really apply to the parent commenter's scenario.

1) gunicorn or any solution with multiple processes is going to just multiply the RAM usage. Using 10-100GB of RAM per effective thread makes this sort of problem very RAM bound, to the point that it can be hard to find hardware or VM support.

2) This isn't I/O bound.

3) If your service is fundamentally just looking up data in a huge in-memory data store, adding LRU caching around that is unlikely to make much of a difference because you're a) still doing a lookup in memory, just for the cache rather than the real data, and b) you're still subject to the GIL for those cache lookups.

I've also written services like this, we only loaded ~5GB of data, but it was sufficient to be difficult to manage in a few ways like this. The GIL-ectomy will probably have a significant impact on these sorts of use cases.

xmaayy|2 years ago

> 1. For multiples processes use `gunicorn`

This will load up multiple processes, like you say, but OP loads a large dataset, and gunicorn would copy that dataset into each process. I have never figured out shared memory with gunicorn.

nwallin|2 years ago

> I may have missed something but I couldn’t figure out how to get the multi-threaded performance out of Python

Multiprocessing. The answer is to use the Python multiprocessing module, or to spin up multiple processes behind WSGI or whatever.

> Historically I’ve written several services that load up some big datastructure (10s or 100s of GB), then expose an HTTP API on top of it.

Use the Python multiprocessing module. If you've already written it with the threading module, it is close to a drop-in replacement. Your data structure will live in shared memory and can be accessed by all processes concurrently without incurring the wrath of the GIL.

Obviously this does not fix the issue of Python just being super slow in general. It just lets you max out all your CPU cores instead of having just one core at 100% all the time.
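A minimal sketch of the pattern (a toy dict stands in for the 10s-of-GB structure): build the structure at module level before creating the pool, so forked workers can read the parent's copy without pickling it per call.

```python
import multiprocessing as mp

# Module-level, so it exists before the workers are forked/spawned.
BIG = {i: i * i for i in range(100_000)}

def lookup(key):
    # each worker reads the parent's pages (copy-on-write under fork)
    return BIG[key]

if __name__ == "__main__":
    with mp.Pool(processes=4) as pool:
        print(pool.map(lookup, [3, 7, 11]))  # [9, 49, 121]
```

Whether the memory is truly shared (vs. lazily copied) depends on the start method and on how the pages get touched — see the copy-on-write caveats discussed downthread.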

RayVR|2 years ago

Multiprocessing is not a real solution, it’s a break-glass procedure for when you just need to throw some cores at something without any hope for reliability. Unless something has changed since I used Python, it is essentially a wrapper around fork().

This means you need to deal with stuck/dead processes. I’ve used multiprocessing extensively and once you hit a certain amount of usage, even in a pool, you just get hangs and unresponsive processes.

I’ve also written a huge amount of Cython wrapped c++ code which releases the GIL. This never hangs and I can multithread there all I want without issue.

mort96|2 years ago

I want to warn people against multiprocessing in python though.

If you're thinking about parallelizing your Python process, chances are your Python code is CPU-bound. That's when you should stop and think, is Python really the right tool for this job?

From experience, translating a Python program into C++ or Rust often gives a speed-up of around 100x, without introducing threads. Go probably has a similar level of speed-up. So while you can spend a lot of time fighting Python to get it to consume 16x the compute resources for a 10x speed-up, you could often instead spend a similar amount of time rewriting the program for a 100x speed-up with the same compute resources. And then you could parallelize your Go/Rust/C++ program for another 10x, if necessary.

Of course, this is highly dependent on what you're actually doing. Maybe your Python code isn't the bottleneck, maybe your code spends 99% of its time in datastructure operations implemented in C and you need to parallelize it. Or maybe your use-case is one where you could use pypy and get the required speed-up. I just recognize from my own experience the temptation of parallelizing some Python code because it's slow, only to find that the parallelized version isn't that much faster (my computer is just hotter and louder), and then giving in and rewriting the code in C++.

coldtea|2 years ago

>Use the python multiprocessing module. If you've already written it with the multithreading module, it is a drop in replacement. Your data structure will live in shared memory

Only if it can be immutable. So it can't be shared and changed by multiple processes as needed (with synchronization).

And even if you can have it mostly immutable, if you need to refresh it (e.g. after some time read a newer large file from disk to load into your data structure), you can't without restarting the whole server and processes.

So, it could work for this case, but it's hardly a general solution for the problem.

alfalfasprout|2 years ago

Nowadays multiprocessing is rarely the answer. Between all the gotchas (memory usage can be horrific, you have to be careful about what you modify, etc.), it's almost never the right tool.

Numba is usually a better solution when you want to run some computationally expensive Python code that itself calls numpy, etc.

For the parent commenter's use case though that wouldn't be a great solution either. In general, Python does not have an optimal way of operating on a shared data structure across OS threads and certainly not in a way that doesn't require forking the interpreter.

dekhn|2 years ago

Over quite some time I've become convinced multiprocessing module is better than an optional GIL removal.

It may leave many useful bits on the table (compared to pure multithreaded coding, like C++/pthreads), but I've still been able to get it to scale my application performance (CPU-bound, large-memory) to the number of cores of even large boxes (96+ vCPUs). IIRC the concurrent.futures library was key to being productive.

20 years ago I would have said differently, as at the time IronPython demonstrated a real alternative to CPython that was faster and fully multithreaded (including the container classes).

amrx101|2 years ago

I don't really partake in programming "wars", but the idea of launching a set of separate processes instead of separate threads to do a bunch of IO has always seemed weird to me. Yes, I have built software using Python. Yes, I have done things as you suggest. Now I use asyncio, since the syntax has matured and I finally understand coroutines, runners, tasks, etc. Let's see where GIL-less Python takes us.

scrozart|2 years ago

Yup. I work at the Space Telescope Science Institute, where we maintain pipelines for astronomical data that move petabytes, among other things. All of the heavy lifting is done in Python.

the8472|2 years ago

Loading 100GB into RAM and then calling fork() is just painting a giant OOM Killer target on your back. It'll work until something breaks the CoWs or the parent gets restarted while some forks still linger or other fun things like that.

Threads make it transparent to the OS that this memory really must be shared between compute tasks.

godelski|2 years ago

This exists, but one of two things happen, which still significantly slows things down. Either 1) you generate multiple python instances or 2) you push the code to a different language. Both are cumbersome and have significant effects. The latter is more common in computational libraries like numpy or pytorch, but in this respect it is more akin to python being a wrapper for C/C++/Cuda. Your performance is directly related to the percentage of time your code spends within those computation blocks otherwise you get hammered by IO operations.

oivey|2 years ago

You have to manually set up shared memory with its own API that has its own limitations, right? I thought some seamless integration was a new feature, but AFAICT, transfers between processes still lead to things being pickled and copied. Am I wrong?
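(For context, the manual API I mean is `multiprocessing.shared_memory` from the stdlib (3.8+): it gives you raw byte buffers that processes attach to by name, with no pickling, but you have to lay your data out in those bytes yourself. Roughly:

```python
from multiprocessing import shared_memory

# Create a named block of raw bytes.
shm = shared_memory.SharedMemory(create=True, size=16)
shm.buf[:5] = b"hello"

# A second process would attach with SharedMemory(name=shm.name);
# here we attach from the same process just to show the API.
peer = shared_memory.SharedMemory(name=shm.name)
print(bytes(peer.buf[:5]))  # b'hello'

peer.close()
shm.close()
shm.unlink()  # only the creator should unlink
```

Anything you pass through a Queue or Pool.map, by contrast, does get pickled and copied.)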

whywhywhydude|2 years ago

If you have a non-trivial application, multiprocessing just takes a lot of memory. Every child process that you create duplicates the parent's memory. There are some interesting hacks like gc.freeze that exploit the copy-on-write behavior of fork to reduce memory, but ultimately you can only create a few hundred processes, compared to thousands of threads, because of memory consumption.
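The gc.freeze hack looks roughly like this (toy data standing in for the real structure). Note it only stops the cyclic GC from scanning, and thus dirtying, pre-fork pages in the children; refcount updates on reads can still dirty them.

```python
import gc

# Load the big structure once in the parent, before forking.
data = {i: str(i) for i in range(100_000)}

# Move everything allocated so far into the "permanent generation":
# the cyclic GC will never scan these objects again, so a collection
# in a forked child won't touch (and copy) their pages.
gc.freeze()

# ... now fork workers, e.g. multiprocessing with the "fork" start method ...
```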

nine_k|2 years ago

Multiprocessing is great. But then every process keeps its own copy of hundreds of gigabytes of stuff. May be okay, depending on how many processes you spawn.

If the bulk of the data is immutable (or at least never mutated), it can be safely shared though, via shared memory.

AlphaSite|2 years ago

Python is also going to get a JIT eventually, so they’re fixing that too! One of the concerns with no gil was that it would make certain optimisations harder for the JIT, but it’s very cool to see both being worked on.

bmitc|2 years ago

Or just use a language that was actually designed to be something other than a scripting language?

jgalt212|2 years ago

> Multiprocessing. The answer is to use the python multiprocessing module, or to spin up multiple processes behind wsgi or whatever.

I assume mod_wsgi under Apache was not the answer here due to memory constraints. That being said, why not serve from disk and use redis as a cache? This should work well unless the queries had high cardinality.

Waterluvian|2 years ago

No, that’s about right.

The response, which isn’t technically wrong, is “unless you’re CPU bound, your application should be parallelized with a WSGI server. You shouldn’t be loading all that up in memory, so it shouldn’t matter that you run 5 Python processes that each handle many, many concurrent I/O-bound requests.”

And this is kinda true… I’ve done it a lot. But it’s very inflexible. I hate programming architectures/patterns/whatnot where the answer is “no you’re doing it wrong. You shouldn’t be needing gigs of memory for your web server. Go learn task queues or whatever.” They’re not always wrong, but very regularly it’s the wrong time to worry about such “anti patterns.”

dotnet00|2 years ago

Yes, this is even more the case in languages that are popular with more "applied" programming audiences, like scientific computing. Telling them "no you should be using this complicated DBMS" (or whatever other acronym) is not productive.

It tends to get them exceptionally mad because their concern isn't the ideal way to write the code and architect the system, they simply want to write just enough code to continue their research, and even if they did care about proper architecture, they don't have the time or interest in learning/testing a new library for every little thing. They'd rather be putting that time reading up on their field of research.

knorker|2 years ago

Well, it's like showing your plan for painting a room, and asking "I seem to get stuck here after painting all but the corner, how do I get out of the corner?". The answer actually is "don't leave the corner for last".

Or like the martial arts student asking the master "how do I fight a guy 100m away with a rifle?" - "don't be there".

threatripper|2 years ago

You have a single big data structure that can't be shared easily between multiple processes. Can't you use multiprocessing anyway? Maybe by mapping the data structure to a file and mmapping that in multiple processes? Maybe by wrapping the whole thing in a database instead of just using one huge nested dictionary? To me, multi-threading sounds so much less painful than all the alternatives I could imagine. Just adding multi-threading could give you >10x improvement on current hardware without much extra work, if your data structure plays nice.
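Rough sketch of the mmap idea (a toy fixed-width record file; the layout is made up): serialize the structure to disk once, then each process maps the same file and the OS shares the physical pages between them.

```python
import mmap
import os
import struct
import tempfile

# Write a toy record file once: 100 little-endian 8-byte ints.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    for i in range(100):
        f.write(struct.pack("<q", i * i))

# Every process maps the same file read-only; pages are shared, not copied.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # random access by offset, no deserialization of the whole file
    (value,) = struct.unpack_from("<q", mm, 7 * 8)  # record 7 -> 49
    mm.close()

os.remove(path)
```

The catch, of course, is that you give up pythonic nested dicts for an explicit byte layout.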

dathinab|2 years ago

> You have a single big data structure that can't be shared easily between multiple processes. Can't you use multiprocessing with that? Maybe mapping the data structure to a file and mmapping that in multiple processes? Maybe wrapping the whole thing in database instead of just using one huge nested dictionary?

A ton of additional complexity, not worth it for many use-cases. And anything along the lines of "using multiple processes or threads to increase Python performance" does have (or at least did have) quite a bunch of additional foot-guns in Python.

In that context, porting a fairly trivial ad-hoc application to Java (or C# or Rust, depending on what know-how exists in the team) would be faster, or at least not much slower. And it would be more reliably estimable, by reducing the chance of unexpected issues like getting less perf than expected.

Basically, the moment "use mmap" or "use multi-processing" is a reasonable recommendation for something ad-hocish, there is something really wrong with the tools you use IMHO.

kroolik|2 years ago

One annoying part of multiprocessing in Python is that you could abuse the CoW mechanism to save on loading time when forking, but Python stores refcounts together with objects, so every single read updates a refcount, dirties the page, and busts your CoW savings.

Now, you wanted it simple, but got to fight with the memory model of a language that wasn't designed with performance in mind, for programs whose focus wasn't performance.

TylerE|2 years ago

I'd go for a db, yeah, or if that's a really painful mapping, this is, erm, actually the sort of thing Go is pretty good at, and it's not too hard to write a fairly simple program that will traverse your data structure and communicate via a JSON API or something. That's a useful technique in general - separate the big heavy awkward thing from your main web processes.

While I hate how verbose and inexpressive it is, Go does hit a sweet spot of fairly good performance, even multi-core, while still being GCed so it's not nearly as foreign for a native python user.

SanderNL|2 years ago

It sounds I/O heavy, but you mention it being CPU-heavy in which case I’d say Python is just not the right tool for the job although you may be able to cope with multiprocessing.

jeremycarter|2 years ago

Similar experience. Even with multiple processes and threads, Python is slow, very slow. Java, Go, and .NET all provide a very performant out-of-the-box experience.

__d|2 years ago

Python is both interpreted and quite dynamic. Both of these lead to lower performance when compared to less dynamic, compiled solutions. All of Java, Go, and .NET are compiled and (much) less dynamic.

This is absolutely an expected outcome.

ActorNightly|2 years ago

3.11 and on should be comparable to Java for most use cases with multiprocessing (set up correctly of course)

strictfp|2 years ago

My tip for this is Node.js and some stream-processing lib like Highland. You can get ridiculous IO parallelism with very little code and a nice API.

Python just scales terribly, no matter if you use multi-process or not. Java can get pretty good perf, but you'll need some libs or quite a bit of code to get nonblocking IO sending working well, or you're going to eat huge amounts of resources for moderate returns.

Node really excels at this use case. You can saturate the lines pretty easily.

hughesjj|2 years ago

0_o

Did I miss something? Does Node/Highland have good shared memory semantics these days?

I've always felt the best analogy to python concurrency was (node)js, but I admittedly haven't kept up all that well.

goatlover|2 years ago

Wouldn't Elixir or Go be better for this use case? Node still blocks on compute heavy tasks.

porridgeraisin|2 years ago

I think they mentioned CPU intensive work, which I'm taking to imply that it's more CPU bound than I/O bound. So unless you're suggesting they use Node's web workers implementation for parallelism, the default single threaded async concurrency model probably won't serve them well.

pid-1|2 years ago

Isn't Node single threaded, just like Python?

rrishi|2 years ago

I am not too deeply experienced with Python so forgive my ignorance.

But I am curious to understand why you were not able to utilize the concurrency tools provided in Python.

A quick google search gave me these relevant resources

1. An intro to threading in Python (https://realpython.com/intro-to-python-threading/#conclusion...)

2. Speed Up Your Python Program With Concurrency (https://realpython.com/python-concurrency/)

3. Async IO in Python: A Complete Walkthrough (https://realpython.com/async-io-python/)

Forgive me for my naivety. This topic has been bothering me for quite a while.

Several people complain about the lack of threading in Python but I run into plenty of blogs and books on concurrency in Python.

Clearly there is a lack in my understanding of things.

Jtsummers|2 years ago

Re (3): asyncio does not give you a boost for CPU bound tasks. It's a single-threaded, cooperative multi-tasking system that can (if you're IO bound) give you a performance boost.

wmwmwm|2 years ago

You can throw python threads at it, but if each request traverses the big old datastructure using python code and serialises a result then you’re stuck with only one live thread at a time (due to the GIL). In Java it’s so much easier especially if the datastructure is read only or is updated periodically in an atomic fashion. Every attempt to do something like this in python has led me to having to abandon nice pythonic datastructures, fiddle around with shared memory binary formats, before sighing and reaching for java! Especially annoying if the service makes use of handy libraries like numpy/pandas/scipy etc!

teraflop|2 years ago

The whole point of the GIL is that even if you use Python's threading or asyncio, you don't get any benefits from scaling beyond a single CPU core, because all of your threads (or coroutines) are competing for a single lock. They run "concurrently", but not actually in parallel. The pages you linked explain this in more detail.

In theory, multiprocessing could allow you to distribute the workload, but in a situation like OP describes -- just serving API requests based on a data structure -- the overhead of dispatching requests would likely be bigger than the cost of just handling the request in the first place. And your main server process is still a bottleneck for actually parsing the incoming requests and sending responses. So you're unlikely to see a significant benefit.

aardvark179|2 years ago

Threading in Python is fine if your threads are I/O bound or spend their time in a C extension which releases the GIL. If you are CPU bound, the GIL means effectively one thread can run at a time and you gain no advantage from multiple threads.
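To illustrate the I/O-bound case: here `time.sleep` stands in for any blocking call that releases the GIL (a socket read, a DB query), so ten waits overlap instead of running back to back.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(i):
    time.sleep(0.1)  # GIL is released while blocked, so others run
    return i * 2

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as ex:
    results = list(ex.map(fetch, range(10)))
elapsed = time.perf_counter() - start

# Ten 0.1 s waits overlap: wall time is roughly 0.1 s, not 1 s.
```

Replace the sleep with a pure-Python loop and the speed-up disappears, because the GIL is never released.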

indeedmug|2 years ago

I had this misunderstanding for a long time until I saw Go explain the difference: https://go.dev/blog/waza-talk

The confusion here is parallelism vs concurrency. Parallelism is executing multiple tasks at once and concurrency is the composition of multiple tasks.

For example, imagine there is a woodshop with multiple people and there is only one hammer. The people would be working on their projects such as a chair, a table, etc. Everyone needs to use the hammer to continue their project.

If someone needed a hammer, they would take the single hammer and use it. There are still other projects going on but everyone else would have to wait until the hammer is free. This is concurrency but not parallelism.

If there are multiple hammers, then multiple people could use the hammer at the same time and their project continues. This is parallelism and concurrency.

The hammer here is the CPU and the multiple projects are threads. When you have Python concurrency, you are sharing the hammer across different projects, but it's still one hammer. This is useful for dealing with blocking I/O but not computing bottlenecks.

Let's say that one of the projects needs wood from another place. There is no point in this project to hold on to the hammer when waiting for wood. This is what those Python concurrency libraries are solving for. In real life, you have tasks waiting on other services such as getting customer info from a database. You don't want the task to be wasting the CPU cycles doing nothing, so we can pass the CPU to another task.

But this doesn't mean that we are using more of the CPU. We are still stuck with a single core. If we have a compute bottleneck such as calculating a lot of numbers, then the concurrency libraries don't help.

You might be wondering why Python only allows for a single hammer/CPU core. It's because it's very hard to get parallelism properly working, you can end up with your program stalling easily if you don't do it correctly. The underlying data structures of Python were never designed with that in mind because it was meant to be a scripting language where performance wasn't key. Python grew massive and people started to apply Python to areas where performance was key. It's amazing that Python got so far even with GIL IMO.

As an aside, you might read about "multiprocessing" in Python, where you can use multiple CPU cores. This is true, but there are heavy overhead costs. It's like building brand-new workshops, each with a single hammer, to handle more projects. This post would get even longer if I explained what a "process" is, but to put it shortly, it is how the OS, such as Windows or Linux, manages tasks. There is a lot of overhead because it is meant to work with all sorts of different programs written in different languages.
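A toy version of the one-hammer situation: two CPU-bound "projects" run in threads, and both finish correctly, but on CPython they take turns holding the GIL, so wall time is about the same as running them one after the other.

```python
import threading

def count(n, out, i):
    total = 0
    for _ in range(n):
        total += 1  # pure-Python work: must hold the GIL ("the hammer")
    out[i] = total

results = [0, 0]
threads = [
    threading.Thread(target=count, args=(500_000, results, i))
    for i in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
# results == [500000, 500000], computed concurrently but not in parallel
```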

wood_spirit|2 years ago

That’s right.

In the past, for read-only data, I’ve used a disk file and relied on the OS page cache to keep it performant.

For read-write, using a raw file safely gets risky quickly. And alternative languages with parallelism runs rings around python.

So getting rid of the GIL and allowing parallelism will be a big boon.

xcv123|2 years ago

> I may have missed something

You did not miss anything. The GIL prevents parallel multi threading.

brightball|2 years ago

This is actually one of the reasons I was drawn to Ruby over Python. Ruby also has a GIL, but JRuby is an excellent option when needed.

antod|2 years ago

I wonder what led to JRuby attracting support while Jython didn't? I know the Jython creator went on to other things (was it e.g. IronPython for dotnet?). I suppose it was the inverse with dotnet - e.g. IronPython surviving while IronRuby seems dead.

Is it just down to corporate sponsorship?

severino|2 years ago

May I ask why you didn't consider writing that quick implementation in Java in the first place?

datadeft|2 years ago

I don't think that Python was designed for this. I found it largely unsuited for such work. It is much easier to saturate IO with (in no particular order) F#, Rust, or Java, which I have used in scenarios like the ones you mentioned.

nesarkvechnep|2 years ago

If your data doesn't change, you can leverage HTTP caching and lift a huge burden off of your service.

TylerE|2 years ago

Spin up as many processes as you need, map connections 1:1 to processes if possible.

lfkdev|2 years ago

You could maybe just use gunicorn and spawn multiple workers.

vorticalbox|2 years ago

Why not load the data into a SQLite DB and let the clients query that? Is there a reason you're loading 10s/100s of GB into memory?

qbasic_forever|2 years ago

Are you just reading from this data structure? If so I wouldn't do any locking or threading, I'd just use asyncio to serve up read requests to the data and it should scale quite well. Multithreading/processing is best for CPU limited workloads but this sounds like you're really just IO-bound (limited by the very high IO of reading from that data structure in memory).

If you're allowing writes to the shared data structure... I'd ask myself am I using the right tool for the job. A proper database server like postgres will handle concurrent writers much, much better than you could code up hastily. And it will handle failures, backups, storage, security, configuration, etc. far better than an ad hoc solution.

Jtsummers|2 years ago

> I'd just use asyncio to serve up read requests to the data and it should scale quite well.

Quoting GP:

>> often CPU heavy

We have to take their word for it that it's actually CPU heavy work, but if they're not lying and not mistaken then asyncio would do nothing for them.

tsimionescu|2 years ago

Reading from memory is really not IO. Perhaps you're suggesting doing something like mmapping a file to memory, putting the data structure in that memory, and then using asyncio on the file to serve things, but this would only work if you can compute byte ranges inside the file to serve ahead of time, in which case there are much simpler solutions anyway. Most likely, when receiving a query they need to actually search through the datastructure based on the query, and it's very likely that this is the bottleneck, not just reading some memory.