(no title)
wmwmwm|2 years ago
Historically I’ve written several services that load up some big data structure (10s or 100s of GB), then expose an HTTP API on top of it. Every time I’ve done a quick Python implementation of a service that then became popular (within a firm, so 100s or 1000s of clients), I’ve ended up having to rewrite it in Java so I can throw more threads at servicing the (often CPU-heavy) requests. I may have missed something, but I couldn’t figure out how to get multi-threaded performance out of Python, so of course no-GIL looks interesting for this!
iknownothow|2 years ago
1. For multiple processes use `gunicorn` [1]. It runs your app across multiple processes without you having to touch your code much. It's the same as having n instances of the same backend app, where n is the number of CPU cores you're willing to throw at it. One backend process per core, full isolation.
2. For multiple threads use `gunicorn` + `gevent` workers [2]. This provides multiprocessing plus multithreaded functionality out of the box if your workload is I/O intensive. It's not perfect, but it works very well in some situations.
3. Lastly, if CPU is where your bottleneck is, that means you have some memory to spare (even if it's not much). Throw an LRU cache or cachetools [3] over functions that return the same result, or functions that do expensive I/O.
[1]: https://www.joelsleppy.com/blog/gunicorn-sync-workers/
[2]: https://www.joelsleppy.com/blog/gunicorn-async-workers-with-...
[3]: https://pypi.org/project/cachetools/
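A minimal sketch of point 3, using the stdlib `functools.lru_cache` (cachetools offers similar decorators with TTL and size options); the function and key here are invented for illustration:

```python
from functools import lru_cache

@lru_cache(maxsize=4096)
def lookup(key):
    # Stand-in for an expensive computation or I/O call that
    # always returns the same result for the same key.
    return sum(ord(c) for c in key) * 2

lookup("user-42")                 # computed on first call
print(lookup.cache_info().hits)   # 0 hits so far
lookup("user-42")                 # served from the cache
print(lookup.cache_info().hits)   # now 1 hit
```

Note that `lru_cache` keys on the function arguments, so they must be hashable; for unhashable or very large keys, cachetools lets you supply a custom key function.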
danpalmer|2 years ago
1) gunicorn or any solution with multiple processes is going to just multiply the RAM usage. Using 10-100GB of RAM per effective thread makes this sort of problem very RAM bound, to the point that it can be hard to find hardware or VM support.
2) This isn't I/O bound.
3) If your service is fundamentally just looking up data in a huge in-memory data store, adding LRU caching around that is unlikely to make much of a difference because you're a) still doing a lookup in memory, just for the cache rather than the real data, and b) you're still subject to the GIL for those cache lookups.
I've also written services like this, we only loaded ~5GB of data, but it was sufficient to be difficult to manage in a few ways like this. The GIL-ectomy will probably have a significant impact on these sorts of use cases.
xmaayy|2 years ago
This will load up multiple processes like you say. OP loads a large dataset, and gunicorn would copy that dataset into each process. I have never figured out shared memory with gunicorn.
nwallin|2 years ago
Multiprocessing. The answer is to use the Python multiprocessing module, or to spin up multiple processes behind WSGI or whatever.
> Historically I’ve written several services that load up some big datastructure (10s or 100s of GB), then expose an HTTP API on top of it.
Use the Python multiprocessing module. If you've already written it with the threading module, it's nearly a drop-in replacement. Your data structure can live in shared memory and be accessed by all processes concurrently without incurring the wrath of the GIL.
Obviously this does not fix the issue of Python just being super slow in general. It just lets you max out all your CPU cores instead of having just one core at 100% all the time.
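A minimal stdlib sketch of the shared-memory idea, using `multiprocessing.shared_memory` (Python 3.8+). Note this shares a flat byte buffer, so a rich data structure would first need to be serialized into such a buffer; the names and payload here are invented for illustration:

```python
from multiprocessing import Process, shared_memory

def reader(name, length):
    # Attach to the existing block by name; no copy of the data is made.
    shm = shared_memory.SharedMemory(name=name)
    assert bytes(shm.buf[:length]) == b"big lookup table"
    shm.close()

if __name__ == "__main__":
    data = b"big lookup table"
    shm = shared_memory.SharedMemory(create=True, size=len(data))
    shm.buf[:len(data)] = data

    workers = [Process(target=reader, args=(shm.name, len(data)))
               for _ in range(4)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()

    shm.close()
    shm.unlink()  # free the block once all readers are done
```

The trade-off is exactly the one discussed downthread: this works cleanly for read-only or append-only flat data, but mutable, pointer-rich Python objects don't map onto it without extra serialization and locking.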
RayVR|2 years ago
This means you need to deal with stuck/dead processes. I’ve used multiprocessing extensively and once you hit a certain amount of usage, even in a pool, you just get hangs and unresponsive processes.
I’ve also written a huge amount of Cython-wrapped C++ code which releases the GIL. That never hangs, and I can multithread there all I want without issue.
mort96|2 years ago
If you're thinking about parallelizing your Python process, chances are your Python code is CPU-bound. That's when you should stop and think, is Python really the right tool for this job?
From experience, translating a Python program into C++ or Rust often gives a speed-up of around 100x, without introducing threads. Go probably has a similar level of speed-up. So while you can throw a lot of time fighting Python to get it to consume 16x the compute resources for a 10x speed-up, you could often instead spend a similar amount of time rewriting the program for a 100x speed-up with the same compute resources. And then you could parallelize your Go/Rust/C++ program for another 10x, if necessary.
Of course, this is highly dependent on what you're actually doing. Maybe your Python code isn't the bottleneck, maybe your code spends 99% of its time in datastructure operations implemented in C and you need to parallelize it. Or maybe your use-case is one where you could use pypy and get the required speed-up. I just recognize from my own experience the temptation of parallelizing some Python code because it's slow, only to find that the parallelized version isn't that much faster (my computer is just hotter and louder), and then giving in and rewriting the code in C++.
coldtea|2 years ago
Only if it can be immutable. So it can't be shared and changed by multiple processes as needed (with synchronization).
And even if you can have it mostly immutable, if you need to refresh it (e.g. after some time read a newer large file from disk to load into your data structure), you can't without restarting the whole server and processes.
So, it could work for this case, but it's hardly a general solution for the problem.
alfalfasprout|2 years ago
Nowadays numba is usually a better solution for when you want to run some computationally expensive python code that itself calls numpy, etc.
For the parent commenter's use case though that wouldn't be a great solution either. In general, Python does not have an optimal way of operating on a shared data structure across OS threads and certainly not in a way that doesn't require forking the interpreter.
dekhn|2 years ago
It may leave many useful bits on the table (compared to pure multithreaded coding, like C++/pthreads), but I've still been able to scale my application's performance (CPU-bound, large-memory) to the number of cores of even large boxes (96+ vCPUs). IIRC the concurrent.futures library was key to being productive.
Twenty years ago I would have said differently, as at the time IronPython demonstrated a real alternative to CPython that was faster and fully multithreaded (including the container classes).
the8472|2 years ago
Threads make it transparent to the OS that this memory really must be shared between compute tasks.
nine_k|2 years ago
If the bulk of the data is immutable (or at least never mutated), it can be safely shared though, via shared memory.
jgalt212|2 years ago
I assume mod_wsgi under apache was not the answer here due to memory constraints. That being said, why not serve from disk and use redis for a cache. This should work well unless the queries had high cardinality.
Waterluvian|2 years ago
The response, which isn’t technically wrong, is “unless you’re CPU bound, your application should be parallelized with a WSGI server. You shouldn’t be loading all that up in memory, so it shouldn’t matter that you run 5 Python processes that each handle many, many concurrent I/O-bound requests.”
And this is kinda true… I’ve done it a lot. But it’s very inflexible. I hate programming architectures/patterns/whatnot where the answer is “no you’re doing it wrong. You shouldn’t be needing gigs of memory for your web server. Go learn task queues or whatever.” They’re not always wrong, but very regularly it’s the wrong time to worry about such “anti patterns.”
dotnet00|2 years ago
It tends to get them exceptionally mad because their concern isn't the ideal way to write the code and architect the system, they simply want to write just enough code to continue their research, and even if they did care about proper architecture, they don't have the time or interest in learning/testing a new library for every little thing. They'd rather be putting that time reading up on their field of research.
knorker|2 years ago
Or like the martial arts student asking the master "how do I fight a guy 100m away with a rifle?" - "don't be there".
dathinab|2 years ago
A ton of additional complexity, not worth it for many use-cases, and anything along the lines of "using multiple processes or threads to increase Python performance" does have (or at least did have) quite a bunch of additional foot guns in Python.
In that context, porting a very trivial ad-hoc application to Java (or C# or Rust, depending on what know-how exists in the team) would be faster, or at least not much slower, to do. But it would be reliably estimable, by reducing the chance of any unexpected issues, like less perf than expected.
Basically, the moment "use mmap" or "use multi-processing" is a reasonable recommendation for something ad-hoc-ish, there is something really wrong with the tools you use, IMHO.
kroolik|2 years ago
Now, you wanted it simple, but you end up fighting the memory model of a language that wasn't designed with performance in mind, for programs whose focus wasn't performance.
TylerE|2 years ago
While I hate how verbose and inexpressive it is, Go does hit a sweet spot of fairly good performance, even multi-core, while still being GCed so it's not nearly as foreign for a native python user.
__d|2 years ago
This is absolutely an expected outcome.
strictfp|2 years ago
Python just scales terribly, whether or not you use multiple processes. Java can get pretty good perf, but you'll need some libs or quite a bit of code to get nonblocking IO working well, or you're going to eat huge amounts of resources for moderate returns.
Node really excels at this use case. You can saturate the lines pretty easily.
hughesjj|2 years ago
Did I miss something? Does nodes/highland have good shared memory semantics these days?
I've always felt the best analogy to python concurrency was (node)js, but I admittedly haven't kept up all that well.
rrishi|2 years ago
But I am curious to understand why you were not able to utilize the concurrency tools provided in Python.
A quick Google search gave me these relevant resources:
1. An intro to threading in Python (https://realpython.com/intro-to-python-threading/#conclusion...)
2. Speed Up Your Python Program With Concurrency (https://realpython.com/python-concurrency/)
3. Async IO in Python: A Complete Walkthrough (https://realpython.com/async-io-python/)
Forgive me for my naivety. This topic has been bothering me for quite a while.
Several people complain about the lack of threading in Python but I run into plenty of blogs and books on concurrency in Python.
Clearly there is a lack in my understanding of things.
teraflop|2 years ago
In theory, multiprocessing could allow you to distribute the workload, but in a situation like OP describes -- just serving API requests based on a data structure -- the overhead of dispatching requests would likely be bigger than the cost of just handling the request in the first place. And your main server process is still a bottleneck for actually parsing the incoming requests and sending responses. So you're unlikely to see a significant benefit.
indeedmug|2 years ago
The confusion here is parallelism vs concurrency. Parallelism is executing multiple tasks at once and concurrency is the composition of multiple tasks.
For example, imagine there is a woodshop with multiple people and there is only one hammer. The people would be working on their projects such as a chair, a table, etc. Everyone needs to use the hammer to continue their project.
If someone needed a hammer, they would take the single hammer and use it. There are still other projects going on but everyone else would have to wait until the hammer is free. This is concurrency but not parallelism.
If there are multiple hammers, then multiple people could use the hammer at the same time and their project continues. This is parallelism and concurrency.
The hammer here is the CPU and the multiple projects are threads. When you have Python concurrency, you are sharing the hammer across different projects, but it's still one hammer. This is useful for dealing with blocking I/O but not computing bottlenecks.
Let's say that one of the projects needs wood from another place. There is no point in this project to hold on to the hammer when waiting for wood. This is what those Python concurrency libraries are solving for. In real life, you have tasks waiting on other services such as getting customer info from a database. You don't want the task to be wasting the CPU cycles doing nothing, so we can pass the CPU to another task.
But this doesn't mean that we are using more of the CPU. We are still stuck with a single core. If we have a compute bottleneck such as calculating a lot of numbers, then the concurrency libraries don't help.
You might be wondering why Python only allows for a single hammer/CPU core. It's because it's very hard to get parallelism working properly; you can easily end up with your program stalling if you don't do it correctly. The underlying data structures of Python were never designed with that in mind, because it was meant to be a scripting language where performance wasn't key. Python grew massive, and people started to apply Python to areas where performance was key. It's amazing that Python got so far even with the GIL, IMO.
As an aside, you might read about "multiprocessing" in Python, where you can use multiple CPU cores. This is true, but there are heavy overhead costs. It's like building brand-new workshops, each with a single hammer, to handle more projects. This post would get even longer if I explained what a "process" is, but to put it shortly, it's how the OS, such as Windows or Linux, manages tasks. There is a lot of overhead because it's meant to work with all sorts of different programs written in different languages.
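The hammer analogy can be demonstrated with a short timing sketch (worker counts and workload sizes are arbitrary). The same pure-Python CPU work is run in a thread pool, where the GIL serializes it, and in a process pool, which on a multi-core machine can typically run it in parallel:

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def count(n):
    # Pure-Python CPU work: holds the GIL for its whole duration.
    total = 0
    for i in range(n):
        total += i
    return total

if __name__ == "__main__":
    work = [2_000_000] * 4
    for pool_cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        start = time.perf_counter()
        with pool_cls(max_workers=4) as pool:
            list(pool.map(count, work))
        print(pool_cls.__name__, f"{time.perf_counter() - start:.2f}s")
```

If `count` instead slept (simulating I/O), the thread pool would do fine, because the GIL is released while waiting; that is exactly the concurrency-without-parallelism case described above.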
wood_spirit|2 years ago
In the past, for read-only data, I’ve used a disk file and relied on the OS page cache to keep it performant.
For read-write, using a raw file safely gets risky quickly. And alternative languages with parallelism run rings around Python.
So getting rid of the GIL and allowing parallelism will be a big boon.
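A minimal stdlib sketch of the read-only approach the parent describes, using `mmap`; the file name and record format are invented for illustration:

```python
import mmap
import os
import tempfile

# Write a small file standing in for the big read-only dataset.
path = os.path.join(tempfile.mkdtemp(), "table.bin")
with open(path, "wb") as f:
    f.write(b"key0:value0\nkey1:value1\n")

with open(path, "rb") as f:
    # length=0 maps the whole file; ACCESS_READ keeps the mapping
    # read-only, so the pages live in the OS page cache and can be
    # shared by every process that maps the same file.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    assert mm.find(b"key1:") != -1
    mm.close()
```

Because the mapping is backed by the page cache, N worker processes mapping the same file pay for the data roughly once in RAM, instead of N times as with per-process copies.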
xcv123|2 years ago
You did not miss anything. The GIL prevents parallel multi threading.
antod|2 years ago
Is it just down to corporate sponsorship?
qbasic_forever|2 years ago
If you're allowing writes to the shared data structure... I'd ask myself am I using the right tool for the job. A proper database server like postgres will handle concurrent writers much, much better than you could code up hastily. And it will handle failures, backups, storage, security, configuration, etc. far better than an ad hoc solution.
Jtsummers|2 years ago
Quoting GP:
>> often CPU heavy
We have to take their word for it that it's actually CPU heavy work, but if they're not lying and not mistaken then asyncio would do nothing for them.