Has the Python GIL Been Slain? Subinterpreters in Python 3.8

[+] gmueckl|6 years ago|reply

Hm, this solution seems very cumbersome, inelegant and not like python's "batteries included" approach at all. This means that python will have native threads that behhave as expected minus true parallel execution, so you shouldn't use those, even though the interface is fairly simple. Instead, you should learn to use this weird contraption that is neither multiprocessing nor intuitive multithreading and comes with a cumbersome interface.

I get that the GIL is a very hard problem to solve, but this solution is so inelegant in my eyes that python would be better off without it. I'd feel better if this was a hidden implementation detail that coukd be improved transparently. Just my two cents.

[+] coldtea|6 years ago|reply

>This means that python will have native threads that behhave as expected minus true parallel execution, so you shouldn't use those, even though the interface is fairly simple.

Python already has exactly that, and has had that for ages.

>Instead, you should learn to use this weird contraption that is neither multiprocessing nor intuitive multithreading and comes with a cumbersome interface.

It also comes with performance improvements over multiprocess, so there's that.

Besides the "cumbersome interface" is irrelevant, as it would be easy to wrap and forget about it, the same way nobody really uses urllib directly.

[+] ru999gol|6 years ago|reply

I have the same opinion about asyncio, its such a bad API that its almost impossible to use correctly. But still, probably better than nothing.

[+] akvadrako|6 years ago|reply

I completely disagree - Python threads are basically "green threads", so they have their place but aren't related to parallelisation. But true multiprocessing is ugly when you have hundreds of cores, which is where CPUs are going. There is no standard UI convention on most OSes to group those processes per app, in terms of signals or stats or whatever.

So besides the unproven possibility of removing the GIL, subinterpreters are the best way forward, better than threads or the multiprocessing package.

[+] pmontra|6 years ago|reply

It's somewhat similar to the GIL removal effort in Ruby [1]

They are isolating the GIL into Guilds there, which are containers for language threads sharing the same GIL. They are providing two primitives for communication between threads in different guilds. Send, for immutable data (zero copy) and move, for mutable data (copy). They remove the need for the boiler plate code for marshalling and unmarshalling. However I bet that there will be some library to hide that code in Python too.

[1] http://www.atdot.net/%7Eko1/activities/2018_RubyElixirConfTa...

[+] Animats|6 years ago|reply

Now that's an interesting approach.

I proposed something similar for Python 9 years ago.[1] Guido didn't like it.

Objects would be either thread-local, shared and locked, or immutable. Thread-local objects must be totally inaccessible from other threads, and not leakable across thread boundaries, for memory safety. (Python has "thread local" objects now, but it's just naming, and not airtight against leaks. You can assign a thread-local object to a global variable.) Shared and locked objects lock when you enter, unlock when you leave. Objects are thread-local by default, so single-thread programs work as before.

Minimize shared and locked, while using thread-local or immutable objects as much as possible. Locking is needed only for shared and locked objects.

This is almost conventional wisdom today, but 9 years ago, it was too radical.

Retrofitting concurrency is never pretty. But we have to. Individual CPUs are about the same speed per thread that they were a decade ago.

[1] http://animats.com/papers/languages/pythonconcurrency.html

[+] riffraff|6 years ago|reply

IIUC, python's sub-interpreters won't have a `move`.

That might not be a bad idea because I am worried `move` will end up being problematic in ruby, but time will tell.

[+] FartyMcFarter|6 years ago|reply

> This, in turn, means that Python developers can utilize async code, multi-threaded code and never have to worry about acquiring locks on any variables or having processes crash from deadlocks.

Dangerous advice. Whether this is true or not depends on lots of things such as how many and which operations you're doing on those variables.

Sure, CPython might do lots of simple operations atomically, but this is not enough to avoid the need for all locks. Threads can still interleave their execution in many ways.

[+] tasubotadas|6 years ago|reply

The current state of threading and parallel processing in Python is a joke. While they are still clinging to the GIL and single core performance, the rest of the world is moving to 32 core (consumer) CPUs.

Python's performance, in general, is a crappy[1] and is beaten even by PHP these days. All the people that suggest relying on multiprocessing probably haven't done anything that's CPU and Memory intensive because if you have a code that operates on a "world-state" each new process will have to copy that from a parent. If the state takes ~10GB each process will multiply that.

Others keep suggesting Cython. Well, guess what? If I am required to use another programming language to use threads, I might as well go with Go/Rust/Java instead and save the trouble of dabbling with two languages.

So where does that leave (pure-)Python? It can only be used in I/O bound applications where the performance of the VM itself doesn't matter. So it's basically only used by web/desktop applications that CRUD the databases.

It's really amazing that the machine learning community has managed to hack around that with C-based libraries like SciPy and NumPy. However, my suggestion would be to drop GIL and copy the whatever model has been working for Go/Java/C#. If you can't drop GIL because some esoteric features depend on that, then drop them as well.

[1] https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

[+] AlexTWithBeard|6 years ago|reply

Cython is nice, but debugging it requires gdb. For the PyCharm-loving end-users it may be quite cumbersome.

Those recommending to use multiprocessing have probably never been in that bitter spot when serializing something and computing something takes exactly the same time.

Also forking didn't really work until Python 3.6.

[+] dual_basis|6 years ago|reply

The consistent requirement has been that Python will drop the GIL for anything that doesn't make single-threaded performance suffer. There has been substantial work to this end but no solution to date has achieved this goal.

[+] unknown|6 years ago|reply

[deleted]

[+] gray_-_wolf|6 years ago|reply

> If the state takes ~10GB each process will multiply that.

In POSIX there is such thing as copy-on-write memory during forks.. So if that state is mostly read-only, additional memory required by each slave should be minimal.

[+] juststeve|6 years ago|reply

> Go/Rust/Java

And there's also Kotlin.

[+] olliej|6 years ago|reply

This is essentially the same concurrency model as Workers in JS engines - on the one hand it’s a fairly limiting crutch[1], on the other hand it is harder to create a bunch of different classes of concurrency bugs.

[1] vs fully shared state of C-like, .NET, JVM, etc, etc. Rust’s no-shared-mutable state model allows it to do some fun stuff but python (and JS) don’t really have a strong concept of mutable vs immutable, let alone ownership so I don’t think it would be applicable?

[+] Animats|6 years ago|reply

This is just a way to do the same thing as "multiprocessing", but with less memory usage. You still have multiple Python instances that send messages back and forth.

I wonder if they ever fixed the CPickle bug which broke it if you were using CPickle from multiple threads.

[+] loeg|6 years ago|reply

Yeah, it's got some of the same weaknesses as multiprocessing (and several new ones). Conceivably you could provide an API for handing off objects to the other interpreter without copying. I'm imagining an API like:

  my_foo = interpreterX.pass_object(my_foo)

(The assignment being required to delete the originating reference from the source interpreter.) The interface would be obligated to check that there are no references that escape to the current interpreter and then my_foo and all referenced objects could be handed off to the other interpreter in whole.

I don't have any intuitions for if that would be cheaper than copying or not, and getting it right is certainly more difficult than serialization. (Because of the complexity, it's not worth having if it isn't cheaper.)

[+] mintplant|6 years ago|reply

Less memory usage, and - hopefully - without all the quirks that crop up with multiprocessing. Off the top of my head: subprocesses don't always want to die along with the main process; error conditions can cause the underlying IPC layer to end up in a permanently stalled state.

[+] gigatexal|6 years ago|reply

No, Mr. Click-baity-title it’s not. They’re still there just you can use many interpreters now like one would when using the multiprocessing module. I do like the idea of Go-like queues for message passing.

[+] sbierwagen|6 years ago|reply

Betteridge's law of headlines is an adage that states: "Any headline that ends in a question mark can be answered by the word no."

https://en.wikipedia.org/wiki/Betteridge%27s_law_of_headline...

[+] yingw787|6 years ago|reply

From my limited understanding, I think Eric Snow’s push to use subinterpreters is to move an orchestration layer for multiple Python processes from the service layer to the language layer. It may also modularize Pythons’s C API scope. It may also be one of the cheapest ways in order to provide for true CPU bound concurrency in Python, which is important given Python’s limited resources.

[+] MichaelMoser123|6 years ago|reply

Wow, just like perl threads since perl 5.8 (1) When in doubt, look at the granddaddy of scripting languages, all your trials and tribulations in scripting land have been considered in the past.. let's all sing 'living in the past' by Jethro Tull (2) this one is also good (3)

(1) https://perldoc.perl.org/threads.html

(2) https://m.youtube.com/watch?v=EsCyC1dZiN8

(3) https://m.youtube.com/watch?v=mXeoNX7DSc8

[+] andrewshadura|6 years ago|reply

Tcl has had threads that were subinterpreters since a decade ago or more. I find it quite ironic that Python, it would seem, is reinventing it, only in a less elegant way.

[+] rkeene2|6 years ago|reply

I'm personally glad that Python is (poorly) copying this feature from Tcl. This means it's closer to the time when JavaScript (poorly) copies it from Python ! ;-)

[+] cmacleod4|6 years ago|reply

Actually Tcl has been successfully using the model of one or more interpreters per thread since Tcl 8.1, released in 1999, a full TWO decades now.

[+] mixmastamyk|6 years ago|reply

The functionality was always there it just rusted over from disuse.

[+] yjftsjthsd-h|6 years ago|reply

Everything old is new again:)

[+] fithisux|6 years ago|reply

My thoghts exactly.

[+] bch|6 years ago|reply

This sounds like an application (or variation) of the apartment threading model[0]. Given the problem and it’s desrciption/characteristic (Global Interpretter Lock), this sounds like an elegant approach.

[0] https://docs.microsoft.com/en-us/windows/desktop/com/process...

[+] mballantyne|6 years ago|reply

Racket's "places" work a similar way, though do a bit extra to get down to one memory copy, rather than two: https://www.cs.utah.edu/plt/publications/dls11-tsffd.pdf

[+] Uptrenda|6 years ago|reply

There's nothing wrong with the GIL as long as you know its there. It makes writing concurrent code in Python semi-magical and thats a huge benefit. Concurrent != parallel though, so if there's really a need to scale up to multiple cores there's always the option of forking with multi-processing or "sub interpreters."

I can think of maybe having network code run in their own process and the UI in another. That way there's no risk of bottle necks slowing down the UI and transfers are likewise protected. If you look at bottle.py it seems that this approach could add A LOT of performance for managing downloads / uploads if it's done right.

[+] weberc2|6 years ago|reply

How does the GIL help you write concurrent code?

[+] cyphar|6 years ago|reply

> Another issue is that file handles belong to the process, so if you have a file open for writing in one interpreter, the sub interpreter won’t be able to access the file (without further changes to CPython).

Wouldn't just using CLONE_FILES when forking off interpreters solve this problem?

[+] unknown|6 years ago|reply

[deleted]

[+] qwerty456127|6 years ago|reply

> The GIL also means that whilst CPython can be multi-threaded, only 1 thread can be executing at any given time.

How does this make sense? What's the point of having multiple threads then?

[+] jcl|6 years ago|reply

It could be better phrased: "whilst CPython can be multi-threaded, only 1 thread can be executing Python code at any given time." Other threads can be doing other things at the same time -- just not actively interpreting Python bytecode.

[+] xkgt|6 years ago|reply

It is because only one thread at a time holds the lock in order to avoid race conditions. The keynote[1] by Raymond Hettinger from PyBay '17 will be a great place to start if you are new to this.

[1] https://youtu.be/9zinZmE3Ogk

[+] keypusher|6 years ago|reply

Not all operations are CPU bound. For anything that is IO bound, such as reading a file, db access, network calls, etc, CPython threads work just fine.

[+] pletnes|6 years ago|reply

some C libraries release the GIL before running CPU intensive computations. Examples include numpy and hashlib.

[+] isbvhodnvemrwvn|6 years ago|reply

Parallelism allows multiple threads interleave with each other. It does not guarantee parallelism (two or more threads executing at the same time). It's similar to multiple threads operating on a uniprocessor system, with the difference that I/O can happen in parallel.

[+] munchbunny|6 years ago|reply

I believe this is still beneficial in I/O bound processes.

[+] boulos|6 years ago|reply

The usual answer is: in the case of blocking I/O, the thread running send/recv can block while other python code runs.

In practice, this doesn’t work particularly well, as you rarely have massively I/O bound things in Python.

[+] riskneutral|6 years ago|reply

"How much overhead does a sub-interpreter have? Short answer: More than a thread, less than a process."

So ... No.

[+] Alex3917|6 years ago|reply

Are there any overall benchmarks for Python 3.8 yet? I know there are a bunch of performance improvements for calling functions and creating objects, but I have no idea how that translates to real software.

[+] dragonwriter|6 years ago|reply

Huh. This sounds a lot like Ruby Guilds. This looks it will land sooner, though likely in less complete form, as even the prototype Guild implementation has inter-guild communication.

[+] sciurus|6 years ago|reply

Some earlier coverage: https://lwn.net/Articles/754162/

131 comments