> Python currently has a single global interpreter lock per process, which prevents multi-threaded parallelism. This work, described in PEP 684, is to make all global state thread safe and move to a global interpreter lock (GIL) per sub-interpreter. Additionally, PEP 554 will make it possible to create subinterpreters from Python (currently a C API-only feature), opening up true multi-threaded parallelism.
Very basic question: in a world where a Python program can spin up multiple subinterpreters, each of which can then execute on a separate CPU core (since they don't share a GIL), what will the best mechanisms be for passing data between those subinterpreters?
> There are a number of valid solutions, several of which may be appropriate to support in Python. This proposal provides a single basic solution: “channels”. Ultimately, any other solution will look similar to the proposed one, which will set the precedent. Note that the implementation of Interpreter.run() will be done in a way that allows for multiple solutions to coexist, but doing so is not technically a part of the proposal here.
> Regarding the proposed solution, “channels”, it is a basic, opt-in data sharing mechanism that draws inspiration from pipes, queues, and CSP’s channels.
> As simply described earlier by the API summary, channels have two operations: send and receive. A key characteristic of those operations is that channels transmit data derived from Python objects rather than the objects themselves. When objects are sent, their data is extracted. When the “object” is received in the other interpreter, the data is converted back into an object owned by that interpreter.
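A rough sketch of those send/receive semantics in plain Python, with pickle standing in for whatever serialization the interpreter actually uses (the `Channel` class here is a toy for illustration, not the real PEP 554 API):

```python
import pickle
import queue

class Channel:
    """Toy stand-in for a PEP 554-style channel: send() extracts the
    object's data, recv() rebuilds a fresh object on the other side."""

    def __init__(self):
        self._buf = queue.Queue()

    def send(self, obj):
        # Transmit data derived from the object, not the object itself.
        self._buf.put(pickle.dumps(obj))

    def recv(self):
        # Reconstruct a new object "owned" by the receiving side.
        return pickle.loads(self._buf.get())

ch = Channel()
payload = {"task": "sum", "args": [1, 2, 3]}
ch.send(payload)
received = ch.recv()
assert received == payload       # same data...
assert received is not payload   # ...but a distinct object
```

The two final assertions are the key property: the receiver gets equal data but never a reference to the sender's object.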
> Very basic question: [not basic at all question which has been the subject of decades of research and produced several specialized programming models]
(Brackets my own of course.)
Sharing data in concurrent programs is not trivial, especially in environments where data is mutable. The simplest answer to the question is “message passing”, as in the Smalltalk notion of OOP or the Erlang/OTP actor model. Some solutions look much more like working with a database (software transactional memory). Some models that seem designed for an entirely different problem space are also compelling (various state models common in UI and games, like reactivity and Entity Component Systems).
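To make the message-passing option concrete, here is a minimal actor-ish sketch with plain threads and queues (no real parallelism under today's GIL; the point is the ownership discipline, not speed):

```python
import threading
import queue

# Minimal message-passing sketch: the worker owns its state dict; other
# threads interact only by sending messages, never by sharing the dict.
inbox = queue.Queue()
results = queue.Queue()

def worker():
    state = {"total": 0}  # owned exclusively by the worker thread
    while True:
        msg = inbox.get()
        if msg is None:   # sentinel: shut down
            break
        state["total"] += msg
    results.put(state["total"])

t = threading.Thread(target=worker)
t.start()
for n in (1, 2, 3):
    inbox.put(n)
inbox.put(None)
t.join()

total = results.get()
assert total == 6
```

Because no other thread ever holds a reference to `state`, there is nothing to lock, which is exactly what makes this model attractive for subinterpreters.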
This plan sounds very much like Ruby Ractors, which are essentially sub-interpreters, each with their own GVL.
Shareable data is basically immutable data plus classes/modules, and unshareable data can be transmitted via push (send+receive) or pull (yield+take). Transmission implies either deep copying (which "forks" the instances) or moving with an ownership change (the sender then loses access).
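A rough Python analogy of the two transmission modes (the `Moved` sentinel and `move` helper below are made up for illustration; Ruby raises an error when the sender touches a moved object):

```python
import copy

# Copy transmission: the receiver gets a deep-copied "fork"; mutating
# the fork doesn't touch the sender's instance.
original = {"items": [1, 2, 3]}
forked = copy.deepcopy(original)
forked["items"].append(4)
assert original["items"] == [1, 2, 3]

# Move transmission: ownership changes hands and the sender loses access.
class Moved:
    """Sentinel standing in for a 'this object was moved away' error."""

def move(cell):
    # 'cell' is a one-slot list acting as the sender's reference.
    value, cell[0] = cell[0], Moved()
    return value

senders_ref = [{"items": [1, 2, 3]}]
received = move(senders_ref)
assert received["items"] == [1, 2, 3]
assert isinstance(senders_ref[0], Moved)  # sender can no longer reach it
```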
If the goal is performance, then, since subinterpreters run in the same process, the answer is global shared state. You can't use Python objects across subinterpreters, but raw byte arrays will work just fine, provided you do your own locking correctly around all of that.
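A minimal sketch of that pattern, using threads as stand-ins for subinterpreters sharing one process: raw bytes plus an explicit lock, no Python objects crossing the boundary.

```python
import threading

# A raw byte region shared by workers, guarded by explicit locking.
buf = bytearray(8)
lock = threading.Lock()

def add_to_counter(amount):
    with lock:
        value = int.from_bytes(buf, "little")
        buf[:] = (value + amount).to_bytes(8, "little")

threads = [threading.Thread(target=add_to_counter, args=(1,))
           for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every increment survived because the read-modify-write was locked.
assert int.from_bytes(buf, "little") == 100
```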
Python would need to implement a multi-producer, multi-consumer ring buffer or a non-blocking algorithm (I'm not sure if it is wait-free), such as the actor system I implemented below.
To apply this to Python, the subinterpreters could transfer ownership of the refcounts between subinterpreters as part of an enqueue and dequeue.
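A lock-based sketch of that enqueue/dequeue ownership handoff (the design described here is lock-free; this simplified Python version uses a lock and condition variables, with dequeue dropping the buffer's reference to model the transfer):

```python
import threading

class RingBuffer:
    """Bounded multi-producer/multi-consumer buffer (simplified,
    lock-based sketch, not a lock-free implementation)."""

    def __init__(self, capacity):
        self._items = [None] * capacity
        self._capacity = capacity
        self._head = 0  # next slot to dequeue from
        self._tail = 0  # next slot to enqueue into
        self._count = 0
        lock = threading.Lock()
        self._not_full = threading.Condition(lock)
        self._not_empty = threading.Condition(lock)

    def enqueue(self, item):
        with self._not_full:
            while self._count == self._capacity:
                self._not_full.wait()
            self._items[self._tail] = item
            self._tail = (self._tail + 1) % self._capacity
            self._count += 1
            self._not_empty.notify()

    def dequeue(self):
        with self._not_empty:
            while self._count == 0:
                self._not_empty.wait()
            item = self._items[self._head]
            self._items[self._head] = None  # buffer gives up its reference
            self._head = (self._head + 1) % self._capacity
            self._count -= 1
            self._not_full.notify()
            return item

rb = RingBuffer(capacity=2)
rb.enqueue("a")
rb.enqueue("b")
assert rb.dequeue() == "a"
assert rb.dequeue() == "b"
```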
I believe the refcount locking approach has scalability problems between threads.
I implemented a multithreaded actor system with work stealing in Java, and message passing can reach throughputs of around 50-60 million messages per second without blocking or mutexes. The only lock is not quite a spinlock. I use an algorithm I created, inspired by this whitepaper [1], which is simple but works. It's probably a known algorithm, but I'm not sure of its name.
I have a multidimensional array of actor inboxes (each actor has multiple buffers that other threads fill, to lower contention to zero), and there is an integer stored for the thread that is trying to read or write to the critical section.
The threads all scan this multidimensional array forward and backward to see if another thread is in the critical section. If nobody is there, a thread marks the critical section, then scans again to see if its claim is still valid. It's similar to going into a room and scanning the room left, then scanning the room right. Surprisingly, this leads to thread safety. I wrote a Python model checker to verify the algorithm is correct.
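The inbox layout itself can be sketched single-threaded in Python (data layout only; the scanning lock described above is omitted, and all names are illustrative):

```python
from collections import deque

# inboxes[dst][src] is written by exactly one sender and drained by one
# receiver, which is what pushes contention toward zero.
NUM_ACTORS = 4
inboxes = [[deque() for _ in range(NUM_ACTORS)]
           for _ in range(NUM_ACTORS)]

def send(src, dst, msg):
    # Only thread 'src' ever appends to inboxes[dst][src].
    inboxes[dst][src].append(msg)

def drain(dst):
    # Only actor 'dst' ever pops from its own row of inboxes.
    received = []
    for src in range(NUM_ACTORS):
        while inboxes[dst][src]:
            received.append(inboxes[dst][src].popleft())
    return received

send(0, 3, "hello")
send(1, 3, "world")
assert drain(3) == ["hello", "world"]
```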
Without message generation within threads, it can communicate and sum 1 billion integers in 1 second due to parallelism (it takes 2 seconds to do this with one thread). It takes advantage of the idea that a variable assignment can transfer any amount of data.
See Actor2.java (1 billion sums per second, messages created in advance), Actor2MessageGeneration.java (20 million requests per second, messages created as we go), or Actor2ParallelMessageCreation.java (50-60 million requests per second, with parallel message creation).
There's also a Java multi-producer, multi-consumer ring buffer in this repository [3], which I ported from Alexander Krizhanovsky's work [2].
I think it's been decided that such a change would be so large that it would require a major version change in Python. However, that may be unauthoritative hearsay, probably from another comment thread here on HN. But it stands to reason that removing the GIL will almost certainly change Python's memory model in ways that could break code, which would warrant a major version bump.
AFAIK it was implemented in 3.11. That is, all of it except for the GIL removal itself, which actually decreased performance for single-threaded code; the actual improvement was elsewhere.
I think this is so smart. The main thing holding back replacement of the GIL at the moment is that there is a VAST existing ecosystem of Python packages written in C/etc that would likely break without it.
Multiple interpreters with their own GIL keep all of that existing code working without any changes, and mean we can run a Python program on more than one CPU at the same time.
It comes at a cost, of course.
You don't really have shared memory state, which is often the easiest thing to reason about conceptually.
So you are just transforming the problem into a data sharing problem between interpreters, which requires careful thought on both the language side for abstractions, and the consumer side to use right.
It also makes the tooling and verification much harder in practice - for example, you aren't reasoning about deadlocks in a single process anymore, but both within a single process and across any processes it communicates with.
At an abstract level, they are transformable into each other.
At a pragmatic level, well, there is a good reason you mostly see tooling for single-process multi-threaded programs :)
If all global state is made thread safe, then whether the threads live in subinterpreters or a single interpreter is conceptually irrelevant, and a single interpreter is probably easier to implement.
I'm glad to see this as an outline, which is how I structure most of my project work. It can be hard for others to follow, but it's very concise and scannable (just read the first indentation level for the top-level idea).
To paraphrase Adam Savage from his excellent book, Every Tool's a Hammer: lists [of lists] are a very powerful way to tame the inherent complexity of any project worth doing.
This project has already landed improvements in 3.10, and some much bigger improvements in 3.11. This work for 3.12 is "just" a continuation of that excellent effort:
It's going to be a bit of a chicken-and-egg problem: core Python will need to prove it's worthwhile for extension devs to implement, and core Python will struggle without support from extension devs. We shall see.
Not exactly. I would describe this as coming from the Microsoft faction of the Python Software Foundation. So yes, some members of the Python Software Foundation (mainly Microsoft employees) are behind this, but not all members are.
It’s a bit frustrating to see the first item relate to parallelism and the GIL. Anybody doing parallel compute in Python has long since worked around these issues. IMHO Python needs better single-threaded performance first; once all the juice has been squeezed from that lemon, we can sit down and get serious about improving multi-threaded ergonomics.
I don't really use Python if I can help it, but I'm still really glad to see people working on this. Whether I like it or not Python will probably always be some part of my job and I really appreciate that there's finally some focus on it getting faster that isn't just "write that part in C".
It's not a case of money, IMHO; it's that Python is a juggernaut of a userbase and ecosystem that moves very slowly, and improvements to execution times are (generally) incremental changes, not paradigm shifts, since paradigm shifts make backwards compatibility a nightmare or outright impossible.
I mean, 2.x is still in the wild, and some companies even provide support for it!
This is possible, but it would need some backwards-incompatible changes in the object model. We are still likely to see a Python 4 one day. People still remember the pain the Python 3 transition caused.
Unlikely. Python has lots of features that were added without any thought to how to make them run fast - it simply wasn't a goal. As a result Python includes a ton of dynamic features that make it really hard to optimise.
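For example, even a trivial method call can't be bound ahead of time, because any code may rebind the method at runtime (a small sketch):

```python
# One flavour of the dynamism that defeats static optimisation: the
# method a call site resolves to can change under its feet.
class Greeter:
    def greet(self):
        return "hello"

g = Greeter()
assert g.greet() == "hello"

# Any code, anywhere, may rebind the method at runtime:
Greeter.greet = lambda self: "bonjour"
assert g.greet() == "bonjour"
```

An optimiser therefore can't compile `g.greet()` down to a direct call without guarding against exactly this kind of mutation.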
Are there any plans to remove threading from Python?
A year or two ago I read up on the various efforts to make a fast, more parallel CPython, and one of the core underlying problems seemed to be the use of machine threads, resulting in a very high locking load as the large (potentially unlimited) number of threads attempted to defend against each other.
Letting an operating system run random fragments of your code at random times is very much a self-inflicted wound, so I was wondering if the Python community has any plans to not do that anymore?
I would seriously consider a RISC-V assembly port of a Python interpreter.
Removing fanatic compiler abuse is always a good thing. That said, I have seen some assembler macro abuse (some assemblers out there have extremely powerful and complex macro preprocessors), so the hard part would be not to abuse the assembler's macro preprocessor instead.
I know it is not about making Python actually "faster", but with a Python implementation that does not require those grotesquely and absurdly massive compilers, the SDK stack would be far more reasonable from a technical-cost standpoint.
https://peps.python.org/pep-0554/#shared-data
See here for details: https://docs.ruby-lang.org/en/master/ractor_md.html
[1]: https://lag.net/papers/content/leftright-extended.pdf
[2]: https://www.linuxjournal.com/content/lock-free-multi-produce...
[3]: https://github.com/samsquire/multiversion-concurrency-contro...
https://github.com/colesbury/nogil/
> Implement PEP 554
> PEP 554 - Multiple Interpreters in the Stdlib
That's going to be fun. Why fight the GIL when multithreading if you can just get around it with more interpreters?
https://www.phoronix.com/review/python-311-benchmarks/4
This will break many modules. Basically any that use static variables, which is done pretty much everywhere.
Gotta admit, that sounds pretty official.
I'll check back in a decade.