Faster CPython 3.12 Plan

[+] simonw|3 years ago|reply

https://github.com/faster-cpython/ideas/wiki/Python-3.12-Goa... is interesting too.

> Python currently has a single global interpreter lock per process, which prevents multi-threaded parallelism. This work, described in PEP 684, is to make all global state thread safe and move to a global interpreter lock (GIL) per sub-interpreter. Additionally, PEP 554 will make it possible to create subinterpreters from Python (currently a C API-only feature), opening up true multi-threaded parallelism.

Very basic question: in a world where a Python program can spin up multiple subinterpreters, each of which can then execute on a separate CPU core (since they don't share a GIL), what will the best mechanisms be for passing data between those subinterpreters?

[+] theandrewbailey|3 years ago|reply

> There are a number of valid solutions, several of which may be appropriate to support in Python. This proposal provides a single basic solution: “channels”. Ultimately, any other solution will look similar to the proposed one, which will set the precedent. Note that the implementation of Interpreter.run() will be done in a way that allows for multiple solutions to coexist, but doing so is not technically a part of the proposal here.

> Regarding the proposed solution, “channels”, it is a basic, opt-in data sharing mechanism that draws inspiration from pipes, queues, and CSP’s channels.

> As simply described earlier by the API summary, channels have two operations: send and receive. A key characteristic of those operations is that channels transmit data derived from Python objects rather than the objects themselves. When objects are sent, their data is extracted. When the “object” is received in the other interpreter, the data is converted back into an object owned by that interpreter.

https://peps.python.org/pep-0554/#shared-data

[+] eyelidlessness|3 years ago|reply

> Very basic question: [not basic at all question which has been the subject of decades of research and produced several specialized programming models]

(Brackets my own of course.)

Sharing data in concurrent programs is not trivial, especially in environments where data is mutable. The most trivial answer to the question is “message passing”, as in the SmallTalk notion of OOP or the Erlang/OTP Actor Model. Some solutions look much more like working with a database (Software Transactional Memory). Some models that seem entirely designed for a different problem space are also compelling (various state models common in UI and games like reactivity and Entity Component systems).

[+] lloeki|3 years ago|reply

This plan sounds very much like Ruby Ractors, which are essentially sub-interpreters, each with their own GVL.

Shareable data is basically immutable data + classes/modules, and unshareable data can be transmitted via push (send+receive) or pull (yield+take). Transmission implies either deep copying (which "forks" the instances) or moving with ownership change (sender then loses access)

See here for details: https://docs.ruby-lang.org/en/master/ractor_md.html

[+] int_19h|3 years ago|reply

Depends on what criteria you use for "best".

If it's performance, then, since subinterpreters run in the same process, it would be global shared state. You can't use Python objects across subinterpreters, but raw byte arrays will work just fine, provided you do your own locking correctly around all that.

[+] samsquire|3 years ago|reply

Python would need to implement a multiconsumer multiproducer ringbuffer or a blocking free algorithm (I'm not sure if it is wait free) such as the actor system I implemented below

To apply to Python The subinterpreters could transfer ownership of the refcounts between subinterpreters as part of an enqueue and dequeue.

I believe the refcount locking approach has scalability problems between threads.

I implemented a multithreaded actor system with work stealing in Java and message passing can get to throughputs of around 50-60 million messages per second without blocking or mutexes. The only lock is not quite a spinlock. I use an algorithm I created but inspired by this whitepaper [1], which is simple but works. It's probably a known algorithm but I'm not sure of the name of it.

I have a multidimensional array of actor inboxes (each actor has multiple buffers for filling by other threads to lower contention to 0) then there is an integer stored for the thread that is trying to read or write to the critical section.

The threads all scan this multidimensional array forwardly and backwardly to see if another thread is in the critical section. If nobody is there, it marks the critical section. It then scans again to see if it is still valid. it's similar to going into a room and scanning the room left and scanning the room right. Surprisingly this leads to thread safely. I wrote a python model checker to verify the algorithm is correct.

Without message generation within threads, it can communicate and sum 1 billion integers in 1 second due to parallelism (it takes 2 seconds to do this with one thread) It takes advantage of the idea that variable assignment can transfer any amount of data in an assignment.

See Actor2.java (1 billion sums a second messages created in advance), Actor2MessageGeneration.java (20 million requests per second, messages created as we go) or Actor2ParallelMessageCreation.java (50-60 million requests per second, with parallel message creation)

There's also a Java multiconsumer multiproducer ringbuffer in this repository [3] too which I ported from Alexander Krizhanovsky [2]

[1]: https://lag.net/papers/content/leftright-extended.pdf

[2]: https://www.linuxjournal.com/content/lock-free-multi-produce...

[3]: https://github.com/samsquire/multiversion-concurrency-contro...

[+] umanwizard|3 years ago|reply

Does anyone know what came of Sam Gross’s proof of concept that removed the GIL entirely? Is that proposal effectively dead?

[+] dwrodri|3 years ago|reply

I think it's been decided that such a change was so large that it would require a major version change in Python. However, I think that was unauthoritative hearsay probably in another comment thread here on HN. But it stands to reason that removing the GIL will almost certainly change Python's memory model in some ways that could break code in ways that warrant a major version bump.

[+] js2|3 years ago|reply

Looks like it's still active to me:

https://github.com/colesbury/nogil/

[+] BerislavLopac|3 years ago|reply

AFAIK it was implemented in 3.11. That is, all of it except of the GIL removal itself, which actually decreased performance for single-threaded code; the actual improvement was elsewhere.

[+] theandrewbailey|3 years ago|reply

> Expose multiple interpreters to Python code

> Implement PEP 554

> PEP 554 - Multiple Interpreters in the Stdlib

That's going to be fun. Why fight the GIL when multithreading, when you can just get around it with more interpreters?

[+] fny|3 years ago|reply

How is this different from multiprocessing? The examples looks like a complete nightmare...

    interp = interpreters.create()
    interp.run(tw.dedent("""
        import some_lib
        import an_expensive_module
        some_lib.set_up()
        """))
    wait_for_request()
    interp.run(tw.dedent("""
        some_lib.handle_request()
        """))

I'm actually shocked this is even being contemplated. We've regressed to evaling?

[+] simonw|3 years ago|reply

I think this is so smart. The main thing holding back replacement of the GIL at the moment is that there is a VAST existing ecosystem of Python packages written in C/etc that would likely break without it.

Multiple interpreters with their own GIL keep all of that existing code working without any changes, and mean we can run a Python program on more than one CPU at the same time.

[+] DannyBee|3 years ago|reply

It comes at a cost, of course. You don't really have shared memory state, which is often easiest to conceptually think about.

So you are just transforming the problem into a data sharing problem between interpreters, which requires careful thought on both the language side for abstractions, and the consumer side to use right.

It also makes the tooling and verification much harder in practice - for example, you aren't reasoning about deadlocks in a single process anymore, but both within a single process and across any processes it communicates with.

At an abstract level, they are transformable into each other. At a pragmatic level, well, there is a good reason you mostly see tooling for single-process multi-threaded programs :)

[+] isthisthingon99|3 years ago|reply

If all global state is made thread safe and, then whether threads are subinterpreters or a single interpreter is conceptually irrelevant and probably easier to implement.

[+] Whitespace|3 years ago|reply

I'm glad to see this as an outline, which is how I structure most of my project work. It can be hard for others to follow, but it's very concise and scannable (just read the first indentation level for the top-level idea).

To paraphrase Adam Savage from his excellent book, Every Tool's a Hammer, lists [of lists] are very powerful way to tame the inherit complexity of any project worth doing.

[+] d0mine|3 years ago|reply

You might like creating outline using Org mode (the ultimate hammer) https://www.youtube.com/watch?v=VcgjTEa0kU4&t=344

[+] crazytalk|3 years ago|reply

Looking forward to runnable code this time. Most of us are old enough by now to remember many project plans just like this one. Fool me once..

[+] Twirrim|3 years ago|reply

This project has already landed improvements in 3.10, and some much bigger improvements in 3.11. This work for 3.12 is "just" a continuation of that excellent effort:

https://www.phoronix.com/review/python-311-benchmarks/4

[+] nedbat|3 years ago|reply

Have you followed the 3.11 performance improvements by the same group? It's ~25% faster than 3.10.

[+] jupp0r|3 years ago|reply

> Per-interpreter isolation for extension modules

This will break many modules. Basically any that use static variables, which is done pretty much everywhere.

[+] oblvious-earth|3 years ago|reply

Yes this would be a challenge for extension modules to implement support for this. Here is a discussion between the core dev and the numpy team: https://mail.python.org/archives/list/numpy-discussion@pytho...

It's going to be a bit of a chicken and egg problem, core Python will need to prove it's worthwhile for extension devs to implement, core Python will struggle without support from extension devs. We shall see.

[+] jensus|3 years ago|reply

If it's static would it not get it's own allocation within each of the isolated interpreters?

[+] judge2020|3 years ago|reply

Is this from an official Python Foundation group? Weird that it's not under the main `python` github org.

[+] rovr138|3 years ago|reply

> Guido van Rossum edited this page yesterday · 13 revisions

Gotta admit. That sounds pretty official

[+] LVB|3 years ago|reply

Some background on the project: https://github.com/faster-cpython/ideas/blob/main/FasterCPyt...

[+] sanxiyn|3 years ago|reply

Not exactly. I would describe this to be from Microsoft faction of Python Software Foundation. So yes, some members of Python Software Foundation (mainly Microsoft employees) are behind this, but not all members are.

[+] __s|3 years ago|reply

Python core developers are behind it at least

[+] qeternity|3 years ago|reply

It’s a bit frustrating to see the first item related to parallelism and the GIL. Anybody doing parallel compute in Python has long since worked around these issues. IMHO Python needs better single threaded performance first, and then once all the juice has been squeezed from that lemon, we can sit down and get serious about improving multi threaded ergonomics.

[+] staticassertion|3 years ago|reply

I don't really use Python if I can help it, but I'm still really glad to see people working on this. Whether I like it or not Python will probably always be some part of my job and I really appreciate that there's finally some focus on it getting faster that isn't just "write that part in C".

[+] lsofzz|3 years ago|reply

A "faster" CPython is a decade plus minus four years old story ;_;

I'll check back in a decade.

[+] jokoon|3 years ago|reply

I wonder if python could possibly be as fast as js if enough money was spent on it.

[+] _joel|3 years ago|reply

It's not a case of money imho, it's a case of it's a juggernaut of a userbase and ecosystem that moves very slowly and implementing improvements to execution times (generally) are intremental changes, not paradigm shifts as they make backwards compatability a nightmare/impossible.

I mean 2.x is still in the wild and some companies provide support for it, still!

[+] miohtama|3 years ago|reply

This is possible, but it would need some backwards incompatible in the object model. We still are likely to see Python 4 on one day. People are still remembering the pain Python 3 transition caused.

[+] IshKebab|3 years ago|reply

Unlikely. Python has lots of features that were added without any thought to how to make them run fast - it simply wasn't a goal. As a result Python includes a ton of dynamic features that make it really hard to optimise.

[+] ahrzb|3 years ago|reply

I don’t think so, there are some design choices that make the two quite different. This „dictionaries all the way down“ approach has a cost.

[+] robertlagrant|3 years ago|reply

How will this affect things if I have an instance of the built in SQLite? Will it be accessible by multiple GILs at once?

[+] mixmastamyk|3 years ago|reply

Sqlite itself supports concurrent reads but not writes, so Python will likely not improve on that.

[+] remram|3 years ago|reply

Yes, just like you can already access it from multiple threads in the same interpreter right now.

[+] MikeYasnev007|3 years ago|reply

Perhaps it is also good to build Artemis project on hn infra. Articles and discussions. I will create mirror on habr

[+] MikeYasnev007|3 years ago|reply

I will review this tech during this week also as jq tools

[+] melonrusk|3 years ago|reply

Are there any plans to remove threading from python ?

A year or two ago I read up on the various efforts to make a fast, more parallel CPython, and one of the core underlying problems seemed to be the use of machine threads, resulting in a very high locking load as the large (potentially unlimited) number of threads attempted to defend against each other.

Letting an operating system run random fragments of your code at random times is very much a self-inflicted wound, so I was wondering if the python community has any plans to not do that any more ?

[+] miohtama|3 years ago|reply

Threaded IO code on Python is fine if you are IO limited, as opposite to CPU limited. Most web workloads are like this as they wait database.

[+] sylware|3 years ago|reply

I would seriously consider a risc-v assembly port of a python interpreter.

Removing fanatic compiler abuse is always a good thing. That said, I saw some assembler macro abuse (some assemblers out there have extremely powerful and complex macro pre-processors), then the hard part would be not to abuse the macro pre-processor of the assembler.

I know it is not to make python actually "faster", but to have a python implementation which does not require those grotesquely and absurdely massive compilers, then the SDK stack would be way more reasonable from a technical cost stand point.

[+] unknown|3 years ago|reply

[deleted]

139 comments