PyPy - We need Software Transactional Memory

[+] arohner|14 years ago|reply

Clojure's STM has several advantages going for it:

* Clojure's data structures are purely functional, and built with the STM in mind. Updates and rollbacks to structures are cheap.

* Clojure user-level code has always had access to the STM. Because that's the accepted paradigm, There are an order of magnitude fewer assignments in typical clojure code, greatly simplifying the problem. In typical clojure, one function that uses the STM will call 10 pure functions. In typical python, most of those 10 calls will all mutate local state.

* Clojure requires all STM activity to happen within the dosync block. In python, that would be like requiring the magic to happen within a "with transaction(): ...". Again, this limits the amount of work that needs to happen. The compiler doesn't need to decide whether or not a random function needs to be a transaction; the user will tell the compiler.

* Clojure's STM explicitly only works with functional code. This will likely cause fewer leaky abstractions going forward. PyPy will have to modify all existing data structure classes to work with the new STM, and they won't be able to fix all the library and user code out in the world.

Good luck to them, but I suspect bolting STM into an existing language is much harder than designing it from scratch, similar to building in GC.

[+] exDM69|14 years ago|reply

How does Clojure enforce that there are no side-effects in the transactional code? Is there a macro that "compiles" the transaction somehow?

STM in Haskell is very natural, since Haskell already enforces pure code and side-effecting code. STM and STVar work pretty similarly to IO and IORef or ST and TVar.

[+] masklinn|14 years ago|reply

I think you've misunderstood what they're proposing here. Especially this:

> Clojure requires all STM activity to happen within the dosync block. In python, that would be like requiring the magic to happen within a "with transaction(): ...".

Under the proposal, all Python code would run under STM, only C-API and OS-level codes would run outside STM.

Edit: to clarify, the most common issue with implementing a viable STM (in just about every language apart from haskell) is the interaction between STM-managed objects and non-stm-managed objects. Under Pypy's proposal, there would be no such thing as all objects would be STM-managed.

[+] jerf|14 years ago|reply

This may work, this may not, but some nod to the fact that they've read things like http://queue.acm.org/detail.cfm?id=1454466 (as linked by codevine before me) and have some plan to address the problems would be nice. So far, attempts to bodge STM on to an imperative language that wasn't built with it in mind has not merely been failure, but radical failure. Immense amount of effort by very smart people, followed by the total unusability and unsalvagability of the resulting code.

The fact that only place it has been shown to work is basically Clojure and Haskell needs to be addressed; how are you going to address the apparent fundamental incompatibility between imperative languages and STM? And while you may be able to answer that at the VM level (very tiny transactions implemented between opcodes, basically), I would expect it is absolutely hopeless to allow users access to this. How are you going to address the endless-side-effect replay problem? (Hint: You aren't.)

This is a kill-PyPy level of dangerous decision, and I think everyone on that team ought to demand a very good answer to that question before buying into this.

[+] pradocchia|14 years ago|reply

Until we have a new conceptual breakthrough, multiversion concurrency control (MVCC) [1], which STM represents in this context, is the only solution they have.

Yes it will impose a tax on all mutable data structures, and push people towards a more functional style of programming. This is generally good. It may be a challenge for people who think solely in terms of OOP, but then OOP will evolve in response, and that is also good.

[1] http://en.wikipedia.org/wiki/Multiversion_concurrency_contro...

[+] lloeki|14 years ago|reply

The way I read it, it's a PyPy-the-VM feature, used in RPython, but not a PyPy-the-interpreter feature callable in the Python code interpreted, hence it's not available to users (i.e Python devs)

[+] scott_s|14 years ago|reply

From a PyPy as a practical tool point of view, I agree.

From a PyPy as a research project, I'm excited. I'm skeptical of STM as a synchronization solution as well, but I still want us to explore it. If their approach truly is novel, then maybe there's something there.

[+] pnathan|14 years ago|reply

We've been mashing out Python for the last few months at my job, and we are consistently struggling with the Python GIL performance penalty for threads.

If PyPy can break the GIL, I would anticipate that we'd seriously consider pushing our users toward it.

[+] DasIch|14 years ago|reply

This is not a dangerous decision at all and there is no reason to involve everyone working on PyPy either.

It would not be the first time they tried implementing something and failed either.

[+] joeyespo|14 years ago|reply

Microsoft has been doing extensive research with STM, so their work might be worth looking at along the way. A working STM package (STM.NET) was actually shipped with VS2010, where you can type this:

    Atomic.Do(() => { /* atomic code */ })

Here's some channel 9 videos about their progress, which are in of themselves very interesting:

http://channel9.msdn.com/Shows/Going+Deep/Software-Transacti...

http://channel9.msdn.com/Blogs/Charles/STMNET-Who-What-Why

And here are some research papers published by Tim Harris and the old STM research blog:

http://research.microsoft.com/en-us/um/people/tharris/

http://blogs.msdn.com/b/stmteam/archive/2010/05/12/stm-net-d...

Fascinating stuff. I'm wary of having it be at the core of Python, but very interested to see where PyPy ends up.

[+] arethuza|14 years ago|reply

They also have a file system that can enlist in transactions:

http://en.wikipedia.org/wiki/Transactional_NTFS

[+] wgren|14 years ago|reply

>if [a JVM thread] reads a field of an object, and another thread writes to the same field of the same object, then the program will see either the old value, or the new value, but not something else entirely

Nitpick: unless the field in questions is a "long" or "double". In those cases the operation becomes two bytecodes and you can in fact see something else entirely if you are unlucky. At least it used to be like that.

[+] palish|14 years ago|reply

On 32-bit platforms.

On 64-bit, you won't. =)

[+] codedivine|14 years ago|reply

STM sounds all nice, yet there are few implementations of STM currently that give better performance than a good serial implementation.

[+] cabacon|14 years ago|reply

Then again, hardware transactional memory may be arriving: http://www.eetimes.com/electronics-news/4218914/IBM-plants-t...

[+] exDM69|14 years ago|reply

I hope you have sources and/or benchmarks to support your claims. Bashing a promising future technology without any backing is not constructive.

[+] masklinn|14 years ago|reply

There's the same issue with normal locks: any synchronization primitive is going to incur overhead. That is, in fact, one of the defense heralded by some in the Python community against the "GIL should Go" mobthink.

[+] bfrog|14 years ago|reply

Python will always fail to do any true shared memory concurrency in my mind, so why not go the other route, have concurrency with enforcement of no shared state and instead only allow message passing and selective receive.

Thats all the rage these days anyways as its easier to reason about. Better still it matches distributed computing models much better.

[+] alphaBetaGamma|14 years ago|reply

Would it be possible to only run the STM code when the program is multi-threaded? This would require the program detecting that it is single or multi threaded (would that be difficult?) and might provide a good trade off: either be single threaded and be fast, or use a lot of threads and be 2-10 times slower than Cython, but without the GIL.

[+] true_religion|14 years ago|reply

I'm a little disapointed. PyPy has just recently been becoming a fast attractive alternative to CPython (or Cython compilation).

If they go the STM route and spend years hashing out the details only to be 2x as slow as CPython, then that cuts out a huge number of people who would like to practically use PyPy.

[+] megaman821|14 years ago|reply

I am pretty sure that is 2x slower than PyPy not CPython. So it would basically end up being over 2x faster than CPython.

[+] Niedar|14 years ago|reply

I think he might have meant 2x slower than without STM and not 2x slower than CPython. Also from everything I have read you would still have the option to not use STM at all and just use the GIL which would not come with the extra overhead.

[+] DasIch|14 years ago|reply

How would this be ever an issue? You could simply compile PyPy without STM just as you can compile PyPy without the JIT or with a different GC.

[+] Hoff|14 years ago|reply

STM and locks have some interesting clean-up and (as is cited in the article) you really need for it not to crash and not to deadlock or go all priority-inverted on you.

Message-passing, on the other hand, doesn't need particular locking and crashes are (somewhat) simpler to deal with.

(And most any scheme I've encountered can deadlock, and will usually choose the most inopportune moment to demonstrate this capability.)

There's an older blog post here http://enfranchisedmind.com/blog/posts/thoughts-on-paralleli... that's worth a read.

[+] aidenn0|14 years ago|reply

    With this in mind, it is possible to imagine a whole-
    program transformation that would add locking on every 
    object manipulated in RPython by the interpreter. This 
    would end up in a situation similar to Jython. However, 
    it would not automatically solve the issue of deadlocks, 
    which is avoided in the case of Jython by careful manual 
    placement of the locks. (In fact, being deadlock-free is 
    a global program property that cannot be automatically 
    ensured or verified; any change to Jython can in theory 
    break this property, and thus introduce subtle deadlocks. 
    The same applies to non-atomicity.)

This is just not true. It is true for the general case, but not for this specific case. Remember the pithy solution to the dining philosophers (number all the chopsticks and pick up the lowest numbered one first)?

I don't know the details of the pypy interpreter, but it does seem that given a byte-code that affects several objects, you could just acquire the locks for the objects as ordered by the objects address.

Now the pypy guys are very smart and this is a simple solution, so I'm sure it yields horrible performance, but the fact of the matter is it's not true that "being deadlock-free is a global program property that cannot be automatically ensured" (it is true that it can't be verified, but if you generate all the locking code it can be ensured by strict ordering) and to say otherwise is wrong.

[+] scott_s|14 years ago|reply

You missed this point: and moreover if you need to have accessed the first object before you can decide that you will need to access the second object

That is, he's talking about situations where they don't have full knowledge when entering a critical section which objects will be accessed. Hence, they can't order lock acquisition in such a way to avoid deadlock.

[+] jamwt|14 years ago|reply

We are using Haskell + STM for scaling a major multicore system at Bump, and it's been exceedingly awesome thus far. However, STM happens to suit Haskell's type system-enforced purity particularly well; it's hard to see this being anything but a monumental undertaking in a language like Python.

[+] chalst|14 years ago|reply

Does this proposal entail writing Python code without mutation of non-local state? Isn't this harder in Python than in Haskell or Ocaml?

[+] masklinn|14 years ago|reply

> Does this proposal entail writing Python code without mutation of non-local state?

Pypy using STM should not change the semantics of the Python language or require alterations to code which does not rely on implementation details (code which relies on the GIL for its thread-safety may have to be fixed, it already has to be fixed for Jython compatibility).

> Isn't this harder in Python than in Haskell or Ocaml?

I'd expect STM to be much harder in imperative languages than in functional languages indeed. In fact, I don't think I've heard of a successful STM implementation in imperative languages yet.

On the other hand, I think most tentatives so far have been with STM as a library applied explicitly, and you run into issues of (mutable) objects "crossing" the transaction barrier in imperative languages, issue which "does not" happen in Haskell (there are ways to subvert the type system, but then you're on your own. In normal uses, I think objects under STM live in a different monad, and therefore can't be freely mixed with objects living outside of it) (I believe this issue would happen in OCaml, so on that front OCaml would be little better than imperative languages). This issue should not exist if all of the code runs under STM (ignoring explicit hooks to do so built into the system), and this should allow for "easy" implementation of explicit memory transactions as well (one of STM's promises is composability, so you should be able to compose multiple bytecode-transactions in a single explicit transaction).

[+] surrealize|14 years ago|reply

There's a great, very detailed retrospective on Microsoft's STM implementation efforts here:

http://www.bluebytesoftware.com/blog/2010/01/03/ABriefRetros...

[+] albertzeyer|14 years ago|reply

After reading this, I thought a bit about other possible solutions. Esp. more like Jython is doing it but in an automatic way. I came up with this: https://github.com/albertz/automatic_object_locking

[+] albertzeyer|14 years ago|reply

Afaik, all builtin functions are considered to be atomic.

Wouldn't it be enough to have a mutex for each object and when such a builtin function is called, all objects which are passed as parameters (including and esp. `self`) are locked?

[+] exDM69|14 years ago|reply

No, that is not enough. If you want to implement atomicity for operations that touch multiple objects, a simple mutex-per-object approach just won't cut it. If you start holding multiple mutexes at once, there will be deadlocking if they are not locked in a consistent order.

Not to mention that if every object has a mutex associated, the performance will be horrible. Even if the mutexes are not contended, there will be so many atomic compare and swap operations that the interprocessor memory bus will simply die.

If your solution would work, every method in Java would be synchronized by default.

[+] scott_s|14 years ago|reply

That's essentially what Jython does. He discusses why he does not want to go in that direction in the article.

63 comments