Speaking as a compiler guy who has had a hand in a few successful commercial JITs: the only reason he thinks these languages aren't slow is that they haven't yet reached the limits of making the JIT faster vs. making the program faster.
Yes, it's true that the languages are not slow in the sense that better optimization strategies can take care of most situations.
As a compiler author, one can do things like profile types/trace/whatever, and deoptimize if you get it wrong. You can do a lot. You can recognize idioms, use different representations behind people's back, etc.
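A toy illustration of that guard-and-deoptimize pattern, faked with Python closures rather than machine code (all names here are made up for illustration; no real JIT works at this level):

```python
# Sketch of profile-guided specialization with a deoptimization fallback.
# A real JIT emits guarded machine code; here we fake it with closures.

def generic_add(a, b):
    # The slow, handles-everything path (full dispatch, coercion, etc.).
    return a + b

def make_specialized_add(profiled_type):
    """Return a fast path guarded by the type observed during profiling."""
    def specialized(a, b):
        # Guard: check that the speculation still holds.
        if type(a) is profiled_type and type(b) is profiled_type:
            return a + b              # fast path: types are known
        # Deoptimize: fall back to the fully generic operation.
        return generic_add(a, b)
    return specialized

add_ints = make_specialized_add(int)
print(add_ints(2, 3))        # fast path -> 5
print(add_ints("a", "b"))    # guard fails, deoptimized path -> "ab"
```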
But all those things take time that is not spent running your program. On average, you can do pretty well. But it's still overhead. As you get farther along in your JIT, optimization algorithms get trickier and trickier, your heuristics, more complex.
You will eventually hit the wall, where optimizing some piece of code costs more time JIT'ing than the optimization saves in real work.
This happens to every single JIT, of course.
This is why they try to figure out which code to optimize.
But even then, you may find there is too much of it.
Because of this, the languages are slower; it's just the overhead of better JIT algorithms, not slower code. In practice, you hope you can optimize enough code well enough that nobody cares, because the Ruby code takes 8ms and the C code takes 5ms.
For example: almost all of the allocations and copying can be optimized away, but depending on the language, the algorithms to figure out what you can do safely may be O(n^3).
Also, PyPy is still pretty young in its life cycle (in this iteration of PyPy:P) for folks to say that they can make stuff much faster if they only had a few things.
It really needs a very large set of production apps being run by a very large set of folks for quite a while to see where the real bottlenecks still are.
Past a certain point, you run out of optimization algorithm bullets. The way compilers get the last 20% is by tuning the algorithms for 10 years.
Of course, I'm not trying to slag on PyPy; I think they've done an amazing job of persevering through multiple rewrites to get somewhere that seems to be quite good now. I'm just a little wary of a fairly young JIT saying that all big performance problems fall into a few categories.
Interesting point of view. The problem is well known in compiler construction ("Proebsting's Law", though it says it's more like 18 years instead of 10).
The issue with benchmarks is surely well known, also by the PyPy authors; I wonder what the biggest application is that they have benchmarked or that runs on PyPy.
Your point on the JIT compiler interrupting program execution is certainly valid, too, but not necessarily so. One could easily do the code generation in a separate background thread and let execution switch over only if necessary. But, as you have already said, a latency issue certainly exists. This is one of the cases where interpreters usually have a leg up, and there are promising ways of optimizing interpreters.
We're in no shortage of production code. The problems are quite trivial - two people doing the same thing might have different stuff in mind (so you cannot have a heuristic that works for everyone) - and there's the fact that we're running on a shoestring budget compared to other JIT-for-dynamic-language projects. We don't even have 3 people full time.
As for the "single thing" - it's just the next thing on the infinite list of things that can be optimized better. Having a better assembler backend would be a good (but smaller) win, etc. etc. For now and for quite a bit into the future, it's clear what to do to make X, Y, or Z faster.
I develop a compiler for ActionScript 3. Though it’s not a great language, it does have a few distinct advantages over its cousin JavaScript. First, type annotations. Obviously the more knowledge you have about the structure of the program, the better you can help it run well. Having a class model instead of a prototype model also helps—much as I like working with prototypes—because you can easily eliminate late binding (“hash lookups”) and allocations, two of the performance killers mentioned in the slides. The operative word there is easily. You can analyse and JIT all you want, but you cannot feasibly solve at runtime the fundamental constraints imposed by the language.
Say a compiler author needs twice as much effort and cleverness to make programs in language X run fast than for language Y. That means that—all other things being equal—implementations of X will be twice as slow as Y for the same basic quality of implementation.
> For example: almost all of the allocations and copying can be optimized away, but depending on the language, the algorithms to figure out what you can do safely may be O(n^3).
... and this is pretty much his point: he can keep optimizing, but the moment you start passing complex objects around and copying them all over the place, instead of passing raw buffers around and operating on them in place, you've massively raised the bar in terms of the complexity of the optimizations needed.
So do the JIT optimization algorithms depend on popular conventions and patterns in how people write code in a language? Like in JS there are bad language features people avoid, and certain patterns for how to code something. If such things changed, would the optimizations start failing? I guess I'm wondering if speed is somewhat related to trends in that language.
* http://en.wikipedia.org/wiki/Deforestation_%28computer_scien...

* http://www.haskell.org/haskellwiki/Short_cut_fusion

Deforestation is basically eliminating intermediate data structures, which is similar to what the "int(s.split("-", 1)[1])" versus "atoi(strchr(s, '-') + 1)" slides are about. If you consider strings as just lists of characters, then it's basically a deforestation problem: the goal is to eliminate all the intermediate lists of lists that are constructed. (It's something of a peculiar case though, because in order to transform into the C code you need to not only observe that indexing an rvalue via [1] and throwing the rest away means that the list doesn't have to be constructed at all, but you also need to allow strings to share underlying buffer space—the latter optimization isn't deforestation per se.)
I don't know if there's been much effort into deforestation optimizations for dynamic languages, but perhaps this is an area that compilers and research should be focusing on more.
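To make the slides' example concrete: the programmer can do this deforestation by hand in plain Python today, trading the intermediate list for one slice (the value `"foo-42"` here is just an invented example input):

```python
s = "foo-42"

# Idiomatic version: split() allocates a two-element list plus two new strings.
v1 = int(s.split("-", 1)[1])

# Manual "deforestation": find the dash, take one slice, no list at all.
v2 = int(s[s.index("-") + 1:])

assert v1 == v2 == 42
```

The point of deforestation as a compiler optimization is that nobody should have to write the second version by hand.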
On another minor note, I do think that the deck is a little too quick to dismiss garbage collection as an irrelevant problem. For most server apps I'm totally willing to believe that GC doesn't matter, but for interactive apps on the client (think touch-sensitive mobile apps and games) where you have to render each frame in under 16 ms, unpredictable latency starts to matter a lot.
Automatic deforestation can remove intermediate results in a pipeline of computation, but it can not rewrite a program that is based around the querying / updating of a fixed data structure to use an efficient imperative equivalent throughout.
As for GC, it would be nice to have good real time GCs in runtimes.
It's how you write fast algorithms in general, in any programming language. Minimize the number of reads and writes per iteration/recursion.
In higher-level programming languages, it's just a bit harder to control the number of reads and writes because you're working at several layers of abstraction above them, and are concerned with solving higher-level problems. Use the language that provides the appropriate level of abstraction for the problem you're trying to solve.
I'm almost never waiting on my python code. I'm waiting on network or disk or database or joe to check in his changes or etc.
I'm sure there are people who do wait. But that's why numpy, C extensions, all the PyPy, Psyco, and similar things exist.
Python and more broadly "scripting" languages are for speed of development. Something else can take on speed of execution, making it faster than 90% of people need it to be.
It's not an unresolved question whether idiomatic Python is slower than idiomatic C/C++ for solving comparable problems. Python is much, much slower than C.
Speed in Python (or Ruby, or JS) isn't a big deal... until it is. When that happens, would you rather have to switch over to C and glue the resulting binary in (assuming you're not using JS, in which case you're just SOL), or would you rather have a high performance API at your fingertips for optimization when you need it?
>I'm almost never waiting on my python code. I'm waiting on network or disk or database or joe to check in his changes or etc.
Meh, MEH. That's because you don't do anything involved with your Python code.
>I'm sure there are people who do wait. But that's why numpy, C extensions, all the PyPy, Psyco, and similar things exist.
That they HAVE to exist could also be considered a sad state of affairs though. With a faster language you would just use the language, not external extensions and tricks.
Back when I wanted to investigate the numeric performance of v8 I wrote a Runge-Kutta integrator + Lorenz attractor in C and in JavaScript as a simple-but-not-entirely-trivial benchmark. I was actually pretty impressed with how fast the v8 version was. On the downside, it's fairly non-idiomatic js and not that much nicer to look at than the C. Doing a million steps on my machine takes 0.65 seconds in node.js v0.8.4, 0.41 seconds in C compiled with gcc -O0, and 0.13 seconds with gcc -O3. Here is the code if anyone is interested: https://gist.github.com/anonymous/5066486. Note that it's not commented, not thread-safe, and doesn't free memory, so use at your own risk :)
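For comparison, this is roughly what that kind of benchmark looks like in pure Python (my own sketch of RK4 on the Lorenz system, not the linked code; parameters are the standard Lorenz values):

```python
# RK4 integration of the Lorenz system -- a small numeric benchmark sketch.

def lorenz(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz ODEs."""
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)

def rk4_step(state, dt):
    """One classic fourth-order Runge-Kutta step."""
    def add(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))
    k1 = lorenz(state)
    k2 = lorenz(add(state, k1, dt / 2))
    k3 = lorenz(add(state, k2, dt / 2))
    k4 = lorenz(add(state, k3, dt))
    return tuple(s + dt / 6 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

state = (1.0, 1.0, 1.0)
for _ in range(10000):
    state = rk4_step(state, 0.01)
print(state)
```

Note how much allocation the tuple-based style hides: every step builds several small tuples, which is exactly the kind of overhead the deck is about.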
> Back when I wanted to investigate the numeric performance of v8
Straightforward numerical computation really isn't a good JIT benchmark, because numerical code is by far the easiest thing to JIT, and JITted performance is going to be much closer to AOT than in the general case (unless the problem can be vectorized and the AOT compiler is vectorizing; I don't think JITs can usually vectorize).
Great presentation, thank you for making me aware of an aspect of Python performance. One slide struck me as odd - the "basically pythonic" squares() function. I understand it's a chosen example to illustrate a point, I just hope people aren't writing loops like that. You inspired me to measure it
$ cat squares.py
def squares_append(n):
    sq = []
    for i in xrange(n):
        sq.append(i*i)
    return sq

def squares_comprehension(n):
    return [i*i for i in xrange(n)]
$ PYTHONPATH=. python -m timeit -s "from squares import squares_append" "squares_append(1000)"
10000 loops, best of 3: 148 usec per loop
$ PYTHONPATH=. python -m timeit -s "from squares import squares_comprehension" "squares_comprehension(1000)"
10000 loops, best of 3: 74.1 usec per loop
$ PYTHONPATH=. pypy -m timeit -s "from squares import squares_append" "squares_append(1000)"
10000 loops, best of 3: 46.9 usec per loop
$ PYTHONPATH=. pypy -m timeit -s "from squares import squares_comprehension" "squares_comprehension(1000)"
100000 loops, best of 3: 8.67 usec per loop
I'm curious to know how many allocations/copies a list comprehension saves in CPython/PyPy. However I wouldn't begin to know how to measure it.
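One rough way to probe this in CPython 3 is the standard-library `tracemalloc` module — it reports traced bytes rather than allocation counts, and the numbers vary across versions, so treat it as a probe rather than an answer (this uses Python 3's `range` instead of `xrange`):

```python
import tracemalloc

def squares_append(n):
    sq = []
    for i in range(n):
        sq.append(i * i)
    return sq

def measure_peak(fn, n):
    """Return peak traced allocation (in bytes) while running fn(n)."""
    tracemalloc.start()
    fn(n)
    _current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

peak_append = measure_peak(squares_append, 1000)
peak_comp = measure_peak(lambda n: [i * i for i in range(n)], 1000)
print(peak_append, peak_comp)
```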
The creators of Common Lisp knew what Alex is talking about. Lisp is, of course, just as dynamic as Ruby, Python or Javascript, but it exposes lower-level details about data structures and memory allocation iff the programmer wants them.
Features that come to mind include preallocated vectors (fixed-size or growable), non-consing versions of the standard list functions and the ability to bang on most any piece of data in place. There are fairly few situations in which a CL program can't come within a factor of 2 or 3 of the performance of C.
http://www.cs.berkeley.edu/~fateman/papers/lispfloat.pdf
Great bit of slides. Straight and to the point. If you've ever ventured under the hood of Python you'd see this in the code. If you've ever had to optimize the bejeesus out of code in C++ or C, you'd know exactly the kinds of things he's talking about.
I don't have time to read all the comments now (thanks for all the interest though!). I just want to say I think when the video comes out it'll answer a lot of questions people are having.
I'm looking forward to the video. I'm also interested in proof of the "lame myths" claims, or links to debunkings of those myths, etc. And also if you have rants about those If There's Time topics in your last slide, I'd like to read those too. Thanks!
Yes, thank god. I almost skipped it when I saw it was a slide deck, until I saw the notes. I hate it when people link to a slide deck with no notes. It's almost completely useless.
Completely agree. APIs are so important for many optimizations to pull off.
I'd really like to use a lot more buffer()/memoryview() objects in Python. Unfortunately many APIs (e.g. sockets) won't work well with them (at least in Python 2.x. Not sure about 3.x).
So we ended up with tons of unnecessary allocation and copying all over the place. So sad.
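For what it's worth, slicing a `memoryview` really does avoid the copy that slicing `bytes` makes — which is exactly the property the buffer-friendly APIs would exploit:

```python
data = bytes(range(256)) * 16    # a 4 KiB buffer

# Slicing bytes copies; slicing a memoryview does not.
view = memoryview(data)
chunk = view[100:200]

# The slice is still a window onto the original buffer...
assert chunk.obj is data
# ...and only converting it back to bytes makes a copy.
assert bytes(chunk) == data[100:200]
```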
As a C/C++ programmer I find these slides kind of amusing... These languages are popular because they make things simpler, and his suggestions may very well get a nicely jit'd language on par with C, but I suspect you'll then have the same problems C does (complexity).
I don't think that added complexity invalidates the usefulness of high performance APIs in high level languages. The point would not be to write all of your code to be highly performant (that would be premature optimization), but to optimize the hot spots.
Currently, if you want to really optimize a hot spot in, say, Python, your only real option is to write that part in C. Then you have all the additional complexity of gluing that into your Python program, along with portability concerns and a more complex build process. It would be so much easier if there were a way to sacrifice local simplicity and idiom for performance while still staying in the language.
And in the case of JS, I'm not sure you have much in the way of options at all for optimizing hot spots to reduce allocations. Maybe you could write it in C and compile it to JS via emscripten? I don't know if that would even help currently, but maybe if asm.js takes off. But once again, wouldn't you rather sacrifice a small amount of elegance for performance rather than switching languages?
Depending on the application, speed may matter only for 5% (or less) of your code. If you can write that 5% in Ruby or Python, same as you wrote the other 95% in, you save a lot of difficulty compared to calling out to a C library — even if the optimized part of the code is a bit complex.
There's a lot to be said for writing everything out in a simple manner for the first pass, then profiling and adding complexity to the places where there's a big benefit from it. And if something goes wrong what you get is bad performance, not a segfault.
One main thought on this topic -- languages like Haskell and Lisp also have very poor support for direct memory control, but tend to be viewed (perhaps inaccurately?) as much closer in performance to C than Python/Ruby.
Haskell and languages in the ML family have a lot of opportunities for elaborate static analysis, which often allows the compiler to be quite clever about optimizing the resulting programs. As one example, the GHC Haskell compiler uses loop fusion to combine multiple passes over a list into a single pass with no intermediate copies of the list produced. Consequently, Haskell code like
map f (map g (map h someList))
is going to involve allocating exactly one list of the same size as someList, while a direct translation into Python
map(f, map(g, map(h, someList)))
is going to involve the creation of several intermediate lists.
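(A small caveat: in Python 3, `map` became lazy, so the chained version at least avoids materializing the intermediate lists — though each element still flows through three separate calls rather than one fused function, so it isn't true fusion:)

```python
def h(x): return x + 1
def g(x): return x * 2
def f(x): return x - 3

some_list = [1, 2, 3]

# Python 3's map is lazy: no intermediate list is built here...
pipeline = map(f, map(g, map(h, some_list)))

# ...elements only flow through h, g, f one at a time when consumed.
result = list(pipeline)
print(result)   # [1, 3, 5]
```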
No, Haskell really does have significantly better performance than Python or Ruby. Same with some Lisp variants as well--there was even a research whole-program optimizing compiler for Scheme that apparently sometimes beat hand-optimized C (Stalin Scheme).
So one way to improve performance is simply by having a good compiler. And GHC, at least, is a very good compiler.
Also, Haskell support for direct memory control is not that bad. In fact, in some ways, it's better than even Java--you can have your own unboxed data types which essentially act like structs, for example. This also means that you can have unboxed arrays of more than just primitive types.
Haskell also does some very clever things with both the heap and the stack, but I'm not familiar enough with its internals to comment. My understanding is that Haskell makes heap allocation much cheaper and has a GC optimized for handling lots of small allocations (as you would expect for a functional language).
Ultimately, the point is that the question is fairly nuanced and you won't be able to pin down a single language or implementation feature that uniquely determines performance.
Looking at CPython and the bytecode it uses, it's not very hard to see why it would be slow. It's basically designed as a reference implementation, with only very tame optimizations.
My own piece of feedback, based on my experience: the slides were good. But as others have said, JIT is not all rosy. In V8 and Dart and .NET, code gets compiled to native code as soon as possible. I think that's the best case scenario in general. You then don't have to guess as much.
The author didn't mention method dispatch. I think it's an issue for many languages. In Dart, they tried to optimize it in the specification by mostly eliminating the need to change methods at runtime. In Ruby, I watched a video by one of the core Ruby developers, and he said that method dispatch can be very complicated, requiring up to 20 steps to resolve a call.
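A small userland echo of that dispatch cost is the classic CPython trick of hoisting the method lookup out of a hot loop — otherwise `sq.append` is re-resolved on every iteration:

```python
def squares_slow(n):
    sq = []
    for i in range(n):
        sq.append(i * i)      # attribute lookup on every iteration
    return sq

def squares_hoisted(n):
    sq = []
    append = sq.append        # resolve the bound method once
    for i in range(n):
        append(i * i)         # plain local-variable call in the loop
    return sq

assert squares_slow(10) == squares_hoisted(10)
```

The speedup is modest in CPython and mostly irrelevant under a good JIT, but it shows how much work a dynamic language buries inside an innocent-looking method call.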
As important as getting the best performance out of programs is to get the programs created in the first place. That's why I'm against shying away from larger codebases. I'm in favor of OO programming exactly because I think getting things done comes first, even if that could complicate the implementation of the toolset. And OO is all about layers of abstractions that bring more performance costs with them.
That said, I absolutely abhor type annotations. They make code hideous and decrease the opportunities for experimentation. Instead of reading a + b = c algorithms, you may need to parse A a + B b = C c source code.
In Dart we have Optional Types. But the core developers are fond of type annotations, so most samples they post come with them. I take relief in being able to omit type annotations while experimenting, researching and ultimately prototyping. Although in a way I feel like a rebel in the community for this disregard. Thankfully there is this chance to share a community with them.
Reading the part where you say you don't like adding heuristics to help programs go faster reminded me of adding types to them, even if they are mostly disregarded as in Dart.
Then again, not all "dynamic languages" are the same. Some are truly dynamic with eval and runtime method changes. Others, not so much. Sometimes the tradeoffs allow for other kinds of gains that could come into play like when deploying. So there is a lot more to it than just getting the algorithms correct.
The example he gives for strings could be optimized to near the efficiency of the C version by a sufficiently smart compiler:
int(s.split("-", 1)[1])
If the JIT knows that s is the builtin string type and the split() method has not been overridden [1], it can speed this up by using "pseudo-strings," where a pseudo-string is an index and length into another string. This would require only O(1) time and space.
Garbage-collecting pseudo-strings would be an interesting exercise, but I'm sure it's a solvable problem [2] [3].
[1] If the preconditions for your optimization don't hold, you can always fall back to interpreting it. As noted by the speaker, this sort of logic is already a critical part of many JIT's including Pypy.
[2] The problem is actually GC'ing the parent. When the parent string is gc'ed, you have to compact the orphan strings to reclaim the remaining space; otherwise it'll be possible to write user code that uses a small finite amount of memory in CPython but has an unbounded memory leak in your compiler.
[3] You can avoid the trickiness in [2] if the parent string can be proven to outlive its children, which is the case in this example. You could probably optimize a lot of real-world code, and have an easier time implementing the compiler, if you only used pseudo-strings when they could be proven to be shorter-lived than the parent. As a bonus, this partial GC would build some infrastructure that could be recycled in a general implementation.
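A userland sketch of the idea (a hypothetical `PseudoStr` class invented here for illustration, not any real JIT's representation) — just an index and length into a parent string, materialized only on demand:

```python
class PseudoStr:
    """A view (parent, start, length) into another string; O(1) to create."""
    __slots__ = ("parent", "start", "length")

    def __init__(self, parent, start, length):
        self.parent = parent
        self.start = start
        self.length = length

    def __str__(self):
        # Materialize a real substring only when one is actually demanded.
        return self.parent[self.start:self.start + self.length]

    def __int__(self):
        return int(str(self))

s = "foo-42"
i = s.index("-") + 1
tail = PseudoStr(s, i, len(s) - i)   # no substring allocated yet
assert int(tail) == 42
```

The JIT version would do this transparently behind the builtin string type, with the GC subtleties from [2] and [3] handled underneath.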
Kind of a poorly-named deck. It's really about why programs use features of these languages that end up causing poor performance relative to C, rather than why the individual VMs themselves are slow. It's no surprise that trading the byte-precision of C for the convenience of a garbage collector and heap-allocated data structures results in a performance decrease.
Dynamically-typed languages are often easier to program in, but require more copying (and memory allocation) as a result. Hash tables are heap-allocated and have to be garbage collected, but they're flexible - something you don't get with structs. Allocating and freeing memory has a cost, and that can add up quickly. Your primary line of optimization in most of these languages is "avoid the GC", which really boils down to "don't allocate more than you need to", which is sound advice in every language, scripting or otherwise.
Interesting slides, and good point about having better APIs.
Perhaps I'm nitpicking, but with a function called `newlist_hint`, I struggle to see how anybody would adopt it. I had to go back to the slides maybe 3 times, and I still don't remember the name of this function... Those APIs must have the most obvious, logical and simple names.
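For context, if I remember right PyPy exposes this hint through its `__pypy__` module — treat the exact location as an assumption — and a portable wrapper can degrade to a plain list on CPython:

```python
try:
    from __pypy__ import newlist_hint   # PyPy only: preallocates capacity
except ImportError:
    def newlist_hint(sizehint):
        # CPython fallback: the hint is simply ignored.
        return []

def squares(n):
    sq = newlist_hint(n)    # hint the final size up front
    for i in range(n):
        sq.append(i * i)
    return sq

print(squares(5))   # [0, 1, 4, 9, 16]
```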
I have a few comments about some of the slides, feel free to correct any misunderstandings.
Dictionary vs Object:
Lookups in both data structures are O(1); the difference is the hashing cost (and an additional memory lookup on the heap) vs. a single memory lookup on the stack (one line of assembly).
Squares list:
> ... so every iteration through the list we have the potential need to size the list and copy all the data.
This is no different from std::vector, which has an amortized cost of O(1) for a push_back().
It's not going to be as fast as C, but I'd also argue for a generator version instead:
def squares(n):
    return (i*i for i in xrange(n))
One of the main reasons people choose Python is for expressiveness and not manually managing memory, although pre-allocation does seem like a good idea.
People started to create computer languages without caring too much about the target processor's opcodes (because back then processors were just getting faster with time) and focused more on programmer convenience, and wild beasts like Python and Ruby were born...
C is fast because it was created with processor awareness in mind... pretty simple...
These days kids are all about trying to create more and more crappy convenient-syntax languages... and they get worried when the languages don't scale? For what computer did they design the language? One from Venus?
Nobody should be doing any serious software in Python or Ruby... it's such a waste of talent. Use them for education, for fun, or for the things they are best at, which is not the system/plumbing side of things.
[+] [-] wheaties|13 years ago|reply
[+] [-] kingkilr|13 years ago|reply
I don't have time to read all the comments now (thanks for all the interest though!). I just want to say I think when the video comes out it'll answer a lot of questions people are having.
[+] [-] jholman|13 years ago|reply
[+] [-] meunier|13 years ago|reply
[+] [-] NateDad|13 years ago|reply
[+] [-] riobard|13 years ago|reply
I'd really like to use a lot more buffer()/memoryview() objects in Python. Unfortunately many APIs (e.g. sockets) won't work well with them (at least in Python 2.x. Not sure about 3.x).
So we ended up with tons of unnecessary allocation and copying all over the place. So sad.
[+] [-] dicroce|13 years ago|reply
[+] [-] mistercow|13 years ago|reply
Currently, if you want to really optimize a hot spot in, say, Python, your only real option is to write that part in C. Then you have all the additional complexity of gluing that into your Python program, along with portability concerns and a more complex build process. It would be so much easier if there were a way to sacrifice local simplicity and idiom for performance while still staying in the language.
And in the case of JS, I'm not sure you have much in the way of options at all for optimizing hot spots to reduce allocations. Maybe you could write it in C and compile it to JS via emscripten? I don't know if that would even help currently, but maybe if asm.js takes off. But once again, wouldn't you rather sacrifice a small amount of elegance for performance rather than switching languages?
[+] [-] graue|13 years ago|reply
[+] [-] Symmetry|13 years ago|reply
[+] [-] CJefferson|13 years ago|reply
[+] [-] andolanra|13 years ago|reply
[+] [-] tikhonj|13 years ago|reply
So one way to improve performance is simply by having a good compiler. And GHC, at least, is a very good compiler.
Also, Haskell's support for direct memory control is not that bad. In fact, in some ways it's better than even Java's--you can have your own unboxed data types which essentially act like structs, for example. This also means that you can have unboxed arrays of more than just primitive types.
Haskell also does some very clever things with both the heap and the stack, but I'm not familiar enough with its internals to comment. My understanding is that Haskell makes heap allocation much cheaper and has a GC optimized for handling lots of small allocations (as you would expect for a functional language).
Ultimately, the point is that the question is fairly nuanced and you won't be able to pin down a single language or implementation feature that uniquely determines performance.
[+] [-] revelation|13 years ago|reply
[+] [-] estavaro|13 years ago|reply
The author didn't mention method dispatch, which I think is an issue for many languages. In Dart, they tried to optimize it in the specification itself, mostly by eliminating the need to change methods at runtime. In a video by one of the core Ruby developers, he said that method dispatch in Ruby can be very complicated, requiring up to 20 steps to resolve a call.
As important as getting the best performance out of programs is getting the programs written in the first place. That's why I'm against shying away from larger codebases. I'm in favor of OO programming exactly because I think getting things done comes first, even if that complicates the implementation of the toolset. And OO is all about layers of abstraction that bring more performance costs with them.
That said, I absolutely abhor type annotations. They make code hideous and decrease the opportunities for experimentation. Instead of reading a + b = c algorithms, you may need to parse A a + B b = C c source code.
In Dart we have Optional Types. But the core developers are fond of type annotations, so most samples they post come with them. I take relief in being able to omit type annotations while experimenting, researching and ultimately prototyping, although in a way I feel like a rebel in the community for this disregard. Thankfully I can still share a community with them.
Reading the part where you say you don't like adding heuristics to help programs go faster reminded me of adding types to programs even when they are mostly disregarded, as in Dart.
Then again, not all "dynamic languages" are the same. Some are truly dynamic with eval and runtime method changes. Others, not so much. Sometimes the tradeoffs allow for other kinds of gains that could come into play like when deploying. So there is a lot more to it than just getting the algorithms correct.
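As a minimal sketch of the "truly dynamic" behavior mentioned above (the `Greeter` class is hypothetical, just to illustrate why dispatch caches are hard):

```python
class Greeter:
    def hello(self):
        return "hi"

g = Greeter()
before = g.hello()

# replacing the method at runtime: any dispatch cache keyed on the
# class must now be invalidated, which is what makes lookup expensive
Greeter.hello = lambda self: "hello"
after = g.hello()
```

A language that forbids this kind of runtime mutation can resolve the call once and cache it forever; one that allows it has to guard every cached lookup.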
[+] [-] csense|13 years ago|reply
Garbage-collecting pseudo-strings would be an interesting exercise, but I'm sure it's a solvable problem [2] [3].
[1] If the preconditions for your optimization don't hold, you can always fall back to interpreting it. As noted by the speaker, this sort of logic is already a critical part of many JITs, including PyPy.
[2] The problem is actually GC'ing the parent. When the parent string is gc'ed, you have to compact the orphan strings to reclaim the remaining space; otherwise it'll be possible to write user code that uses a small finite amount of memory in CPython but has an unbounded memory leak in your compiler.
[3] You can avoid the trickiness in [2] if the parent string can be proven to outlive its children, which is the case in this example. You could probably optimize a lot of real-world code, and have an easier time implementing the compiler, if you only used pseudo-strings when they could be proven to be shorter-lived than the parent. As a bonus, this partial GC would build some infrastructure that could be recycled in a general implementation.
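A rough sketch, in Python for illustration, of the easy case in [3] (the `StrView` name and interface are invented here):

```python
class StrView:
    """Hypothetical 'pseudo-string': a zero-copy slice that keeps a
    reference to its parent, so the parent's buffer is guaranteed to
    outlive the view and no compaction is ever needed."""
    __slots__ = ("_parent", "_start", "_stop")

    def __init__(self, parent, start, stop):
        self._parent, self._start, self._stop = parent, start, stop

    def materialize(self):
        # copy only when an actual independent string is demanded
        return self._parent[self._start:self._stop]
```

The hard case in [2] is exactly what this dodges: because the view holds a reference, the parent can never be collected first, at the cost of pinning the whole parent buffer in memory.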
[+] [-] cheald|13 years ago|reply
Dynamically-typed languages are often easier to program in, but require more copying (and memory allocation) as a result. Hash tables are heap-allocated and have to be garbage collected, but they're flexible - something you don't get with structs. Allocating and freeing memory has a cost, and that can add up quickly. Your primary line of optimization in most of these languages is "avoid the GC", which really boils down to "don't allocate more than you need to", which is sound advice in every language, scripting or otherwise.
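As a hypothetical sketch of "don't allocate more than you need to" (the function names and data are mine, not from the comment):

```python
def normalize_naive(rows):
    # allocates a brand-new inner list on every iteration: GC pressure
    out = []
    for row in rows:
        total = sum(row)
        out.append([v / total for v in row])
    return out

def normalize_inplace(rows):
    # reuses the existing lists: no new allocations in the hot loop
    for row in rows:
        total = sum(row)
        for i, v in enumerate(row):
            row[i] = v / total
    return rows
```

The in-place version trades purity for fewer allocations, which is the same trade you'd make in any garbage-collected language.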
[+] [-] gingerlime|13 years ago|reply
Perhaps I'm nitpicking, but with a function called `newlist_hint`, I struggle to see how anybody would adopt it. I had to go back to the slides maybe 3 times, and I still don't remember the name of this function... Those APIs must have the most obvious, logical and simple names.
[+] [-] wting|13 years ago|reply
Dictionary vs Object:
Lookups in both data structures are O(1); the difference is the hashing cost (plus an extra memory indirection into the heap) for the dictionary vs a single memory load at a fixed offset for the object (one instruction of assembly).
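A minimal sketch of that distinction (the `Point` class is hypothetical; `__slots__` is the closest CPython gets to a fixed struct-like layout):

```python
class Point:
    # __slots__ gives instances a fixed layout: attribute access becomes
    # an index into a small array instead of a hashed dict lookup
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x, self.y = x, y

p = Point(1, 2)
d = {"x": 1, "y": 2}
# both lookups are O(1), but d["x"] pays to hash "x" on every access,
# while p.x is resolved to a fixed slot
```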
Squares list:
> ... so every iteration through the list we have the potential need to size the list and copy all the data.
This is no different from std::vector, which has an amortized O(1) cost for push_back().
It's not going to be as fast as C, but I'd also argue for a generator version instead:
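Such a generator version (my guess at what was meant) might look like:

```python
def squares(n):
    # lazy: no list to grow and copy; values are produced on demand
    for i in range(n):
        yield i * i

# consume without ever materializing a list in memory
total = sum(squares(1000))
```

If you only iterate the results once, the list, its resizing, and its copying all disappear entirely.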
One of the main reasons people choose Python is for expressiveness and not manually managing memory, although pre-allocation does seem like a good idea.
[+] [-] oscargrouch|13 years ago|reply
People started to create computer languages without caring too much about the target processor's opcodes (because back then processors were just getting faster on their own) and focused more on programmer convenience, and wild beasts like Python and Ruby were born.
C is fast because it was created with processor awareness in mind. Pretty simple.
These days kids are all about creating more and more languages with crappy, convenient syntax, and then they worry when the languages don't scale. What computer did they design the language for? One from Venus?
Nobody should be doing any serious software in Python or Ruby; it's such a waste of talent. Use them for education, for fun, or for the things they're best at, which is not the systems/plumbing side of things.