If you're using Python for performance-critical applications, simple use of the built-in ctypes module (which uses libffi behind the scenes) can make a world of difference. There's a modest call overhead, but basically you're getting near-C performance with proper use of ctypes.
Coincidentally, one of the usual examples given in the ctypes howto is a Point structure, just like in this post. It's simple:
    from ctypes import *

    class Point(Structure):
        _fields_ = [("x", c_int), ("y", c_int)]
Then you can use the Point class the same way you'd use a regular Python class:
    p = Point(3, 4)
    assert p.x == 3
    assert p.y == 4
Really, taking a half-day to learn how to use ctypes effectively can make a world of difference to your performance-critical Python code when you need to stop and think about data structures. Actually, if you already know C, it's less than a half-day to learn -- just an hour or so to read the basic documentation: http://docs.python.org/2/library/ctypes.html
If you plan to write an entire application using ctypes, it'd be worth looking at Cython, which is another incredible project. But for just one or two data structures in a program, ctypes is perfect.
And best of all, ctypes is included in any regular Python distribution -- no need to install Cython or any additional software. Just run your Python script like you normally do.
EDIT: I was just demonstrating how to get an easy speed-up using ctypes, which is included in the Python standard library. To be clear, you would usually use this type of ctypes data structure along with a function from a C shared library.
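To make the shared-library point concrete, here is a minimal sketch showing that a ctypes Structure has a real C memory layout (the libgeom.so call at the end is hypothetical, just to illustrate the usual next step):

```python
from ctypes import Structure, c_int, sizeof

class Point(Structure):
    _fields_ = [("x", c_int), ("y", c_int)]

p = Point(3, 4)

# Two machine ints back to back, exactly like the C struct --
# no per-field Python objects.
assert sizeof(p) == 2 * sizeof(c_int)

# bytes() exposes the raw struct memory, as a C function would see it.
q = Point.from_buffer_copy(bytes(p))
assert (q.x, q.y) == (3, 4)

# In real use you would hand the struct to a function from a shared
# library, e.g. (hypothetical library and function names):
#   lib = CDLL("./libgeom.so")
#   lib.translate(byref(p), 1, 1)
```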
Furthermore, if you're serious about optimizations, and you can permit additional dependencies, you should absolutely look at cython and/or numpy, both of which are much faster than ctypes, although they do bring along additional complexity. Other commenters are also pointing out cffi, which I've never used but also bears consideration.
Using ctypes here is slower in CPython 2.7.5 than a simple class: summing 10000 points takes 1.81ms with a simple class vs 2.22ms with the code above. ctypes is designed for compatibility, not speed.
Using __slots__ is 0.06ms faster, while namedtuple is slower.
A bigger speedup in CPython can be achieved by just using the builtin sum function twice to avoid the for loop. Here's a gist: https://gist.github.com/absherwin/8976587
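A quick sketch of the two-sum idea, using made-up point data:

```python
# Hypothetical data: 10000 points as (x, y) tuples.
points = [(i, i + 1) for i in range(10000)]

# Explicit Python-level loop:
sx = sy = 0
for x, y in points:
    sx += x
    sy += y

# Builtin sum twice: two passes over the data, but each
# loop runs inside the interpreter's C implementation.
sx2 = sum(p[0] for p in points)
sy2 = sum(p[1] for p in points)

assert (sx, sy) == (sx2, sy2)
```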
In the specific example above, however, I'd try to reduce that overhead by going even more C-style and allocating an array or two so you avoid the overhead of individual instances and instead have one Python object for many elements.
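One stdlib way to sketch that one-object-for-many-elements idea is the array module, shown here with made-up data (a ctypes array of Points or NumPy would be alternatives):

```python
from array import array

n = 10000
# One compact Python object per coordinate column,
# instead of one Python object per point.
xs = array("i", range(n))
ys = array("i", range(n))

# Work on whole columns at once.
total_x = sum(xs)
total_y = sum(ys)
assert total_x == total_y == n * (n - 1) // 2
```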
This would be one of the approaches to use if you wanted to squeeze the last pieces of performance out of this class. I would recommend using Cython for this though.
The problem with both of these approaches is that if you use them too often, you suddenly realise you are not really coding in Python anymore.
Also, correct me if I'm wrong, but overuse of tricks like these is one of the key reasons why we cannot have nice things like PyPy for all modules.
I am a Python programmer who has yet to dive into ctypes, and I am curious if you have ever played with cffi[1], and if so, thoughts? The project seems to be aimed at the same use cases that you would use ctypes for.
[1] http://cffi.readthedocs.org/en/release-0.8/
"My experience in working on Graphite has reaffirmed a belief of mine that scalability has very little to do with low-level performance but instead is a product of overall design. I have run into many bottlenecks along the way but each time I look for improvements in design rather than speed-ups in performance. I have been asked many times why I wrote Graphite in Python rather than Java or C++, and my response is always that I have yet to come across a true need for the performance that another language could offer. In [Knu74], Donald Knuth famously said that premature optimization is the root of all evil. As long as we assume that our code will continue to evolve in non-trivial ways then all optimization is in some sense premature.
For example, when I first wrote whisper I was convinced that it would have to be rewritten in C for speed and that my Python implementation would only serve as a prototype. If I weren't under a time-crunch I very well may have skipped the Python implementation entirely. It turns out however that I/O is a bottleneck so much earlier than CPU that the lesser efficiency of Python hardly matters at all in practice."
-- Chris Davis, ex-Google, Graphite creator
This is the key. Very often, the app is IO bound, and moving to C will make little difference.
This is one reason why I'm very excited about Python 3's new asyncio module. It's asyncore done right, and is a great way to write network applications. I look forward to seeing what is built on top.
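As a rough sketch of the style it enables (written with the modern async/await syntax; the module as it shipped in Python 3.4 used @asyncio.coroutine and yield from instead):

```python
import asyncio

async def fetch(name, delay):
    # Stand-in for a network call; the event loop runs other
    # tasks while this one is waiting.
    await asyncio.sleep(delay)
    return name

async def main():
    # Both "requests" wait concurrently, so the total wall time
    # is roughly max(delay), not the sum.
    return await asyncio.gather(fetch("a", 0.01), fetch("b", 0.01))

print(asyncio.run(main()))  # ['a', 'b']
```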
What I find interesting here is that Python (and most other dynamically-typed languages) treat instances of classes as arbitrary bags of properties, with the negative performance implications of that, even though that feature is rarely used.
I'd love to have the time to examine codebases and get real data, but my strong hunch is that in Python and Ruby, most of the time every instance of a class has the exact same set of fields and methods. These languages pay a large performance penalty for all field accesses, to enable a rare use case.
JavaScript VMs (and maybe Ruby >=1.9?) don't pay the perf cost because they do very advanced optimizations ("hidden classes" in V8 et al.) to handle this. But then they pay the cost of the complexity of implementing that optimization.
I'm working on a little dynamically-typed scripting language[1] and one design decision I made was to make the set of fields in an object statically-determinable. Like Ruby, it uses a distinct syntax for fields, so we can tell just by parsing the full set a class uses.
My implementation is less than 5k lines of not-very-advanced C code, and it runs the DeltaBlue[2] benchmark about three times faster than CPython.
People think you need static types for efficiency, but I've found you can get quite far just having static shape. You can still be fully dynamically-dispatched and dynamically type your variables, but locking down the fields and methods of an object simplifies a lot of things. You lose some metaprogramming flexibility, but I'm interested in seeing how much of a trade-off that really is in practice.
[1] https://github.com/munificent/wren
[2] https://github.com/munificent/wren/blob/master/benchmark/del...
Python actually comes with static shapes for objects via __slots__:
    class vertex(object):
        __slots__ = ["x", "y", "z"]
I use __slots__ in most code after the first refactor for a very important reason - I'm a hyper-caffeinated code monkey.
I wasted an entire day because I set self.peices in one place and self.pieces in another & couldn't figure out the bug, until I used the "grep method".
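A small sketch of why that helps: with __slots__, the misspelled assignment blows up immediately instead of silently creating a second attribute:

```python
class Vertex(object):
    __slots__ = ("x", "y", "z")

v = Vertex()
v.x, v.y, v.z = 1.0, 2.0, 3.0

try:
    v.peices = []  # typo: on a normal class this silently adds a new field
    caught = False
except AttributeError:
    caught = True

assert caught
```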
Common Lisp implementations implement CLOS and structure objects as vectors. In high-performance implementations CLOS dynamism can be reduced so that slot-accessors can be compiled to inlined vector references.
I think this is a very crucial point. There are dynamic languages which can be implemented efficiently (e.g., Lisp), and with others it is difficult because of fundamental features of the language. Another example is monkey patching, e.g., replacing a method of an object at runtime. Restricting such language features and thereby speeding up the interpreter would have been an ideal selling point for Python 3, in my opinion, but of course it's too late now.
The feature is maybe rarely used directly, but it is crucial to making dynamic typing actually useful. Things like SQLAlchemy are possible thanks to the dynamic aspects of Python. Writing a simple plugin system in Python is trivial. Mocking is more or less trivial in these languages as well.
It is of course possible to write a faster Python, as PyPy proves, but some of the design choices in Python do make it hard to optimize. This is not inherently a problem with dynamic languages, as the development of Julia proves: in Julia it is easy to write programs that are in the same ballpark as C efficiency-wise, while still feeling very dynamic and having a REPL. Python's problem is that both the interpreter and the language are quite old, and existed before modern JIT technology was developed, so that was not considered when designing the language and the interpreter.
That said, Python's emphasis on fast development as opposed to fast-running code is very often the right tradeoff, and is why it is so popular.
The trick is often to somehow get Python to hand the "real work" off to something implemented in C, Fortran, C++, etc. That's pretty easy for a lot of numerical and scientific applications thanks to NumPy, SciPy, Pandas, PIL, and more. Lots of libraries are exposed to Python, and you can expose more either by compiling bindings (see Boost.Python) or using ctypes. If you do this well, nobody will notice the speed difference, yet your code will still look like Python.
And sometimes your performance problem is not Python's fault at all, such as this lovely performance bug: https://github.com/paramiko/paramiko/issues/175 - a 5-10x speedup can be had simply by tweaking one parameter that is no fault of CPython or the GIL or anything else (in fact, Go had basically the same bug).
What I'm getting at here is that performance is not just one thing, and the GIL is a real and worthy spectre but hardly matters for most applications where Python is relevant. Your Python is slow for the same reason that your Bash and your C and your Javascript and your VBA are slow: you haven't profiled enough, you haven't thought long and hard with empirical results in front of you about where your program is spending its time, your program is improperly factored and doing work in the wrong order, or perhaps you should just throw some hardware at it!
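"Profile first" can be as little as a few lines of stdlib; a minimal sketch, where work() is a hypothetical stand-in for your program's hot path:

```python
import cProfile
import io
import pstats

def work():
    # Hypothetical hot path standing in for the real program.
    return sum(i * i for i in range(100000))

prof = cProfile.Profile()
result = prof.runcall(work)

# Print the top few entries by cumulative time, and look at the
# report before deciding what (if anything) to optimize.
out = io.StringIO()
pstats.Stats(prof, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())

# runcall returns the profiled function's own return value.
assert result == sum(i * i for i in range(100000))
```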
We have only had 23 years of Python interpreter development; how will things look when Python is 42, like C?
C, which I always think of as an ancient venerable systems language, is less than twice as old as Python, which I think of as a hot new kid on the block.
In thirty years, when my career will probably be drawing to a close, Python will be 53 years old and C will be 72 years old. Barely any difference at all.
Who would use a dictionary for a point? If it's a 2D grid, you are only going to have two axes and by convention x-axis is always first, and y-axis is always second. Perfect use case for a tuple.
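A sketch of that convention; namedtuple is a middle ground if you want names (though, as noted elsewhere in the thread, it can be slower than a plain class):

```python
p = (3, 4)      # by convention: (x, y)
x, y = p        # unpacking keeps the order explicit
assert (x, y) == (3, 4)

# namedtuple gives field names while remaining a real tuple:
from collections import namedtuple
Point = namedtuple("Point", ["x", "y"])
q = Point(3, 4)
assert q.x == 3 and q == (3, 4)
```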
I'm curious where one would go to learn specifically about performance issues in Python - patterns to avoid, and those to stick to. Any recommendations?
The article was kind of interesting, it's always good to talk about why things are slow, since that might help highlight what you can do about it (short of switching away from Python, that is).
I was a bit annoyed about the benchmarks part; there was no C-based benchmark for comparison, and it was not clear how many points were being processed, which made the exact times kind of pointless. Also, C and C++ are not the same language, though the author seems to treat them as if they were.
> C and C++ are not the same language, though the author seems to treat them as if they were.
The part where he lost me was when he said this referring to C:
> Here we tell the interpreter that each point would have exactly two fields, x and y.
Maybe I misread that, but I don't think I did, he was referring to the C struct when he said that. It's one of those glaring mistakes that is hard for me to look past when someone says something like this.
There is a difference between C and C++, but for the purposes of this article it does not matter -- the point I am trying to make is that you would not use a hash table in either of them, would you?
The exact times are kind of pointless indeed; all that matters are the ratios, and a C benchmark would probably be just as pointless. What I was trying to show was that your runtime comparisons are not necessarily comparing the same things.
The sentence "Python runs slow" is a little misleading. While it is true that Python does more stuff than other languages (C, for example) for similar operations, it is also true that in the majority of cases, the relevant time-consuming parts of execution are in other areas, like waiting for I/O.
In most cases, a program written in Python (or Ruby, or JavaScript) is not slower than a program written in C or even assembly. Sure, for some cases it may be necessary to take care of rough spots, but I think that just saying "Python is slow" is not a very fortunate way to put it...
Rather than using a better data structure from the start (class vis-a-vis dictionaries), why not move to a better Python implementation (PyPy vis-a-vis CPython)? When I need a data structure, I look for one that is a natural fit to the problem. Perhaps everything is implemented as a hash table in a certain language (not taking names here :)) but would I replace every data structure with a hash table for efficiency? That would be Premature Optimisation.
I would recommend coding in idiomatic Python as much as possible (in this case, use a tuple for Point), primarily because it is easy for other fellow programmers and your future self to understand. Performance optimisations are best done by the compiler for a reason: hand-tuning is usually a one-way path that affects readability.
For my casual use, Python has been fast enough that I can be pretty carefree about how I program. But I had one shocking experience when I ported a program with a fair amount of math over to a Raspberry Pi, and it practically ground to a halt. That was a disappointment. It won't drive me away from Python, but it means I've got some learning to do on how to better optimize Python code.
This should be bringing up the classic array-of-structures versus structure-of-arrays debate instead of what was mentioned. If only NumPy were standard Python, we could have an easy, standard answer that is usually fast enough.
I'm surprised that Python's interpreter tech isn't advanced enough to try to JIT some of this away. Python could probably learn a lot from various JS interpreter advances.
On a related note, the field of compilers seems to have ignored data optimization almost completely in favor of code optimization.
Applying a few basic tricks typically lets a human save 50-75% of an object's storage requirements, and a compiler could combine this with profile-feedback data about access patterns to make the layout more cache-friendly.
This would be a good fit for JIT where the compiler could do deopt when the assumptions are violated, and this would let you make more aggressive optimizations.
This article is making a misleading comparison. Yes, dynamically typed languages have overheads that statically typed languages don't. But the better question to ask is why Python implementations are factors slower than leading javascript implementations, which have the same theoretical overheads that Python does.
V8 is now about a factor of 4 or 5 off the speed of C for general-purpose code, while CPython is 30x to 100x.
With proper implementations and correct use of data structures, there are seldom reasons to use C for user-space applications.
If you didn't measure it then how did you come to believe this?