If you're using Python for performance-critical applications, simple use of the built-in ctypes module (which uses libffi behind the scenes) can make a world of difference. There's a modest call overhead, but basically you're getting near-C performance with proper use of ctypes.
Coincidentally, one of the usual examples given in the ctypes howto is a Point structure, just like in this post. It's simple:
    from ctypes import *

    class Point(Structure):
        _fields_ = [("x", c_int), ("y", c_int)]
Then you can use the Point class the same way you'd use a regular Python class:
    p = Point(3, 4)
    assert p.x == 3
    assert p.y == 4
Really, taking a half-day to learn how to use ctypes effectively can make a world of difference to your performance-critical Python code when you need to stop and think about data structures. Actually, if you already know C, it's less than a half-day to learn -- just an hour or so to read the basic documentation: http://docs.python.org/2/library/ctypes.html
If you plan to write an entire application using ctypes, it'd be worth looking at Cython, which is another incredible project. But for just one or two data structures in a program, ctypes is perfect.
And best of all, ctypes is included in any regular Python distribution -- no need to install Cython or any additional software. Just run your Python script like you normally do.
EDIT: I was just demonstrating how to get an easy speed-up using ctypes, which is included in the Python standard library. To be clear, you would usually use this type of ctypes data structure along with a function from a C shared library.
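To make the shared-library point concrete, here is a minimal sketch showing that a ctypes Structure has a real C memory layout (the libgeom.so call at the end is hypothetical, just to illustrate the usual next step):

```python
from ctypes import Structure, c_int, sizeof

class Point(Structure):
    _fields_ = [("x", c_int), ("y", c_int)]

p = Point(3, 4)

# Two machine ints back to back, exactly like the C struct --
# no per-field Python objects.
assert sizeof(p) == 2 * sizeof(c_int)

# bytes() exposes the raw struct memory, as a C function would see it.
q = Point.from_buffer_copy(bytes(p))
assert (q.x, q.y) == (3, 4)

# In real use you would hand the struct to a function from a shared
# library, e.g. (hypothetical library and function names):
#   lib = CDLL("./libgeom.so")
#   lib.translate(byref(p), 1, 1)
```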
Furthermore, if you're serious about optimizations, and you can permit additional dependencies, you should absolutely look at cython and/or numpy, both of which are much faster than ctypes, although they do bring along additional complexity. Other commenters are also pointing out cffi, which I've never used but also bears consideration.
Using ctypes here is slower in CPython 2.7.5 than a simple class: summing 10000 points takes 1.81ms with a simple class vs 2.22ms with the code above. ctypes is designed for compatibility, not speed.
Using __slots__ is 0.06ms faster, while namedtuple is slower.
A bigger speedup in CPython can be achieved by just using the builtin sum function twice to avoid the for loop. Here's a gist: https://gist.github.com/absherwin/8976587
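A quick sketch of the two-sum idea, using made-up point data:

```python
# Hypothetical data: 10000 points as (x, y) tuples.
points = [(i, i + 1) for i in range(10000)]

# Explicit Python-level loop:
sx = sy = 0
for x, y in points:
    sx += x
    sy += y

# Builtin sum twice: two passes over the data, but each
# loop runs inside the interpreter's C implementation.
sx2 = sum(p[0] for p in points)
sy2 = sum(p[1] for p in points)

assert (sx, sy) == (sx2, sy2)
```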
In the specific example above, however, I'd try to reduce that overhead by going even more C-style and allocating an array or two so you avoid the overhead of individual instances and instead have one Python object for many elements.
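One stdlib way to sketch that one-object-for-many-elements idea is the array module, shown here with made-up data (a ctypes array of Points or NumPy would be alternatives):

```python
from array import array

n = 10000
# One compact Python object per coordinate column,
# instead of one Python object per point.
xs = array("i", range(n))
ys = array("i", range(n))

# Work on whole columns at once.
total_x = sum(xs)
total_y = sum(ys)
assert total_x == total_y == n * (n - 1) // 2
```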
This would be one of the approaches to use if you wanted to squeeze the last pieces of performance out of this class. I would recommend using Cython for this though.
The problem with both of these approaches is that if you use them too often, you suddenly realise you are not really coding in Python anymore.
Also, correct me if I'm wrong, but overuse of tricks like these is one of the key reasons why we cannot have nice things like PyPy for all modules.
I am a Python programmer who has yet to dive into ctypes, and I am curious if you have ever played with cffi[1], and if so, thoughts? The project seems to be aimed at the same use cases that you would use ctypes for.
[1] http://cffi.readthedocs.org/en/release-0.8/
"My experience in working on Graphite has reaffirmed a belief of mine that scalability has very little to do with low-level performance but instead is a product of overall design. I have run into many bottlenecks along the way but each time I look for improvements in design rather than speed-ups in performance. I have been asked many times why I wrote Graphite in Python rather than Java or C++, and my response is always that I have yet to come across a true need for the performance that another language could offer. In [Knu74], Donald Knuth famously said that premature optimization is the root of all evil. As long as we assume that our code will continue to evolve in non-trivial ways then all optimization is in some sense premature.
For example, when I first wrote whisper I was convinced that it would have to be rewritten in C for speed and that my Python implementation would only serve as a prototype. If I weren't under a time-crunch I very well may have skipped the Python implementation entirely. It turns out however that I/O is a bottleneck so much earlier than CPU that the lesser efficiency of Python hardly matters at all in practice."
-- Chris Davis, ex-Google, Graphite creator
This is the key. Very often, the app is IO bound, and moving to C will make little difference.
This is one reason why I'm very excited about Python 3's new asyncio module. It's asyncore done right, and is a great way to write network applications. I look forward to seeing what is built on top.
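As a rough sketch of the style it enables (written with the modern async/await syntax; the module as it shipped in Python 3.4 used @asyncio.coroutine and yield from instead):

```python
import asyncio

async def fetch(name, delay):
    # Stand-in for a network call; the event loop runs other
    # tasks while this one is waiting.
    await asyncio.sleep(delay)
    return name

async def main():
    # Both "requests" wait concurrently, so the total wall time
    # is roughly max(delay), not the sum.
    return await asyncio.gather(fetch("a", 0.01), fetch("b", 0.01))

print(asyncio.run(main()))  # ['a', 'b']
```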
What I find interesting here is that Python (and most other dynamically-typed languages) treat instances of classes as arbitrary bags of properties, with the negative performance implications of that, even though that feature is rarely used.
I'd love to have the time to examine codebases and get real data, but my strong hunch is that in Python and Ruby, most of the time every instance of a class has the exact same set of fields and methods. These languages pay a large performance penalty for all field accesses, to enable a rare use case.
JavaScript VMs (and maybe Ruby >=1.9?) don't pay the perf cost because they do very advanced optimizations ("hidden classes" in V8 et al.) to handle this. But then they pay the cost of the complexity of implementing that optimization.
I'm working on a little dynamically-typed scripting language[1] and one design decision I made was to make the set of fields in an object statically-determinable. Like Ruby, it uses a distinct syntax for fields, so we can tell just by parsing the full set a class uses.
My implementation is less than 5k lines of not-very-advanced C code, and it runs the DeltaBlue[2] benchmark about three times faster than CPython.
People think you need static types for efficiency, but I've found you can get quite far just having static shape. You can still be fully dynamically-dispatched and dynamically type your variables, but locking down the fields and methods of an object simplifies a lot of things. You lose some metaprogramming flexibility, but I'm interested in seeing how much of a trade-off that really is in practice.
[1] https://github.com/munificent/wren
[2] https://github.com/munificent/wren/blob/master/benchmark/del...
Python actually comes with static shapes for objects via __slots__:
    class vertex(object):
        __slots__ = ["x", "y", "z"]
I use __slots__ in most code after the first refactor for a very important reason - I'm a hyper-caffeinated code monkey.
I wasted an entire day because I set self.peices in one place and self.pieces in another & couldn't figure out the bug, until I used the "grep method".
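A small sketch of why that helps: with __slots__, the misspelled assignment blows up immediately instead of silently creating a second attribute:

```python
class Vertex(object):
    __slots__ = ("x", "y", "z")

v = Vertex()
v.x, v.y, v.z = 1.0, 2.0, 3.0

try:
    v.peices = []  # typo: on a normal class this silently adds a new field
    caught = False
except AttributeError:
    caught = True

assert caught
```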
Common Lisp implementations implement CLOS and structure objects as vectors. In high-performance implementations CLOS dynamism can be reduced so that slot-accessors can be compiled to inlined vector references.
I think this is a very crucial point. There are dynamic languages which can be implemented efficiently (e.g., Lisp), and with others it is difficult because of fundamental features of the language. Another example is monkey patching, e.g., replacing a method of an object at runtime. Restricting such language features and thereby speeding up the interpreter would have been an ideal selling point for Python 3, in my opinion, but of course it's too late now.
The feature is maybe rarely used directly, but it is crucial to making dynamic typing actually useful. Things like SQLAlchemy are possible thanks to the dynamic aspects of Python. Writing a simple plugin system in Python is trivial. Mocking is more or less trivial in these languages as well.
It is of course possible to write a faster Python, as PyPy proves, but some of the design choices in Python do make it hard to optimize. This is not inherently a problem with dynamic languages, as the development of Julia proves: in Julia it is easy to write programs that are in the same ballpark as C efficiency-wise, while still feeling very dynamic and having a REPL. Python's problem is that both the interpreter and the language are quite old, and existed before modern JIT technology was developed, so that was not considered when designing the language and the interpreter.
That said, Python's emphasis on fast development as opposed to fast-running code is very often the right tradeoff, and is why it is so popular.
The trick is often to somehow get Python to hand the "real work" off to something implemented in C, Fortran, C++, etc. That's pretty easy for a lot of numerical and scientific applications thanks to NumPy, SciPy, Pandas, PIL, and more. Lots of libraries are exposed to Python, and you can expose more either by compiling bindings (see Boost.Python) or using ctypes. If you do this well, nobody will notice the speed difference, yet your code will still look like Python.
And sometimes your performance problem is not Python's fault at all, such as this lovely performance bug: https://github.com/paramiko/paramiko/issues/175 - a 5-10x speedup can be had simply by tweaking one parameter that is no fault of CPython or the GIL or anything else (in fact, Go had basically the same bug).
What I'm getting at here is that performance is not just one thing, and the GIL is a real and worthy spectre but hardly matters for most applications where Python is relevant. Your Python is slow for the same reason that your Bash and your C and your Javascript and your VBA are slow: you haven't profiled enough, you haven't thought long and hard with empirical results in front of you about where your program is spending its time, your program is improperly factored and doing work in the wrong order, or perhaps you should just throw some hardware at it!
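"Profile first" can be as little as a few lines of stdlib; a minimal sketch, where work() is a hypothetical stand-in for your program's hot path:

```python
import cProfile
import io
import pstats

def work():
    # Hypothetical hot path standing in for the real program.
    return sum(i * i for i in range(100000))

prof = cProfile.Profile()
result = prof.runcall(work)

# Print the top few entries by cumulative time, and look at the
# report before deciding what (if anything) to optimize.
out = io.StringIO()
pstats.Stats(prof, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())

# runcall returns the profiled function's own return value.
assert result == sum(i * i for i in range(100000))
```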
We have only had 23 years of Python interpreter development; how will things look when Python is 42, like C?
C, which I always think of as an ancient venerable systems language, is less than twice as old as Python, which I think of as a hot new kid on the block.
In thirty years, when my career will probably be drawing to a close, Python will be 53 years old and C will be 72 years old. Barely any difference at all.
Who would use a dictionary for a point? If it's a 2D grid, you are only going to have two axes and by convention x-axis is always first, and y-axis is always second. Perfect use case for a tuple.
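A sketch of that convention; namedtuple is a middle ground if you want names (though, as noted elsewhere in the thread, it can be slower than a plain class):

```python
p = (3, 4)      # by convention: (x, y)
x, y = p        # unpacking keeps the order explicit
assert (x, y) == (3, 4)

# namedtuple gives field names while remaining a real tuple:
from collections import namedtuple
Point = namedtuple("Point", ["x", "y"])
q = Point(3, 4)
assert q.x == 3 and q == (3, 4)
```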
I'm curious where one would go to learn specifically about performance issues in Python - patterns to avoid, and those to stick to. Any recommendations?
The article was kind of interesting, it's always good to talk about why things are slow, since that might help highlight what you can do about it (short of switching away from Python, that is).
I was a bit annoyed about the benchmarks part; there was no C-based benchmark for comparison, and it was not clear how many points were being processed, which made the exact times kind of pointless. Also, C and C++ are not the same language, though the author seems to treat them as if they were.
> C and C++ are not the same language, though the author seems to treat them as if they were.
The part where he lost me was when he said this referring to C:
> Here we tell the interpreter that each point would have exactly two fields, x and y.
Maybe I misread that, but I don't think I did, he was referring to the C struct when he said that. It's one of those glaring mistakes that is hard for me to look past when someone says something like this.
There is a difference between C and C++, but for the purposes of this article it does not matter -- the point I am trying to make is that you would not use a hash table in either of them, would you?
The exact times are kind of pointless indeed; all that matters are the ratios, and a C benchmark would probably be just as pointless. What I was trying to show was that your runtime comparisons are not necessarily comparing the same things.
The sentence "Python runs slow" is a little misleading. While it is true that Python does more stuff than other languages (C, for example) for similar operations, it is also true that in the majority of cases, the relevant time-consuming parts of execution are in other areas, like waiting for I/O.
In most cases, a program written in Python (or Ruby, or JavaScript) is not slower than a program written in C or even assembly. Sure, for some cases it may be necessary to take care of rough spots, but I think that just saying "Python is slow" is not a very fortunate way to put it...
Rather than using a better data structure from the start (class vis-a-vis dictionaries), why not move to a better Python implementation (PyPy vis-a-vis CPython)? When I need a data structure, I look for one that is a natural fit to the problem. Perhaps everything is implemented as a hash table in a certain language (not taking names here :)) but would I replace every data structure with a hash table for efficiency? That would be Premature Optimisation.
I would recommend coding in idiomatic Python as much as possible (in this case, use a tuple for Point), primarily because it is easy for other fellow programmers and your future self to understand. Performance optimisations are best done by the compiler for a reason: hand-tuning is usually a one-way path that affects readability.
For my casual use, Python has been fast enough that I can be pretty carefree about how I program. But I had one shocking experience when I ported a program with a fair amount of math over to a Raspberry Pi, and it practically ground to a halt. That was a disappointment. It won't drive me away from Python, but it means I've got some learning to do on how to better optimize Python code.
This should be bringing up the classic array-of-structures versus structure-of-arrays debate instead of what was mentioned. If only NumPy were standard Python, we could have an easy, standard answer that is usually fast enough.
I'm surprised that Python's interpreter tech isn't advanced enough to try to JIT some of this away. Python could probably learn a lot from various JS interpreter advances.
On a related note, the field of compilers seems to have ignored data optimization almost completely in favor of code optimization.
Applying a few basic tricks typically lets a human save 50-75% of an object's storage requirements, and a compiler could combine this with profile-feedback data about access patterns to make the layout more cache-friendly.
This would be a good fit for JIT where the compiler could do deopt when the assumptions are violated, and this would let you make more aggressive optimizations.
This article is making a misleading comparison. Yes, dynamically typed languages have overheads that statically typed languages don't. But the better question to ask is why Python implementations are factors slower than leading javascript implementations, which have the same theoretical overheads that Python does.
V8 is now about a factor of 4 or 5 off the speed of C for general-purpose code, while CPython is 30x to 100x.
With proper implementations and correct use of data structures, there are seldom reasons to use C for user-space applications.
If you didn't measure it then how did you come to believe this?