I just want to point this out because I feel like there's a good chance a lot of people won't have gotten this far:
Because our implementation does not explicitly depend on Python, we are able to overcome many of the shortcomings of the Python runtime, such as running without the GIL and utilising real threads to dispatch custom Numba kernels at near-C speed, without the performance limitations of Python.
Yes, using Numba we can just-in-time compile numeric Python logic straight down to machine code, so naturally we can achieve some pretty impressive numbers on kernel execution.
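As a rough sketch of what that looks like in practice (illustrative only, not from the thread): Numba's @jit decorator compiles a plain numeric loop to machine code. The fallback decorator below is an assumption added so the snippet still runs where Numba isn't installed.

```python
try:
    from numba import jit
except ImportError:
    # Fallback so the sketch runs without Numba: a no-op decorator.
    def jit(**kwargs):
        return lambda fn: fn

@jit(nopython=True)
def harmonic(n):
    # A simple numeric loop: exactly the kind of code Numba compiles down
    # to machine code on first call.
    total = 0.0
    for k in range(1, n + 1):
        total += 1.0 / k
    return total

print(harmonic(3))  # ≈ 1.8333
```

The function body is unmodified Python; only the decorator changes, which is what makes the "near C speed" claim cheap to try on existing numeric code.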
In case many people didn't reach the bottom, here are the links to the repo and the docs. The project is still in early stages, but is public and released under a BSD license.
I taught High Performance Python, covering the tools you mention, at PyCon 2012 (and at EuroPython last year); maybe my videos and write-up will be helpful. I also cover profiling, ShedSkin, PyCUDA, etc.
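For the profiling piece, a minimal stdlib-only sketch of the workflow such a course covers (the function being profiled here is made up for illustration):

```python
import cProfile
import io
import pstats

def slow_sum(n):
    # Deliberately does per-element Python work so it shows up in the profile.
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(100_000)
profiler.disable()

# Render the top entries of the profile report into a string.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
```

Tools like line_profiler (mentioned elsewhere in the thread) go a step further and attribute time to individual lines rather than whole functions.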
Depends on your application. Ideally you want to change your code so that as much computation as possible can happen in pure C code and pure C data types (using Cython). If you have a big class tree with many callbacks and work spread over hundreds of methods, that can be difficult.
Before you go that far, I'd recommend making sure you know all the Python gotchas (for example, maybe you have some inner loop that does for x in range(100000) all the time; on Python 2, range builds the full list on every call, where xrange would not), and that your algorithms are in order. Sometimes even silly micro-optimization can make a difference if a small function accounts for a significant share of your runtime. Using multiple processes with e.g. the multiprocessing module can be an option too.
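One concrete example of the kind of gotcha being described (a sketch, not from the thread): repeated membership tests against a list are O(n) each, while a set makes them O(1) on average, with no change to the results.

```python
import timeit

data_list = list(range(10_000))
data_set = set(data_list)

# Worst case for the list: the sought element is at the very end.
t_list = timeit.timeit(lambda: 9_999 in data_list, number=1_000)  # scans the list
t_set = timeit.timeit(lambda: 9_999 in data_set, number=1_000)    # one hash probe

faster_with_set = t_set < t_list  # expected to hold by a wide margin here
```

A one-line data-structure change like this is often worth checking before reaching for Cython or C.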
Depending on what data types you operate on, numpy (and now this new thing) can do some amazing things.
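A sketch of what numpy buys you on suitable data: the same reduction written as an interpreted Python loop and as one vectorized call that runs in numpy's compiled core.

```python
import numpy as np

xs = np.arange(10_000, dtype=np.float64)

def loop_sum_of_squares(values):
    total = 0.0
    for v in values:  # one interpreted iteration per element
        total += v * v
    return total

# Single call into compiled code; same arithmetic, far fewer interpreter steps.
vectorized = float(np.dot(xs, xs))
```

For integer-valued inputs of this size both versions are exact in float64, so they agree bit-for-bit; the vectorized form is simply doing the loop in C.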
Cython is helpful, but you have to spell out a lot of type information that shouldn't strictly be necessary. You might also try Numba; the easiest way to get it is via Anaconda CE or Wakari, both at http://continuum.io
Cython is good, but sometimes it's a bit tricky to bend it to do exactly what you want[1]. You'll probably still want to write that hot piece in C... but gluing it in with Cython is IMHO much nicer than using the raw Python/C API.
[1] On the other hand, it comes with an annotation tool that shows exactly what C code each line of your Cython compiles to, with color-coding to give a high-level overview of which pieces translated smoothly.
I would go with C/C++, as the ways to address performance there are well studied. There are many tools out there, like callgrind or nvvp, that will make it relatively pain-free.
I can narrow down performance in C/C++ quite quickly, but neither I nor anybody I know has done much of this for Python. Many people who I work with consider a Python implementation a prototype, while Fortran/C/C++ is mature real code worthy of attention.
The only real downside is that C/C++ requires a little knowledge of POSIX/Linux or Windows. This represents a learning curve, but once you are over it, these are quite durable, long-lasting skills.
So, is there anyone using Python for machine learning in production systems (i.e. not just for prototyping)? I would love to do it, but Java/Mahout seems a safer choice, performance-wise.
I wonder whether Blaze is a step in that direction.
I use Python for nearly all of my ETL processes that involve text processing. Even in production systems, I'd be hard-pressed to point to any significant performance issues. Python facilitates implementing algorithms in a functional style, which I tend to prefer over the imperative style (e.g., Java). With C++11 and boost, I'm able to translate my Python code to C++ while preserving the functional style, which has immensely simplified prototyping/deploying NLP/ML algorithms while simultaneously begetting enormous performance gains. I see Python as an extremely viable alternative to Java.
We also use Python in production at PlotWatt for machine learning. We started by prototyping in Matlab and then porting to C++, but have since found it much, much easier to just do everything in Python and numpy. When speed was an issue, we slightly changed the way we implemented the algorithm rather than reimplementing the same algorithm in a faster language. Admittedly this isn't always possible.
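A hypothetical example of the kind of rewrite being described (not PlotWatt's actual code): re-summing every window of a moving average is O(n·w), while keeping a running total is O(n), with identical results.

```python
def window_means_naive(values, w):
    # Re-sums each window from scratch: O(n * w) total work.
    return [sum(values[i:i + w]) / w for i in range(len(values) - w + 1)]

def window_means_running(values, w):
    # Slides the window in O(1) per step by updating a running total: O(n).
    total = sum(values[:w])
    means = [total / w]
    for i in range(w, len(values)):
        total += values[i] - values[i - w]
        means.append(total / w)
    return means
```

Same algorithm from the caller's point of view, but the reformulation removes the inner loop entirely, which is often enough to keep the code in Python.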
It would be great to eventually have a GPU version as well (as in the cases of Matlab and R). I saw a brief demo of Matlab on a Mac Retina Pro 15 where the GPU version ran 30x faster than the CPU version.
I read about Continuum after the fellow who developed NumPy left a few days ago to work on Continuum. I am curious to see actual projects using Continuum, so some sort of write-ups would be great.
in general, i like (ie i don't see a better solution than) the idea of having an AST constructed via an embedded language that is implemented by a library. but it does have downsides - integration with other python features is going to be much more limited (it gives the illusion of a python solution, but in practice you're off in some other world that only looks like python).
are there more details? i guess the AST is fed to something that does the work. and that something will have an API and be replaceable. but is that something also composable? does it have, say, a part related to moving data and another to evaluating data? so that you can combine "distributed across local machines" with "evaluate on GPU"?
> how does this compare to theano? it seems like some of the ideas are similar?
It's quite similar; we just take some of the ideas further and try to generalize the data storage to include storage backends that data scientists use more frequently (e.g. SQL, CSV, S3, etc.). We're very friendly with the Theano developers and hope to bridge the projects with a compatibility layer at some point.
> (it gives the illusion of a python solution, but in practice you're off in some other world that only looks like python).
I would argue that's what makes Python a great numeric language, and NumPy so successful. You get this high-level language where you can express domain knowledge, but also a 1:1 mapping to fast code execution at the C level. Blaze is the continuation of that vision.
> i guess the AST is fed to something that does the work. and that something will have an API and be replaceable.
Precisely. We build up an intermediate form called ATerm out of the constructed expression objects, do type inference and graph rewriting, and then pattern-match our layout, metadata, and type information against a number of backends to find the best one to perform the execution. Or, if needed, we build a custom kernel with Numba informed by all the type and data-layout information we've inferred from the graph.
We don't aim to solve all the subproblems in this area (expression optimization passes, distributed scheduling), but I think we have a robust enough system that others can build extensions to Blaze to do expression evaluation in whatever fashion they like.
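A toy illustration of the general pattern being discussed (this is not Blaze's real API): operators build an expression AST instead of computing anything, and a pluggable backend then walks the graph to evaluate it, so the same tree could in principle be handed to a GPU or distributed backend instead.

```python
class Expr:
    # Operators defer computation by returning AST nodes.
    def __add__(self, other):
        return Op("add", self, wrap(other))
    def __mul__(self, other):
        return Op("mul", self, wrap(other))

class Const(Expr):
    def __init__(self, value):
        self.value = value

class Op(Expr):
    def __init__(self, name, left, right):
        self.name, self.left, self.right = name, left, right

def wrap(x):
    # Lift plain numbers into the expression language.
    return x if isinstance(x, Expr) else Const(x)

def evaluate(node, backend):
    # `backend` maps op names to implementations; swapping the dict swaps
    # the execution engine without touching the expression graph.
    if isinstance(node, Const):
        return node.value
    return backend[node.name](evaluate(node.left, backend),
                              evaluate(node.right, backend))

python_backend = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

tree = Const(2) * Const(3) + 4           # deferred: nothing computed yet
result = evaluate(tree, python_backend)  # walk the AST: 2 * 3 + 4
```

The replaceable-backend question above maps onto the `backend` argument here; the composability question (separate data-movement and evaluation concerns) would correspond to splitting that interface further.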
ezl|13 years ago
freyrs3|13 years ago
* http://blaze.pydata.org/docs/
* https://github.com/ContinuumIO/blaze
rpm4321|13 years ago
I'm starting to run into some performance bottlenecks with Python, and so I'm just now looking at Cython, PyPy, Psyco, and... gasp... C.
From what little I've read, Cython is supposed to be as easy as adding some typing and modifying a few loops here and there, and you are in business.
IanOzsvald|13 years ago
http://ianozsvald.com/2012/03/18/high-performance-python-1-f...
Erwin|13 years ago
PS: check out things like http://packages.python.org/line_profiler/ that go beyond ordinary profiling.
travisoliphant|13 years ago
chadillac83|13 years ago
http://benchmarksgame.alioth.debian.org/u32/which-programs-a...
http://en.wikipedia.org/wiki/Lua_(programming_language)#C_AP...
lrem|13 years ago
frozenport|13 years ago
greenonion|13 years ago
law|13 years ago
dwiel|13 years ago
davidf18|13 years ago
freyrs3|13 years ago
[1] http://www.ustream.tv/recorded/26973799
[2] https://store.continuum.io/cshop/numbapro
Caligula|13 years ago
omni|13 years ago
andrewcooke|13 years ago
http://deeplearning.net/software/theano/
freyrs3|13 years ago
> are there more details?
Yes! See: http://blaze.pydata.org/
lucian1900|13 years ago
piqufoh|13 years ago
rerere|13 years ago
[deleted]