I just want to point this out because I feel like there's a good chance a lot of people won't have gotten this far:
Because our implementation does not explicitly depend on Python, we are able to overcome many of the shortcomings of the Python runtime, such as running without the GIL and utilising real threads to dispatch custom Numba kernels at near-C speed, without the performance limitations of Python.
Yes, using Numba we can just-in-time compile numeric Python logic straight down to machine code, so naturally we can achieve some pretty impressive numbers on kernel execution.
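As a rough sketch of what that looks like in practice (illustrative only, not from the thread): Numba's @jit decorator compiles a plain numeric loop to machine code. The fallback decorator below is an assumption added so the snippet still runs where Numba isn't installed.

```python
try:
    from numba import jit
except ImportError:
    # Fallback so the sketch runs without Numba: a no-op decorator.
    def jit(**kwargs):
        return lambda fn: fn

@jit(nopython=True)
def harmonic(n):
    # A simple numeric loop: exactly the kind of code Numba compiles down
    # to machine code on first call.
    total = 0.0
    for k in range(1, n + 1):
        total += 1.0 / k
    return total

print(harmonic(3))  # ≈ 1.8333
```

The function body is unmodified Python; only the decorator changes, which is what makes the "near C speed" claim cheap to try on existing numeric code.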
In case many people didn't reach the bottom, here are the links to the repo and the docs. The project is still in early stages, but is public and released under a BSD license.
I taught High Performance Python, covering the tools you mention, at PyCon 2012 (and at EuroPython last year); maybe my videos and write-up will be helpful. I also cover profiling, ShedSkin, PyCUDA, etc.
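For the profiling piece, a minimal stdlib-only sketch of the workflow such a course covers (the function being profiled here is made up for illustration):

```python
import cProfile
import io
import pstats

def slow_sum(n):
    # Deliberately does per-element Python work so it shows up in the profile.
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(100_000)
profiler.disable()

# Render the top entries of the profile report into a string.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
```

Tools like line_profiler (mentioned elsewhere in the thread) go a step further and attribute time to individual lines rather than whole functions.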
Depends on your application. Ideally you want to change your code so that as much computation as possible can happen in pure C code and pure C data types (using Cython). If you have a big class tree with many callbacks and work spread over hundreds of methods, that can be difficult.
Before you go that far, I'd recommend making sure you know all the Python gotchas (for example, maybe you have some inner loop that does for x in range(100000) all the time; on Python 2, range builds the full list on every call, where xrange would not), and that your algorithms are in order. Sometimes even silly micro-optimization can make a difference if a small function accounts for a significant share of your runtime. Using multiple processes with e.g. the multiprocessing module can be an option too.
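One concrete example of the kind of gotcha being described (a sketch, not from the thread): repeated membership tests against a list are O(n) each, while a set makes them O(1) on average, with no change to the results.

```python
import timeit

data_list = list(range(10_000))
data_set = set(data_list)

# Worst case for the list: the sought element is at the very end.
t_list = timeit.timeit(lambda: 9_999 in data_list, number=1_000)  # scans the list
t_set = timeit.timeit(lambda: 9_999 in data_set, number=1_000)    # one hash probe

faster_with_set = t_set < t_list  # expected to hold by a wide margin here
```

A one-line data-structure change like this is often worth checking before reaching for Cython or C.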
Depending on what data types you operate on, numpy (and now this new thing) can do some amazing things.
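A sketch of what numpy buys you on suitable data: the same reduction written as an interpreted Python loop and as one vectorized call that runs in numpy's compiled core.

```python
import numpy as np

xs = np.arange(10_000, dtype=np.float64)

def loop_sum_of_squares(values):
    total = 0.0
    for v in values:  # one interpreted iteration per element
        total += v * v
    return total

# Single call into compiled code; same arithmetic, far fewer interpreter steps.
vectorized = float(np.dot(xs, xs))
```

For integer-valued inputs of this size both versions are exact in float64, so they agree bit-for-bit; the vectorized form is simply doing the loop in C.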
Cython is helpful, but you have to spell out a lot of type information that shouldn't strictly be necessary. You might also try Numba; the easiest way to get it is via Anaconda CE or Wakari, both at http://continuum.io
Cython is good, but sometimes it's a bit tricky to bend it to do exactly what you want[1]. You'll probably still want to write that hot piece in C... but gluing it in with Cython is IMHO much nicer than using the raw Python/C API.
[1] On the other hand, it comes with an annotation tool that shows exactly what C code each line of your Cython compiles to, with color-coding to give a high-level overview of which pieces translated smoothly.
I would go with C/C++, as the ways to address performance there are well studied. There are many tools out there, like callgrind or nvvp, that will make it relatively pain-free.
I can narrow down performance in C/C++ quite quickly, but neither I nor anybody I know has done much of this for Python. Many people who I work with consider a Python implementation a prototype, while Fortran/C/C++ is mature real code worthy of attention.
The only real downside is that C/C++ requires a little knowledge of POSIX/Linux or Windows. This represents a learning curve, but once you are over it, these are quite durable, long-lasting skills.
So, is there anyone using Python for machine learning in production systems (i.e. not just for prototyping)? I would love to do it, but Java/Mahout seems a safer choice, performance-wise.
I wonder whether Blaze is a step in that direction.
I use Python for nearly all of my ETL processes that involve text processing. Even in production systems, I'd be hard-pressed to point to any significant performance issues. Python facilitates implementing algorithms in a functional style, which I tend to prefer over the imperative style (e.g., Java). With C++11 and boost, I'm able to translate my Python code to C++ while preserving the functional style, which has immensely simplified prototyping/deploying NLP/ML algorithms while simultaneously begetting enormous performance gains. I see Python as an extremely viable alternative to Java.
We also use Python in production at PlotWatt for machine learning. We started by prototyping in Matlab and then porting to C++, but have since found it much, much easier to just do everything in Python and numpy. When speed was an issue, we slightly changed the way we implemented the algorithm rather than reimplementing the same algorithm in a faster language. Admittedly this isn't always possible.
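A hypothetical example of the kind of rewrite being described (not PlotWatt's actual code): re-summing every window of a moving average is O(n·w), while keeping a running total is O(n), with identical results.

```python
def window_means_naive(values, w):
    # Re-sums each window from scratch: O(n * w) total work.
    return [sum(values[i:i + w]) / w for i in range(len(values) - w + 1)]

def window_means_running(values, w):
    # Slides the window in O(1) per step by updating a running total: O(n).
    total = sum(values[:w])
    means = [total / w]
    for i in range(w, len(values)):
        total += values[i] - values[i - w]
        means.append(total / w)
    return means
```

Same algorithm from the caller's point of view, but the reformulation removes the inner loop entirely, which is often enough to keep the code in Python.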
It would be great to eventually have a GPU version as well (as in the cases of Matlab and R). I saw a brief demo of Matlab on a Mac Retina Pro 15 where the GPU version ran 30x faster than the CPU version.
I read about Continuum after the fellow who developed NumPy left a few days ago to work on Continuum. I am curious to see actual projects using Continuum, so some sort of write-ups would be great.
in general, i like (ie i don't see a better solution than) the idea of having an AST constructed via an embedded language that is implemented by a library. but it does have downsides - integration with other python features is going to be much more limited (it gives the illusion of a python solution, but in practice you're off in some other world that only looks like python).
are there more details? i guess the AST is fed to something that does the work. and that something will have an API and be replaceable. but is that something also composable? does it have, say, a part related to moving data and another to evaluating data? so that you can combine "distributed across local machines" with "evaluate on GPU"?
> how does this compare to theano? it seems like some of the ideas are similar?
It's quite similar; we just take some of the ideas further and try to generalize the data storage to include storage backends that data scientists use more frequently (e.g. SQL, CSV, S3, etc.). We're very friendly with the Theano developers and hope to bridge the projects with a compatibility layer at some point.
> (it gives the illusion of a python solution, but in practice you're off in some other world that only looks like python).
I would argue that's what makes Python a great numeric language, and NumPy so successful. You get this high-level language where you can express domain knowledge, but also a 1:1 mapping to fast code execution at the C level. Blaze is the continuation of that vision.
> i guess the AST is fed to something that does the work. and that something will have an API and be replaceable.
Precisely. We build up an intermediate form called ATerm out of the constructed expression objects, do type inference and graph rewriting, and then pattern-match our layout, metadata, and type information against a number of backends to find the best one to perform the execution. Or, if needed, we build a custom kernel with Numba informed by all the type and data-layout information we've inferred from the graph.
We don't aim to solve all the subproblems in this area (expression optimization passes, distributed scheduling), but I think we have a robust enough system that others can build extensions to Blaze to do expression evaluation in whatever fashion they like.
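A toy illustration of the general pattern being discussed (this is not Blaze's real API): operators build an expression AST instead of computing anything, and a pluggable backend then walks the graph to evaluate it, so the same tree could in principle be handed to a GPU or distributed backend instead.

```python
class Expr:
    # Operators defer computation by returning AST nodes.
    def __add__(self, other):
        return Op("add", self, wrap(other))
    def __mul__(self, other):
        return Op("mul", self, wrap(other))

class Const(Expr):
    def __init__(self, value):
        self.value = value

class Op(Expr):
    def __init__(self, name, left, right):
        self.name, self.left, self.right = name, left, right

def wrap(x):
    # Lift plain numbers into the expression language.
    return x if isinstance(x, Expr) else Const(x)

def evaluate(node, backend):
    # `backend` maps op names to implementations; swapping the dict swaps
    # the execution engine without touching the expression graph.
    if isinstance(node, Const):
        return node.value
    return backend[node.name](evaluate(node.left, backend),
                              evaluate(node.right, backend))

python_backend = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

tree = Const(2) * Const(3) + 4           # deferred: nothing computed yet
result = evaluate(tree, python_backend)  # walk the AST: 2 * 3 + 4
```

The replaceable-backend question above maps onto the `backend` argument here; the composability question (separate data-movement and evaluation concerns) would correspond to splitting that interface further.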
ezl|13 years ago
freyrs3|13 years ago
* http://blaze.pydata.org/docs/
* https://github.com/ContinuumIO/blaze
rpm4321|13 years ago
I'm starting to run into some performance bottlenecks with Python, and so I'm just now looking at Cython, PyPy, Psyco, and... gasp... C.
From what little I've read, Cython is supposed to be as easy as adding some typing and modifying a few loops here and there, and you are in business.
IanOzsvald|13 years ago
http://ianozsvald.com/2012/03/18/high-performance-python-1-f...
Erwin|13 years ago
PS: check out things like http://packages.python.org/line_profiler/ that go beyond ordinary profiling.
travisoliphant|13 years ago
chadillac83|13 years ago
http://benchmarksgame.alioth.debian.org/u32/which-programs-a...
http://en.wikipedia.org/wiki/Lua_(programming_language)#C_AP...
lrem|13 years ago
frozenport|13 years ago
greenonion|13 years ago
law|13 years ago
dwiel|13 years ago
davidf18|13 years ago
freyrs3|13 years ago
[1] http://www.ustream.tv/recorded/26973799
[2] https://store.continuum.io/cshop/numbapro
Caligula|13 years ago
omni|13 years ago
andrewcooke|13 years ago
http://deeplearning.net/software/theano/
freyrs3|13 years ago
> are there more details?
Yes! See: http://blaze.pydata.org/
lucian1900|13 years ago
piqufoh|13 years ago
rerere|13 years ago
[deleted]