
Why Python rocks for research

210 points| agconway | 15 years ago |stat.washington.edu | reply

98 comments

[+] miloshasan|15 years ago|reply
Python is awfully close to being a superior (and free) replacement for Matlab, but there are a few annoyances that keep preventing me from switching forever. Unfortunately, these are mostly not bugs but bad design that is believed to be correct by the core developers, so it is unlikely to ever change:

- Matrices are a pain. The r_[] and c_[] operators could be a reasonable replacement for Matlab's elegant matrix construction syntax, but they do not work as expected (as smart hstack and vstack), instead doing something completely different and inconsistent for vectors and matrices.

- Tensors are a bigger pain. Matlab has very well-defined semantics for operations like permute and reshape; in NumPy these operations sometimes create just a view and at other times reshuffle the memory contents. I know the idea was to "protect" the user from having to know the memory layout of data, but this idea is bad.

- IPython is great in every way except when it comes to reloading parts of your program. After any tiny change to your code, the only safe thing to do is to quit IPython and start it again. All the other options (run, reset, reload...) make some secret and wrong assumptions about what you want to reload. In contrast, this works flawlessly in Matlab.
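
The view-versus-copy behavior described in the tensor point above can be sketched roughly as follows. This is a minimal illustration, assuming a reasonably recent NumPy (for `np.shares_memory`); the array names are made up for the example.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # C-contiguous 2x3 array

# Reshaping a contiguous array only needs new strides, so NumPy returns a view.
flat_view = a.reshape(6)
shares_contiguous = np.shares_memory(a, flat_view)

# The transpose is itself a view, but flattening it cannot be expressed
# with strides alone, so reshape silently makes a copy here.
flat_copy = a.T.reshape(6)
shares_transposed = np.shares_memory(a, flat_copy)

flat_view[0] = 100   # writes through to `a`
flat_copy[1] = -1    # does not touch `a`

print(shares_contiguous, shares_transposed)
print(a)
```

Checking `np.shares_memory` (or the `.base` attribute) is one way to find out which case you are in when it matters.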

[+] apl|15 years ago|reply
In the end, it's all about the ecosystem. Perl wins for bioinformatics because there are boatloads of scientists already using it, with all the neat libraries and resources that brings. Equally, Python wins for, say, prototyping in robotics because of libraries, support and so on.

There's nothing intrinsically science-apt about Python/Perl, but Ruby and friends can't compete when it comes to the programming environment; that's what counts.

[+] cageface|15 years ago|reply
As a language I prefer Ruby but the Python ecosystem for this kind of thing is definitely a huge advantage. You can actually do quite a bit with Ruby + GSL but it's still not really competitive.
[+] aero142|15 years ago|reply
Python is very accessible to casual programmers, so this might be one of the reasons it has been adopted by the scientific community.
[+] RodgerTheGreat|15 years ago|reply
This article makes some fairly convincing arguments that Python is a more flexible tool than Matlab or Perl, but I can't help but come away with the sense that the author hasn't tried many other languages.

There are an awful lot of languages that provide iterators, a powerful set of data structures, extensive libraries and facilities for structuring and maintaining large codebases: .NET languages (maybe F# would be good for this?), Java or most of the emerging languages for the JVM stack, Ruby (which is generally considered to be "different but equivalent" to Python), and so forth.

[+] mechanical_fish|15 years ago|reply
I only scanned the article very briefly, but my impression is that the important comparison is vs. Matlab. The other players aren't really in the author's game. It's a question of the use case and the community and the library support.

In theory, .Net could displace Matlab or Python as the canonical platform for scientific researchers. And in theory Python could displace PHP as the canonical platform for classic CRUD web apps. In practice neither is likely to happen, no matter how much we might or might not wish it to.

[+] hyperbovine|15 years ago|reply
It's all about Numpy. Mayavi, IPython, mlab & friends are great for when I need to plot things or look at data, but Numpy is the workhorse I keep coming back to day in and day out. And also, the thing I wish for most when I have to use other languages. The combination of the speed of C and the elegance of, well, anything that is not C, is hard to beat. Once you get down the basics of array broadcasting, types, etc., it's possible to do some amazingly elegant things in Numpy, and quickly too. The numpy library has seemingly every array function I have ever wanted. If I were Matlab I'd be scared :-)
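
For readers who have not met array broadcasting, here is a small sketch of the idea mentioned above. The data and variable names are invented for illustration; only standard NumPy calls are used.

```python
import numpy as np

data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])    # shape (2, 3)

# Subtract each column's mean from every row without writing a loop:
# (2, 3) - (3,) broadcasts the 1-D array across the rows.
col_means = data.mean(axis=0)         # shape (3,)
centered = data - col_means

# A multiplication table from two 1-D ranges:
# (3, 1) * (4,) broadcasts to (3, 4).
x = np.arange(3).reshape(3, 1)
y = np.arange(4)
table = x * y

print(centered)
print(table)
```

The loops still run, but inside NumPy's compiled code rather than in the interpreter, which is where the speed comes from.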
[+] bad_user|15 years ago|reply
SciPy, NumPy, Matplotlib, Cython, Sphinx, PIL. The article should've also mentioned NLTK.

IMHO, languages matter less than the available libraries, and in my experience only Java matches the depth of the Python ecosystem.

That Python is a nice language to work with, that's just a bonus.

[+] vog|15 years ago|reply
> There are an awful lot of languages that provide ...

Still, how many of them have a fast interactive interpreter ("command line") with decent usability? How many of those provide good libraries for numerical as well as symbolic math? With an API that is easy to write, to understand and to extend?

Python may not be the only language with those qualities, but there aren't many languages (and ecosystems around them) which can compete on all those areas.

Python seems to be one of the few "best fits" for scientific applications.

[+] wwortiz|15 years ago|reply
Do they all have equivalent libraries to python's scipy and matplotlib? I think that is why the author could move to python, as these provide a pretty large subset of what matlab has and make the transition less problematic.

Java probably does and .NET may have something like this but I don't know of them or their amount of documentation.

[+] crocowhile|15 years ago|reply
None of the languages you mention is easy enough to be grasped in a week by someone who never did programming in their life. Believe me: most of the people who suddenly have to do data analysis for their phd or postdoc projects hardly know how to use Excel, so ease of use is essential.

Also, you want a non-compiled language. Most of the time you do interactive programming and change parameters on the fly, according to the results of the analysis.

Finally, matplotlib, one of python's most complete graphics libraries, is a breeze to use. Making graphs in an interactive way with java or .net is simply impossible.

I am a neuroscientist and most people in the field use Matlab. I use python (in fact I use ONLY open source software, by choice). It's amazing how many advantages python gave me on my daily life.

[+] variety|15 years ago|reply
actually, he doesn't make any assertions that Python is more flexible than Perl (which would be rather doubtful), only that it is more readable (which, as a perlista, I'm sad to say is probably true).

but I also get the sense that this is the first time he's seriously delved into a dynamic programming language. much of what he's saying about Python is exactly what bioinformaticists were saying about Perl in the late 90s / early naughts.

[+] ogrisel|15 years ago|reply
And the corollary: Why do researchers never respect the PEP8 when they write python code?

Yes, I am overreacting a bit, since the blog post is very well written and I actually agree 100% with the content. But please, people: respect the PEP8 [1]. It makes your readers feel at home while reading your code. It is very important if you want to get new contributors to your project. See [2] for instance.

[1] http://www.python.org/dev/peps/pep-0008/
[2] http://www.dataists.com/2010/10/whats-the-use-of-sharing-cod...

[+] hogu|15 years ago|reply
I wasn't aware of pep8 when I started, most science people arrive at python from a different path. What I mean is, for a long time I knew much more about numpy than about python itself.

there are some things in pep8 that are bad for science: the spaces around operators, and also the 80 chars to a line... scientific expressions are often long and complicated. yes, you can write them while adhering to pep8, but it's kind of a PITA

[+] Avshalom|15 years ago|reply
Because researchers have never heard of pep8, and in general don't give a shit about domain specific politics unless it's their domain.
[+] BrandonM|15 years ago|reply
From PEP 8:

> The preferred place to break around a binary operator is after the operator, not before it.

I'd be interested in hearing the justification for this rule. I think that leading a continuation line with the binary operator makes it super-clear that it is a continuation line. What is the benefit of the preferred style? Compare:

  if (the_result_of_this_function(on_this_arg) == 10
      and this_overly_descriptive_boolean):
      do_stuff()

  if (the_result_of_this_function(on_this_arg) == 10 and
      this_overly_descriptive_boolean):
      do_stuff()

To me, the first one is quite clearly a continuation line (no statement can start with "and"). The second requires closer inspection.
[+] njharman|15 years ago|reply
PEP8 is wrong on several counts. It even understands this: the first section (after the introduction) is "A Foolish Consistency is the Hobgoblin of Little Minds", which is about the spirit of pep8 (readability and consistency) and explains some situations in which you should violate pep8.
[+] samd|15 years ago|reply
I don't think most researchers ever expect anybody to read their code. Woe to the graduate student who years later actually needs to use the code.
[+] killedbydeath|15 years ago|reply
I worked in projects where different people used slightly different coding styles and I did not find it getting in the way too much -- you just match the style of the code you are working with. I am surprised there are people who will not contribute to a project because of this.
[+] uriel|15 years ago|reply
This is one of the many reasons I love Go, gofmt takes care of almost all the silly style issues, and there is no need to learn any style guide, just run your code (or anyone's code) through gofmt, and you are done.
[+] rdouble|15 years ago|reply
I've wasted most of my professional life tweaking various unix software to make it work. However, the typical scientific python setup proved to be too frustrating to install on OSX. The recommended solution is to just buy the Enthought distro. If I'm paying for software anyway, why is Enthought better than Matlab?
[+] hogu|15 years ago|reply
disclaimer I work for enthought

I did my whole phd in matlab.

EPD is much cheaper and is free for academics

even if it weren't free, I would use it anyways.

but it really isn't why is EPD better than matlab, it's why python is better than matlab. matlab is a domain specific application with a domain specific language. It doesn't work well with things outside of its domain.

python is a general purpose language (And as such, has good general purpose constructs) but it happens to have excellent scientific and mathematical libraries. This is useful when you actually have to apply your research and build an application.

numpy is also better for large data, because slicing arrays does not create copies of them (you can make it do so if you want to, but it doesn't by default). In matlab, slicing large arrays can cause you to run out of memory.
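
A quick sketch of what "slicing does not create copies" means in practice. The array size and names are arbitrary, chosen only for illustration.

```python
import numpy as np

big = np.zeros((1000, 1000))          # roughly 8 MB of float64

# Basic slicing returns a view: no element data is copied.
window = big[100:200, ::2]
window[:] = 1.0                       # writing through the view changes `big`

# An explicit copy owns its own memory and leaves `big` untouched.
snapshot = big[100:200, ::2].copy()
snapshot[:] = 5.0

print(window.base is big)             # the view refers back to `big`
print(big.sum())
```

The flip side is that a view keeps the whole parent array alive, so call `.copy()` when you want a small piece to outlive a large buffer.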

Cython makes it really easy to start out with python, and then optimize your code down into C.

with python you can run your calculations over a massive compute grid. Use messaging libraries like PyZMQ to distribute your data and result, and build real time GUIs to consume the final results.

- a matlab cluster is quite expensive

- Chaco, another enthought python library which is free and open source, is great for real-time data visualization; matlab does not have anything equivalent.

- python has a large number of messaging libraries, with matlab I think you're stuck with MPI.

Matlab always made me feel limited. I would work on a problem, and then reach a point where Matlab could not do what I needed to do.

That rarely happens to me with python.

[+] wgrover|15 years ago|reply
I've never had any problems putting a research-grade python setup on OSX - just use the .dmg installer files available for python (2.6.X for compatibility), numpy, scipy, and matplotlib. But I agree that the Enthought Python Distro is also a good alternative (if a little bloated for my needs), and it's also free for academics.
[+] lliiffee|15 years ago|reply
Another option is to just install Sage. It "just works" to install. Though it is less about numerical computation than symbolic computation. (Though both are targets, there isn't really equal focus, in my opinion.)
[+] bwooceli|15 years ago|reply
I learned Python on the fly specifically for research. I used Django to build out an enterprise reporting/analytics system to support a customer experience (survey) program for the cost of time (huzzah open source). We had bids on this project upwards of 80k. We generate ~100k surveys / month and are able to get targeted, meaningful, automated insights directly to front-line management. Python FTW.
[+] shill|15 years ago|reply
Python + Django FTW.
[+] agconway|15 years ago|reply
Python rocks, but Python + R + bash rocks way harder for research
[+] levesque|15 years ago|reply
R is definitely powerful and a good part of any scientific data analysis toolkit.

I use python, ipython, matplotlib, numpy and R. I call my R scripts directly from python using rpy.

[+] sintesoro|15 years ago|reply
Python is good, you could also consider Maxima.

A single example:

f(x) := x^2+3*x+7;

Maxima provides: symbolic computation, BLAS and LAPACK integration for numeric algebra, a 500-page manual in several languages, and a complete library for statistics, differential equations, calculus and series. Graphics with matplotlib. Also, the Maxima language is not much more complicated than Python:

for i in range(10): print i*i versus for i:0 thru 9 do print i*i;

[i*2 for i in range(10)] versus makelist(i*2,i,0,9)

But Matlab's libraries are more extensive than those of Python and Maxima.

[+] kwantam|15 years ago|reply
Maxima rocks for symbolic math. I prefer it to Mathematica, which is saying a lot. In contrast, octave always feels like "almost-Matlab" and I still prefer the latter.

Also see wxMaxima, which will (among many other things) produce LaTeX for you.

[+] dagw|15 years ago|reply
Python has sage and sympy for doing symbolic math. Although admittedly they're quite primitive compared to maple and mathematica (haven't used Maxima, so I can't really compare)
[+] maurits|15 years ago|reply
I feel this article is somewhat unbalanced in its single-minded enthusiasm for a certain tool/environment. So in the same spirit, here are a couple of reasons not to switch from Matlab to Python, all stemming from my experience when I decided to try to switch from Matlab to Python/C.

- Installing all these packages on (any) system is painful. Different versions don't play together or don't work (yet) on some platform and/or architecture. This stems from my own experience of getting a version of python to work with numpy, scipy, matplotlib, opencv and PIL on Windows, Mac, and Linux machines. No 100 percent success yet on any platform.

- central and consistent documentation. Even for very simple cases, I got a bit of a headache. I encounter a python print statement for the first time that obviously differs somewhat from its c printf cousin. I google "python print syntax" only to find that the first xx hits, including the official documentation, do not cover the full specification of this statement. I fear the moment I might actually need detailed information on something less trivial.

- Numerical integration is more accurate in Matlab.

- Visualization capabilities of matlab are more powerful. But who knows, perhaps there is yet another package floating around :-)

- Matlab may not have advanced data-structures, but it is a rapid prototyping tool, for testing ideas. If I need to write an actual application, I will use a tool and language geared for that task.

[+] xiongchiamiov|15 years ago|reply
> central and consistent documentation. Even for very simple cases, I got a bit of a headache. I encounter a python print statement for the first time that obviously differs somewhat from its c printf cousin. I google "python print syntax" only to find that the first xx hits, including the official documentation, do not cover the full specification of this statement. I fear the moment I might actually need detailed information on something less trivial.

http://docs.python.org/reference/simple_stmts.html#the-print... lists everything about `print` - it takes stuff in, converts it to strings, and sticks it on stdout. It doesn't refer to printf-like formatting because that's for strings in general. If you weren't aware of this, you probably should have been going through a basic Python tutorial, rather than just jumping into the middle of things.

Python's documentation is the best I've encountered so far, and I find good docs to be an important value in the community, as well. I guess YMMV, though.
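
To make the print-versus-formatting distinction above concrete, here is a minimal sketch. It is written for Python 3, where `print` is a function rather than the statement discussed in this thread, and the variable names are made up for the example.

```python
import io

value = 3.14159

# printf-style formatting is an operation on strings; nothing is printed yet.
formatted = "%-8s|%6.2f" % ("pi", value)

# print only converts its arguments to strings and writes them out.
buffer = io.StringIO()
print(formatted, file=buffer)
print("pi", value, sep="=", file=buffer)
output = buffer.getvalue()

print(output, end="")
```

Because the formatting lives on the string, the same `formatted` value could just as well be written to a file or a log as sent to stdout.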

[+] hogu|15 years ago|reply
install is painful - enthought python distribution does make it pretty painless, but it's not free for non-academics

agreed on documentation

actually I think python's visualization capabilities are more powerful, have you looked at mlab? the 3d capabilities there are insane

I use python because I can do rapid prototyping, and turn it into a full application with the same code base.

did you ultimately go back to matlab?

[+] woodson|15 years ago|reply
Often languages are chosen based on already existing tools used in a specific research project. That's why it's good to be able to quickly pick up new languages. For example, some frameworks integrate their own scripting language, like Scheme in the Festival speech synthesis system. In the end, this often results in projects involving things like Python, Scheme, bash, R and a bit of tcl ;-).
[+] b_emery|15 years ago|reply
> In MatLab everything is flat – all functions are declared in the global namespace. However, this discourages code reusability by making the programmer do extra work keeping disparate program components separate. In other words, without a hierarchical structure to the program, it’s difficult to extract and reuse specific functionality.

I completely disagree. Reusable Matlab code has been my holy grail for the last couple of years. The key is to break out specific functionality as subfunctions. When these are abstracted and generally useful elsewhere, then they become new tools for the toolbox. The subfunctions also make great starting points for repurposing code. This layout results in much less work.

[+] tel|15 years ago|reply
It's absolutely true that using more functions makes your code more maintainable. It's also absolutely true that nearly every other language on the planet does this better than Matlab.
[+] ansgri|15 years ago|reply
One more powerful combo is R+Java+C. UI, integration and massive data processing in Java, R for prototyping, plotting and model fitting, and C if you have numerical simulations.