Using D and std.ndslice as a Numpy Replacement

[+] RogerL|10 years ago|reply

Does D have code for: plotting, optimization, probability distributions, machine learning, Fourier transformations, masked arrays, finanial calculations, structured arrays (read a CSV from disk, get named columns based on the header), SVD, QR and Cholesky decomposition, eigens, least squares, Levenberg Marquardt, matrix inverse and pseudoinverses, integration, Runge Kutta, interpolation, bsplines, fft convolves, multidimensional images, KDTrees, symbolic equation solvers, merge/join of data sets, etc.?

Because I use almost all of these every single day (I don't do multidimensional images or b-splines much at all). Are those all in standard libraries, fully documented, backed by 60 year old, fully debugge code (LAPACK, etc), that I reliably email to anyone across the world and they can immediately run and modify my code because it is such a standard? I honestly don't know, but I'm guessing not.

I use Python/Numpy/Scipy/Pandas/Matplotlib because everyone else in the world knows and uses them; they are a standard. Yes, my np.mean() might be slower than your map(). I almost always don't care. That misses the forest for the trees.

The article might be a good argument for why library writers might consider building out D's standard library to support numerical computation, I dunno. But no one is going to use D for serious number crunching without that infrastructure in place. People moved from Fortran and Matlab to Python not because it is fast, but for the environment. These language tricks are cute and all (I like D well enough, don't get me wrong), but it ain't why we are using Python.

At this point, if I were to switch languages to something without a lot of adoption I'd lean towards Julia. It also have a modern language design, but it is written from the ground up for numerical computation. I can't think of any reason I'd ever reach for D.

[+] vidarh|10 years ago|reply

> Does D have code for: [...]

He covered this, specifically out that Numpy has an advantage in the number of libraries and other resources available for it.

> because everyone else in the world knows and uses them;

No, everyone working in some specialized niches does. And those niches don't even scratch the surface of the number of people who regularly do simple operations on multi-dimensional arrays but who don't need most of what you list.

And sometimes people even in your niche that does depend on the other stuff run into things that are just way too slow, that they have to rewrite in a lower level language.

There are plenty of people for whom this will be very useful without having all the support infrastructure around Numpy in place.

> I can't think of any reason I'd ever reach for D.

D is for people who wants a "C++ done right", where it is implicit that you want a C-like language, and share ideals of C++ such as to whatever extent possible not paying for what you don't use.

Specifically in this case, you'd reach for D if you need multi-dimensional arrays and either prefer a C++-like language over Python or your Python code is too slow and you prefer to avoid a lower level language like C.

That may not be for you, and that's fine. Just like Numpy isn't nearly universal for numerical computing.

> Yes, my np.mean() might be slower than your map(). I almost always don't care.

I find this an amusing objection... I did my MSc thesis on methods for reducing error rates in OCR, and included methods that used various nearest-neighbour variations and statistical methods on bitmap slices and similar.

I did it mainly in Ruby (with some inline C where absolutely necessary for performance - if I remember correctly it was about a dozen lines of code in C) because I cared far more about readable code than fast code, and Python would in my eyes have been a double loss: not the speed of C, yet still ugly and annoying to work with in my opinion.

Seeing the D code examples, I might very well have picked D for it today.

[+] n00b101|10 years ago|reply

>Are those all in standard libraries, fully documented, backed by 60 year old, fully debugge code (LAPACK, etc) ... yes, my np.mean() might be slower than your map(). I almost always don't care.

Hmm ... LAPACK itself is only 23 years old (1992). Its was implemented to be a faster replacement of the "standard" LINPACK library, by effectively exploiting the caches on modern processors.

For most serious users of numerical computing, performance is extremely important.

Also, I think you might be putting far too much faith in an arbitrary assortment of Python libraries. If you think that NumPy is just a thin wrapper on LAPACK, and you are using it in mission or life critical operations, then I think you have some code vetting to do.

Even if some of your chosen libraries do use LAPACK, and even if there is some code in LAPACK that is 60 years old, that does not suddenly mean that all of your chosen Python libraries (and your code on top of them) is all correct. Most serious users of many of the routines that you mention (e.g. "machine learning" and "financial calculations") implement and test these routines themselves in order to ensure their correctness. In very serious model vetting operations, it would be customary to even invest in two different independent implementations of the same routine.

> I reliably email to anyone across the world and they can immediately run and modify my code because it is such a standard?

I don't think "reliable" and "emailing scripts" normally go together. It would be more customary to use a source code control system in a high reliability environment. Personally I have found reproducing Python installations to be very tricky and I would take extreme care in any critical operations that assume two different Python installations produce the exactly the same results (high reliability, like some of the 60+year old codes run by NASA, would even require bug for bug reproducibility).

"Python/Numpy/Scipy/Pandas/Matplotlib" are not the last word on anything, theyt are not standardized as you imply, and they leave vast room for improvement in all dimensions (performance, correctness, reliability, productivity, etc).

[+] 9il|10 years ago|reply

Yes, Julia is amazing! In the same time, if you want to write a package for Julia you _may_ need to use C/C++. D is going to have integration with Julia in 2016 ;)

D already have good integration with Python. You may want to read this article http://d.readthedocs.org/en/latest/examples.html#plotting-wi... (it may be a little bit outdated).

[+] dallbee|10 years ago|reply

A lot of that is up and coming.

https://github.com/DlangScience

[x] Plotting: http://code.dlang.org/packages/plot2kill

[+] Optimization: Nothing directly for the purpose, but the intermediate steps are done. And a little bit of work from http://code.dlang.org/packages/atmosphere

[x] Probability Distributions: http://dlangscience.github.io/dstats/api/dstats/random.html and http://dlang.org/phobos/std_mathspecial.html

[-] Machine Learning: Not that I'm aware of, but I vaguely recall some work being done on this

[-] Financial calculations: I couldn't find anything, but I'd be surprised if this wasn't implemented already.

[x] Masked Arrays: Accomplished via language features in combination with ndslice

[x] Structured Arrays: Absolutely. Standard library support as well as a number of 3rd-party libs can do this.

[-] SVD

[?] QR: Afraid I'm not sure what you're referring to.

[+] Cholesky Decomposition: Trivial to implement

[x] Eigenvalues/Eigenvectors: Yes. SciD provides Eigens as well as two other third party packages. https://dlangscience.github.io/scid/api/scid/linalg/eigenval...

[x] Least Squares (linear): Trivial to implement, but also implemented by a few of the 2-d & 3-d graphics libraries.

[x] Least Squares (nonlinear): Nothing, but we both know that this is easily done by applying a function to your independent variable to linearize it in many cases.

[-] Levenberg Marquardt

[x] Matrix Inverse: Provided by SciD

[?] pseudoinverses: I'm afraid my math background doesn't cover that, and I couldn't find anything with a similar naming in the package repository

[x] Integration: SciD provides integration

[-] Range Kutta

[-] Interpolation: Sort of. A few of the graphics libraries have this, but it's asking a bit much to include a graphics library to do interpolation.

[-] Bsplines

[-] fft convoles

[?] Multidimensional images

[?] KDTrees: Probably in one of the 3d graphics libs

[+] Symbolic equation solvers: Some of this was done for Tango back in the day, and it has been ported, but the project looks fairly dead. https://github.com/opticron/libdmathexpr/blob/master/source/...

[x] Merge/join of data sets: Can be done efficiently and easily with core language features

Additionally, anything written for C can be trivially wrapped and used in D. An example is http://scimath.com/, which completes much of what you mentioned.

Are they in standard libraries? SciD needs more time Fully Documented? D tends to encourage good API documentation, and the algorithms are largely ports of "69 year old, fully debugged code".

I think you're point about "reliably email to anyone across the world" is an overstatement. Python is popular, not ubiquitous. D is obviously much less popular. Can anyone run the code? Yes. For them to modify it, they need to learn the language.

You might not care about how long computations take, but I know back at my old university there were dozens of researchers complaining about the resources that had to be spent on supercomputer time, and the annoying amount of time that many calculations involving things like molecular configurations can take. Speeding this up saves money and makes the research less painful.

"No one is ever going to use D for serious number crunching without the infrastructure in place" - Yeah, that's totally true. The infrastructure is a WIP. There's nothing wrong with that.

Your post comes across to me as being rather cynical - but why not be supportive of the good work that's being done?

[+] jboy|10 years ago|reply

Credit to the D developers for providing a concise, carefully-designed library for N-D array processing. The chained method invocations demonstrate D's UFCS (Uniform Function Call Syntax) nicely. And it's a definite bonus that you can use underscore like a comma separator in long integer literals (eg, `100_000`).

But if you use Python + Numpy/Scipy/Matplotlib and you're looking for a modern, compiled language for execution speedups or greater flexibility than what Numpy broadcasting operations provide by default, I would recommend Nim. It's as fast as C++ or D, it has Pythonic syntax, and it already includes many of D's best features (including type inference, UFCS, and underscores in integer literals).

And best of all, you don't need to rewrite all your existing Python+Numpy code into a new language to start using Nim.

The Pymod library we've created allows you to write Nim functions, compile them as standard CPython extension modules, and simply drop them into your existing Python code: https://github.com/jboy/nim-pymod

The Pymod library even includes a type `ptr PyArrayObject` that provides native Nim access to Numpy ndarrays via the Numpy C-API [ https://github.com/jboy/nim-pymod#pyarrayobject-type ]. So you can bounce back and forth between your Python code and your Nim code for the cost of a Python extension module function call. All of Numpy, Scipy & Matplotlib are still available to you in Python, in addition to statically-typed C++-like iterators in Nim+Pymod [ https://github.com/jboy/nim-pymod#pyarrayiter-types , https://github.com/jboy/nim-pymod#pyarrayiter-loop-idioms ]. The Nim for-loops will be compiled to C code that the C compiler can then auto-vectorize.

[+] 9il|10 years ago|reply

D has integration with Python/Matplotlib too =P http://pyd.readthedocs.org http://d.readthedocs.org/en/latest/examples.html#plotting-wi...

[+] stevieboy|10 years ago|reply

jboy, Can nim-pymod be used as VLA's for nim? I'm not too fond of the nim seq'type (bit slow for my usage) and prefer arrays, but need their length allocated at runtime. Can this be done via (albeit a clunky route) through nim-pymod? i.e arrays created and accessed all in nim (no python)?

[+] zardeh|10 years ago|reply

And here we have a case of why microbenchmarks don't work. What you're measuring here isn't a speed difference in the mathematical code, its a constant time overhead from calling into the numerical libs. Up your array size by 100 times and this will become evident.

Why do I say this? Because inlining the python function to

means = numpy.mean(numpy.arange(100000).reshape((100, 1000)), axis=0)

from the original example in the article cut the benchmark time in down from around 215us to 205 us in my testing. That was done by removing a single python bytecode instruction.

Its quite likely that the D numerical code is actually slower than the LAPACK based python numerical code, but you're hiding this in the constant time overhead of a few python function calls.

[+] bionsuba|10 years ago|reply

As I stated in the article, I did not include the array creation in the benchmark in order to be fair to Numpy with its slow initialization times. The only python code that I benchmarked was the numpy.mean line.

[+] cannam|10 years ago|reply

I occasionally rewrite Python+NumPy signal processing code in C++ for purposes of packaging and integration with native apps, so I read these examples with an eye to how they compare with typical C++, rather than with NumPy. They compare very well, and it would never have occurred to me to look into D as a possibility for this sort of code.

I'm guessing the GC might rule it out for many cases where you do signal processing in C++, but I may as well ask: what's the deployment side of things like? Can I easily build a shared library and use it from a C++ application?

[+] jboy|10 years ago|reply

You might also be interested to check out Nim. It transpiles to C before invoking the C compiler, so it runs as fast as C++ and has excellent C-compatibility (and by extension, excellent C++-compatibility).

Compiling a shared library is as easy as passing the "--app:lib" option to the Nim compiler: http://nim-lang.org/docs/nimc.html#compiler-usage-command-li...

The GC is optional; you can manage your memory manually if you prefer: http://nim-lang.org/docs/manual.html#types-reference-and-poi...

The Nim tutorial is here if you want to have a quick skim: http://nim-lang.org/docs/tut1.html

[+] snydly|10 years ago|reply

Am I correct in thinking that this is only reasonable if you're already using D? The switching cost seems too high if you're python everything.

I tried the Armadillo C++ library a while ago (http://arma.sourceforge.net/). The speed up and time spend learning the syntax didn't seem worth it.

[+] bionsuba|10 years ago|reply

I completely understand that for existing projects it might not make sense to switch, but as I say at the start of the article

    why you should consider D for your next numerical project.

[+] p4wnc6|10 years ago|reply

The central claim of the post seems summarized by this quote:

> For example, when using a non-numpy API or functions that don't use Numpy that return regular arrays, you either have to use the normal Python functions (slow), or use np.asarray which copies the data into a new variable (also slow).

but I disagree strongly with this.

First of all, if there is a common use case for some set of operations that need to be performed on very large data (the type of data you'd look to NumPy to handle), then generally there is already a subpackage within numpy/scipy/scikits/pandas/etc that already deals with that use case and natively handles it with NumPy arrays, with no switching cost to convert back forth between lists or tuples or whatever.

And, of course, when a list/tuple-heavy API is only meant to deal with small data, it's not a problem to use NumPy's facilities for converting between ndarray and the builtin array types. In cases where you're dealing with a huge breadth of small data, then that casts doubt on whether you should be using NumPy; it wouldn't be casting doubt on whatever the other list/tuple-heavy API is. And probably parallelization (or even the buffer stuff I mention below) is a fine solution in that case.

Second, in a lot of cases you can make use of the Python Buffer Protocol to share the underlying data of a NumPy array without copying it. This won't help if some other API expects Python lists or tuples, but the great thing about dynamic typing in Python is that all that really matters is that whatever underlying buffer type you need implements whatever methods that other API expects to call.

You can always write your own extension type that adheres to the Buffer Protocol and also provides whatever API is needed to conform to some other library, so the power to create these double-sided adapters (one side sharing data with NumPy, the other side appearing like a drop-in acceptable data structure for the other library API) is very powerful and generic. It might take some getting used to the first few times you do this, but if you use tools like Cython to help, it's really quite easy to do, easy to maintain, and solves a surprisingly wide range of NumPy integration problems. In fact, these things generally already exist for most problems you will run into and ultimately they often boil down to simple Cython-based wrappers around C bindings to the other Python API you're working with.

I would argue that the existence of this Buffer Protocol adapter strategy alone is enough to say that the switching cost to D is virtually never worth it, and still pretty speculative even if you're starting a brand new numerical computing project.

Finally, most Python libraries that heavily rely on the list or tuple APIs are not meant for large data (those APIs mostly already just use NumPy, as I mentioned, or else they use generators and let the end user decide which array type will eventually be instantiated as the results are consumed). It's not common, by intentional design, for list/tuple-heavy APIs to need to cope with large data, so when someone says something off the cuff like "What do you do when some library API needs lists and you've got NumPy arrays?" it sounds like a worrisome case, but in practice it's really, really uncommon that such a situation arises and no one else ran into it before you and no one has created a NumPy-compliant solution already. It's not impossible, of course. Just unlikely, and probably not important enough to use as a basis for language choice, unless you're facing a really special sort of API problem.

Edit: None of this should read at all as a criticism of the D language or this particular implementation of ndarray data structures. All that stuff is great and anyone wishing to use D absolutely should.

I'm only arguing against the post's central thesis insofar as it is used to justify considering D as an alternative to scientific Python. The problem that the post points out already has solutions solely in the Python ecosystem, many projects have handled that problem before, and the problem is pretty rare and esoteric anyway, so it's probably not a good thing to use as the basis of an important choice like which language to use, or whether to switch languages.

There could be many reasons to prefer D over scientific Python depending on a given use case, and there could be certain situations where switching from Python to D is a good idea. Whatever those cases may be, the central issue of this post, performance degradation caused by NumPy-to-other-API compatibility, is not one of them.

[+] srean|10 years ago|reply

Very nicely put and I agree. What I was expecting to see in the list of numpy problems mentioned in the article wasn't there. The major problem that makes Numpy performance lag behind C++ or Fortran is the extra level of indirection that is needed for accessing an element (via stride ptr), and the need for extra copies that is forced on you by vectorization. Numexpr can help for certain cases of the latter, but its still quite limited in the type of expressions that numexpr can handle. It is my belief that with some local static analysis both can be mitigated somewhat.

It would be really interesting to see whether D's nd object tackles these issues. The copy of arrays across function boundaries, as you correctly pointed out, is mostly a red herring.

[+] tadlan|10 years ago|reply

Use numba to compile python loops or array expressions to fast llvm, and problem solved. I'm sticking with python.

[+] bionsuba|10 years ago|reply

Considering that in the benchmarked example, only one like of Numpy code was used which already uses compiled C, I have a hard time believing that that would catch up to using all compiled code.

[+] fizixer|10 years ago|reply

click-bait. Why should I change my language even if I'm looking for a numpy replacement?

[+] bionsuba|10 years ago|reply

How is this clickbait? Did I in any way misrepresent the content of the article?

If you don't like D and don't want to use it move on to the next item on the front page.

58 comments