top | item 38851344

dsharlet|2 years ago

This is really overstating how hard it is to compete with matrix multiply libraries. The main reason those libraries are so big and have had so much work invested in them is their generality: they're reasonably fast for almost any kind of input.

If you have a specific problem with constraints you can exploit (e.g. known fixed dimensions, sparsity patterns, data layouts, type conversions, etc.), it's not hard at all to beat MKL, etc... if you are using a language like C++. If you are using python, you have no chance.

It isn't even necessarily that different from a few nested loops. Clang is pretty damn good at autovectorizing, you just have to be a little careful about how you write the code.

geysersam|2 years ago

If you need a really fast compiled inner loop, you can often implement it in Python using Numba. With it, you can easily implement something like sparse matrix multiplication in Python.
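A minimal sketch of what that can look like: CSR sparse matrix-vector multiply written with explicit loops, the style that Numba's `@numba.njit` decorator is designed to compile. Function and variable names here are illustrative, and the code is shown as plain Python/NumPy so it runs as-is even without Numba installed; adding the decorator would JIT-compile the loops.

```python
import numpy as np

def csr_matvec(data, indices, indptr, x):
    """y = A @ x, with A stored in CSR (compressed sparse row) form.

    Written with explicit loops over the CSR arrays, the kind of code
    a JIT like Numba compiles to fast machine code; runs (slowly) as
    plain Python here.
    """
    n_rows = len(indptr) - 1
    y = np.zeros(n_rows)
    for i in range(n_rows):
        acc = 0.0
        # Nonzeros of row i live in data[indptr[i]:indptr[i + 1]].
        for k in range(indptr[i], indptr[i + 1]):
            acc += data[k] * x[indices[k]]
        y[i] = acc
    return y

# A = [[1, 0, 2],
#      [0, 3, 0]]  in CSR form:
data = np.array([1.0, 2.0, 3.0])
indices = np.array([0, 2, 1])
indptr = np.array([0, 2, 3])
x = np.array([1.0, 1.0, 1.0])

print(csr_matvec(data, indices, indptr, x))  # -> [3. 3.]
```

With Numba, the only change would be `@numba.njit` above the function definition; the loop body stays identical.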

p-e-w|2 years ago

> If you are using python, you have no chance.

Of course you do. Every special-case multiplication algorithm you might need already has an optimized implementation that you can just `pip install`, and move on with what you're actually working on.

The whole scientific computing world runs on Python. Straightforward numerics code using NumPy tends to murder C/C++ code with regard to performance, unless that code is written by people who make a living hand-optimizing computational routines.
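As an illustration of the claim (a sketch, not a benchmark): the "straightforward" NumPy version is a one-line `a @ b` that dispatches to whatever BLAS NumPy was built against, while the naive C equivalent is the textbook triple loop, written out here in Python for comparison.

```python
import numpy as np

def matmul_loops(a, b):
    """Textbook triple-loop matrix multiply, i.e. what naive C code does."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            for p in range(k):
                c[i, j] += a[i, p] * b[p, j]
    return c

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64))
b = rng.standard_normal((64, 64))

# Same result; a @ b runs the optimized BLAS kernel linked into NumPy,
# which is typically far faster than untiled, unvectorized loops.
assert np.allclose(matmul_loops(a, b), a @ b)
```

The gap the comment is pointing at is between `a @ b` and a *naive* C loop nest; hand-tiled, vectorized C is a different story, which is the parent's point.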

SideQuark|2 years ago

> The whole scientific computing world runs on Python

Only if you ignore the majority of scientific code: the stuff running on supercomputers, doing most of the actual science, is written in C++ and Fortran.

Even in areas where Python is used, the majority of the compute runs in C/C++/Fortran, with a little Python as glue.

If you think NumPy (written in C/C++) murders C/C++ code, you should learn about HPC, where the really high performance happens. They don't use NumPy.

bjourne|2 years ago

> This is really overstating how hard it is to compete with matrix multiply libraries.

I'll file this under "talk is cheap". :) I tried it last year and got within 50% of BLAS. Getting beyond that is tons of work, which you have to repeat for every processor model, every NUMA configuration, and every combination of matrix shape (long and thin, short and wide, etc.).

dsharlet|2 years ago

This gets to 90% of BLAS: https://github.com/dsharlet/array/blob/38f8ce332fc4e26af0832...

The less involved versions still get ~70%.

But this is also quite general. I’m claiming you can beat BLAS if you have some unique knowledge of the problem that you can exploit. For example, some kinds of sparsity can be implemented within the above example code yet still far outperform the more general sparsity supported by MKL and similar.

lifthrasiir|2 years ago

You need to have enough experience to know how to be "a little careful", though. This is generally true for most languages; loop unrolling works even better in Python, for example, but many Python programmers aren't even aware of the possibility.