dsharlet | 2 years ago
If you have a specific problem with constraints you can exploit (e.g. known fixed dimensions, sparsity patterns, data layouts, type conversions, etc.), it's not hard at all to beat MKL, etc., if you are using a language like C++. If you are using Python, you have no chance.
It isn't even necessarily that different from a few nested loops. Clang is pretty damn good at autovectorizing; you just have to be a little careful about how you write the code.
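A minimal sketch of the kind of code this describes (my own illustration, not dsharlet's actual code): a matmul whose dimensions are fixed at compile time as template parameters. Keeping the innermost loop at unit stride is the "be a little careful" part; with that, Clang at `-O3` will typically vectorize it without any intrinsics.

```cpp
#include <cstddef>

// Hypothetical example: dimensions M, N, K are compile-time constants,
// so the compiler can unroll and vectorize aggressively with no runtime
// dispatch. The inner loop over j touches b and c with unit stride,
// which is the access pattern Clang's autovectorizer handles best.
template <size_t M, size_t N, size_t K>
void matmul(const float (&a)[M][K], const float (&b)[K][N], float (&c)[M][N]) {
    for (size_t i = 0; i < M; ++i) {
        for (size_t j = 0; j < N; ++j)
            c[i][j] = 0.0f;
        for (size_t k = 0; k < K; ++k)
            for (size_t j = 0; j < N; ++j)  // unit-stride inner loop: vectorizes
                c[i][j] += a[i][k] * b[k][j];
    }
}
```

Note the i-k-j loop order: the more obvious i-j-k order puts a strided read of `b` and a scalar reduction in the inner loop, which vectorizes much less cleanly.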
geysersam | 2 years ago
p-e-w | 2 years ago
Of course you do. Every special-case multiplication algorithm you might need already has an optimized implementation that you can just `pip install` and move on with what you're actually working on.
The whole scientific computing world runs on Python. Straightforward numerics code using NumPy tends to murder C/C++ code with regard to performance, unless that code is written by people who make a living hand-optimizing computational routines.
SideQuark | 2 years ago
Only if you ignore the majority of scientific code running on supercomputers, which does most of science in C++ and Fortran.
Even in areas where Python is used, the majority of the compute runs on C/C++/Fortran, with a little Python as glue.
If you think NumPy (written in C/C++) murders C/C++ code, you should learn about HPC, where really high performance happens. They don't use NumPy.
bjourne | 2 years ago
I'll file this under "talk is cheap". :) I tried it last year and got within 50% of BLAS. Getting above that is tons of work, which you have to repeat for every processor model, every NUMA topology, and every combination of matrix shapes (long and thin, short and wide, etc.).
dsharlet | 2 years ago
The less involved versions still get ~70%.
But this is also quite general. I'm claiming you can beat BLAS if you have some unique knowledge of the problem that you can exploit. For example, some kinds of sparsity can be implemented within the above example code yet still far outperform the more general sparse routines supported by MKL and similar.
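To make the sparsity point concrete, here is one hypothetical sketch of my own (not dsharlet's code): if you know B is banded with half-bandwidth W, i.e. `b[k][j] == 0` whenever k and j differ by more than W, you can restrict the reduction loop to the band. A general sparse library has to chase index arrays at runtime; here the structure is baked in at compile time and the inner loop stays dense and predictable.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical sketch: C = A * B where B (K x N) is known to be banded
// with half-bandwidth W (b[k][j] == 0 when |k - j| > W). Restricting k
// to the band does O(M*N*W) work instead of O(M*N*K), with no index
// arrays or branches on values: the structure is a compile-time fact.
template <size_t M, size_t N, size_t K, size_t W>
void matmul_banded(const float (&a)[M][K], const float (&b)[K][N],
                   float (&c)[M][N]) {
    for (size_t i = 0; i < M; ++i) {
        for (size_t j = 0; j < N; ++j) {
            const size_t k0 = j > W ? j - W : 0;       // first in-band k
            const size_t k1 = std::min(K, j + W + 1);  // one past last in-band k
            float acc = 0.0f;
            for (size_t k = k0; k < k1; ++k)
                acc += a[i][k] * b[k][j];
            c[i][j] = acc;
        }
    }
}
```

With W = 0 this degenerates to multiplying by a diagonal matrix, touching each column of B exactly once; a general sparse kernel would pay indirection costs for the same work.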
lifthrasiir | 2 years ago