I actually did some benchmarking on this once upon a time[1] and it turns out that it's actually really easy to write something that is only ~5x slower[2] (until you get to matrices that don't fit in RAM):
Obviously OpenBLAS is so easy to package that it's not really worth avoiding it, but it was very eye-opening to see just how easy it is to get within an order of magnitude (easier, in fact, than getting into the 10x-20x range).
Sean1708|6 years ago
Obviously OpenBLAS is so easy to package that it's not really worth avoiding it, but it was very eye-opening to see just how easy it is to get within an order of magnitude (easier, in fact, than getting into the 10x-20x range).
[1]: https://gist.github.com/Sean1708/69c5694048e9a9ca7bd84fcbc9e...
[2]: 8-core 3.4GHz Haswell i7 with 32kB L1, 256kB L2, 8MB L3, and 8GB RAM.
tda|6 years ago