No, you can still trust compilers: 1) The hand-tuned BLAS routines are essentially a different algorithm with hard-coded information. 2) OpenBLAS is multithreaded by default (via OpenMP or pthreads, depending on how it was built), so much of the speed likely comes from multithreading. Set the OMP_NUM_THREADS environment variable to 1 before running your benchmarks. You will still see a significant performance difference, partly because of that hard-coded information in the OpenBLAS implementation.
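For anyone who wants to script the single-threaded run suggested above, here is a rough sketch rather than a definitive setup. The helper names `single_thread_env` and `run_single_threaded` are made up for illustration, and the child command is only a stand-in for a real benchmark binary. Setting `OPENBLAS_NUM_THREADS` as well is my own addition (OpenBLAS documents it as taking priority over `OMP_NUM_THREADS`); the comment above only mentions `OMP_NUM_THREADS`.

```python
import os
import subprocess
import sys


def single_thread_env(base=None):
    """Return a copy of the environment with BLAS threading limited to one thread."""
    env = dict(os.environ if base is None else base)
    env["OMP_NUM_THREADS"] = "1"       # the setting suggested in the comment above
    env["OPENBLAS_NUM_THREADS"] = "1"  # assumption: added so OpenBLAS-specific settings agree
    return env


def run_single_threaded(cmd):
    """Run a benchmark command (hypothetical) in a child process with one BLAS thread."""
    return subprocess.run(cmd, env=single_thread_env(), check=True,
                          capture_output=True, text=True)


if __name__ == "__main__":
    # Stand-in for a real benchmark: a child process that echoes the setting it sees.
    child = [sys.executable, "-c",
             "import os; print(os.environ['OMP_NUM_THREADS'])"]
    print(run_single_threaded(child).stdout.strip())
```

The variables are set in the child's environment before it starts because threading libraries typically read them once at load time, so changing them after the library is loaded may have no effect.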
marshallward|1 year ago
As for the original post, I felt a bit embarrassed about my original comments, but I think the compilers actually did fairly well given what they had to work with, which seems to be what you are saying in your first point.