dsharlet | 2 years ago
Here's what I see:
$ clang++ --version
clang version 18.0.0
$ time make bin/matrix
mkdir -p bin
clang++ -I../../include -I../ -o bin/matrix matrix.cpp -O2 -march=native -ffast-math -fstrict-aliasing -fno-exceptions -DNDEBUG -DBLAS -std=c++14 -Wall -lstdc++ -lm -lblas
1.25user 0.29system 0:02.74elapsed 56%CPU (0avgtext+0avgdata 126996maxresident)k
159608inputs+120outputs (961major+25661minor)pagefaults 0swaps
$ bin/matrix
...
reduce_tiles_z_order time: 3.86099 ms, 117.323 GFLOP/s
blas time: 0.533486 ms, 849.103 GFLOP/s
$ OMP_NUM_THREADS=1 bin/matrix
...
reduce_tiles_z_order time: 3.89488 ms, 116.303 GFLOP/s
blas time: 3.49714 ms, 129.53 GFLOP/s
My inner loop in perf: https://gist.github.com/dsharlet/5f51a632d92869d144fc3d6ed6b...
BLAS inner loop in perf (a chunk of it, it is unrolled massively): https://gist.github.com/dsharlet/5b2184a285a798e0f0c6274dc42...
Despite being on a current-ish version of clang, I've been getting similar results from clang for years now.
Anyways, I'm not going to debate any further. It works for me :) If you want to keep writing code the way you have, go for it.
bjourne | 2 years ago
dsharlet | 2 years ago
You're probably now comparing parallel code to single-threaded code.