top | item 38868948


dsharlet | 2 years ago

The makefile asks for -O2 with clang. I find that -O3 almost never helps in clang. (In gcc it does.)

Here's what I see:

   $ clang++ --version
   clang version 18.0.0

   $ time make bin/matrix
   mkdir -p bin
   clang++ -I../../include -I../ -o bin/matrix matrix.cpp  -O2 -march=native -ffast-math -fstrict-aliasing -fno-exceptions -DNDEBUG -DBLAS  -std=c++14 -Wall -lstdc++ -lm -lblas
   1.25user 0.29system 0:02.74elapsed 56%CPU (0avgtext+0avgdata 126996maxresident)k
   159608inputs+120outputs (961major+25661minor)pagefaults 0swaps

   $ bin/matrix
   ...
   reduce_tiles_z_order time: 3.86099 ms, 117.323 GFLOP/s
   blas time: 0.533486 ms, 849.103 GFLOP/s

   $ OMP_NUM_THREADS=1 bin/matrix
   ...
   reduce_tiles_z_order time: 3.89488 ms, 116.303 GFLOP/s
   blas time: 3.49714 ms, 129.53 GFLOP/s
My inner loop in perf: https://gist.github.com/dsharlet/5f51a632d92869d144fc3d6ed6b... BLAS's inner loop in perf (a chunk of it; it's massively unrolled): https://gist.github.com/dsharlet/5b2184a285a798e0f0c6274dc42...

Despite being on a current-ish version of clang, I've been getting similar results from clang for years now.

Anyways, I'm not going to debate any further. It works for me :) If you want to keep writing code the way you have, go for it.

bjourne | 2 years ago

-O2 did improve performance significantly, but it's still 0.7 s for NumPy versus 5.1 s for your code on 4096x4096 matrices. Either you're using a slow version of BLAS, or you're benchmarking with matrices that are comparatively tiny (384x1536 is nothing).

dsharlet | 2 years ago

BLAS is getting almost exactly 100% of the theoretical peak performance of my machine (CPU frequency * 2 fmadd/cycle * 8 lanes * 2 ops/lane); it's not slow. I mean, just look at the profiler output...

You're probably now comparing parallel code to single-threaded code.