top | item 45406379

sfilmeyer | 5 months ago

I enjoyed reading the article, but I'm pretty thrown by the benchmarks and conclusion. All of the times are reported to a single digit of precision, but then the summary is claiming that one function shows an improvement while the other two are described as negligible. When all the numbers presented are "~5ms" or "~6ms", it doesn't leave me confident that small changes to the benchmarking might have substantially changed that conclusion.

gizmo686 | 5 months ago

Yeah. When your timing results are a single-digit multiple of your timing precision, that is a good indication you either need a longer test or a more precise clock.

At a 5ms baseline with millisecond precision, the smallest improvement you can measure is 20%. And you cannot distinguish a 20% speedup from a 20% slowdown that happened to get lucky with clock ticks.

For what it is worth, I ran the provided test code on my machine with a 100x increase in iterations and got the following:

  == Benchmarking ABS ==
  ABS (branch):     0.260 sec
  ABS (branchless): 0.264 sec

  == Benchmarking CLAMP ==
  CLAMP (branch):     0.332 sec 
  CLAMP (branchless): 0.538 sec

  == Benchmarking PARTITION ==
  PARTITION (branch):     0.043 sec
  PARTITION (branchless): 0.091 sec
Which is not exactly encouraging (gcc 13.3.0, -ffast-math -march=native; I did not use the -fomit-this-entire-function flag, which my compiler does not understand).

I had to drop down to O0 to see branchless be faster in any case:

  == Benchmarking ABS ==
  ABS (branch):     0.743 sec
  ABS (branchless): 0.948 sec

  == Benchmarking CLAMP ==
  CLAMP (branch):     4.275 sec
  CLAMP (branchless): 1.429 sec

  == Benchmarking PARTITION ==
  PARTITION (branch):     0.156 sec
  PARTITION (branchless): 0.164 sec

Someone | 5 months ago

> I had to drop down to O0 to see branchless be faster in any case

Did you check whether your branchy code actually still was branchy after the compiler processed it at higher optimization levels?

Joel_Mckay | 5 months ago

In general, modern compilers will often unroll or inline functions without people even noticing. This often helps with cache level state localization and parallelism.

Most code should focus on readability; then profile for busy areas under real use, and finally refactor the busy areas through hand optimization or register hints as required.

If one creates something that looks suspect (an inline assembly macro), a peer or an llvm build will come along and ruin it later for sure. Have a great day =3

hinkley | 5 months ago

Doesn’t it also help with branch prediction since the unrolled loop can use different statistics with each copy?