top | item 36623823

(no title)

414owen | 2 years ago

The version that's friendly to the compiler is described in part two: https://owen.cafe/posts/the-same-speed-as-c/

It achieves 3.88GiB/s

I intentionally didn't go down the route of vectorizing. I wanted to keep the scope of the problem small, and show off the assembly tips and tricks in the post, but maybe there's potential for a future post, where I pad the input string and vectorize the algorithm :)

discuss

nwallin|2 years ago

So I downloaded your code. On my desktop, with loop-9 gcc I got ~4.5GB/s, and with loop-7 I got ~4.4GB/s. With the following code:

    #include <stddef.h>
    
    int run_switches(const char *s, size_t n) {
      int res = 0;
      for (; n--; ++s)
        res += (*s == 's') - (*s == 'p');
      return res;
    }

I got ~31GB/s in GCC and ~33GB/s in Clang. This is without any padding, or SIMD intrinsics, or any such nonsense. This is just untying the compiler's hands and giving it permission to do its job properly.

Don't want to pass the string length? That's fine, we can figure that out for ourselves. This code:

    #include <stddef.h>
    #include <string.h>

    int run_switches(const char *s) {
      int res = 0;
      for (size_t n = strlen(s); n--; ++s)
        res += (*s == 's') - (*s == 'p');

      return res;
    }

Is 27GB/s. With a little bit of blocking:

    #include <stddef.h>
    
    int run_switches(const char *s, size_t n) {
      int res = 0;
      char tmp = 0;
      for (size_t i = n & 63; i--; ++s)
        tmp += (*s == 's') - (*s == 'p');
      res += tmp;
    
      for (n >>= 6; n--;) {
        tmp = 0;
        for (size_t i = 64; i--; ++s)
          tmp += (*s == 's') - (*s == 'p');
        res += tmp;
      }
    
      return res;
    }

That's ~55GB/s.

Anyway, the point is, you're pretty far from the point where you ought to give up on C and dive into assembly.

utopcell|2 years ago

Indeed. I suppose the two lessons are, stick with C, and don't forget the semantics of your original problem when optimizing.

    int run_switches(const char *s) {
      int res = 0;
      uint8_t tmp = 0;
      size_t n = strlen(s);
      for (size_t i = n & 127; i--; ++s)
        tmp += (*s == 's');
      res += tmp;
    
      for (size_t j = n >> 7; j--;) {
        tmp = 0;
        for (size_t i = 128; i--; ++s)
          tmp += (*s == 's');
        res += tmp;
      }
    
      return 2 * res - n;
    }

magicalhippo|2 years ago

Another good reason to write optimization-friendly C (or similar) over assembly code, especially in libraries, is that the compiler will evolve with CPUs, while the assembly code will not.

I've seen plenty of cases where replacing hand-written assembly with C (or similar) lead to a substantial performance increase because the assembly code was written for some old CPU and not the best way of doing things on current CPUs.

rajnathani|2 years ago

This seems like the most efficient solution. I have a neighboring comment on this post which suggests using bit arithmetic, but the above solution is more efficient than that. Here’s what the assembly code for the body of the first loop compiles down to (I had to use ChatGPT-4 as godbolt unfortunately doesn’t work on mobile):

    cmp dl, 's'    ; Compare the character with 's'
    sete dl        ; If the character is 's', set dl to 1. Otherwise, set dl to 0.
    sub al, dl     ; Subtract the result from res

    cmp dl, 'p'    ; Compare the character with 'p'
    sete dl        ; If the character is 'p', set dl to 1. Otherwise, set dl to 0.
    add al, dl     ; Add the result to res

SleepyMyroslav|2 years ago

>Anyway, the point is, you're pretty far from the point where you ought to give up on C and dive into assembly.

Thank you. I hope people who post random assembly listings on HN written in some extinct ISA will read your posts.