I'm not so sure that the right take-away is "hand-written assembler is 6x faster than C." It's more like "jumps are a lot slower than conditional arithmetic." And that can [edit:often] be achieved easily in C by simply not using switch statements when an if statement or two will work fine.
Rewriting the C function as follows got a 5.5x speedup:
    int run_switches(char *input) {
        int r = 0;
        char c;
        while (1) {
            c = *input++;
            if (c == 's') r++;
            if (c == 'p') r--;
            if (c == '\0') break;
        }
        return r;
    }
Results:
[16:50:14 user@boxer ~/looptest] $ gcc -O3 bench.c loop1.c -o lone
[16:50:37 user@boxer ~/looptest] $ gcc -O3 bench.c loop2.c -o ltwo
[16:50:47 user@boxer ~/looptest] $ time ./lone 1000 1
449000
./lone 1000 1 3.58s user 0.00s system 99% cpu 3.589 total
[16:50:57 user@boxer ~/looptest] $ time ./ltwo 1000 1
449000
./ltwo 1000 1 0.65s user 0.00s system 99% cpu 0.658 total
What version of GCC are you using? For me both versions perform the same, both on Ubuntu and Windows:
$ time ./lone 1000 1
851000
real 0m3.578s
user 0m3.574s
sys 0m0.004s
$ time ./ltwo 1000 1
851000
real 0m3.583s
user 0m3.583s
sys 0m0.000s
$ gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Is rewriting switch statements to a bunch of ifs always faster? Or is there some number of cases where the switch is faster? Seems like it should be added as a compiler optimization if it's consistent.
Yes, but this will backfire on ARM, where jumps are roughly as fast as conditional arithmetic.
The whole point of using C is not to think about the underlying architecture. As soon as you start taking "jumps are a lot slower than conditional arithmetic on x86" into account, you're not writing in C, you're writing in assembly with extra steps :-)
Is there any good article comparing performance across programming languages? It seems like every time I see one, it's broken because the tested logic is poorly implemented in language(s) XYZ.
IMHO the original code wasn't written in a way that's particularly friendly to compilers. If you write it like this:
    int run_switches_branchless(const char* s) {
        int result = 0;
        for (; *s; ++s) {
            result += *s == 's';
            result -= *s == 'p';
        }
        return result;
    }
...the compiler will do all the branchless sete/cmov stuff as it sees fit. It will be the same speed as the optimized assembly in the post, +/- something insignificant. However it won't unroll and vectorize the loop. If you write it like this:
    int run_switches_vectorized(const char* s, size_t size) {
        int result = 0;
        for (; size--; ++s) {
            result += *s == 's';
            result -= *s == 'p';
        }
        return result;
    }
It will know the size of the loop, and will unroll it and use AVX-512 instructions if they're available. This will be substantially faster than the first loop for large inputs, although I'm too lazy to benchmark just how much faster it is.
Now, this requires knowing the size of your string in advance, and maybe you're the sort of C programmer who doesn't keep track of how big your strings are. I'm not your coworker, I don't review your code. Do what you want. But you really really probably shouldn't.
I intentionally didn't go down the route of vectorizing. I wanted to keep the scope of the problem small, and show off the assembly tips and tricks in the post, but maybe there's potential for a future post, where I pad the input string and vectorize the algorithm :)
Nice! I tried it in Nim and it appears to trigger it with:
    {.overflowChecks: off.}
    proc run_switches*(input: cstring): int {.exportc.} =
      result = 0
      for c in input:
        result.inc int('s' == c)
        result.dec int('p' == c)
That gives a ~5x speedup on an Apple M1. Keeping overflow checks on only gets it up to ~2x the default C version. Always nice to know good ways to trigger SIMD opts.
I’m probably an optimization expert, and I would solve that problem completely differently.
On my computer, the initial C version runs at 389 MB/second. I haven't tested the assembly versions, but if they deliver the same 6.2x speedup, they would result in 2.4 GB/second here.
Interesting. I think you can vectorize the prologue using movemask + popcnt instead of keeping a counter in the ymm registers (warning: untested code, still need to benchmark it):
    const __m256i zero = _mm256_setzero_si256();
    const __m256i s = _mm256_set1_epi8( 's' );
    const __m256i p = _mm256_set1_epi8( 'p' );
    const size_t a = (size_t)input;
    const size_t rem = a % 32;
    const char* aligned = input - rem;
    const __m256i v = _mm256_load_si256( (const __m256i*)aligned );
    const __m256i z = _mm256_cmpeq_epi8( v, zero );
    size_t m_plus = (uint32_t)_mm256_movemask_epi8( _mm256_cmpeq_epi8( v, s ) );
    size_t m_minus = (uint32_t)_mm256_movemask_epi8( _mm256_cmpeq_epi8( v, p ) );
    size_t m_zero = (uint32_t)_mm256_movemask_epi8( z );
    size_t offset_zero = _tzcnt_u64( m_zero >> rem );
    m_plus = _bzhi_u64( m_plus >> rem, offset_zero );
    m_minus = _bzhi_u64( m_minus >> rem, offset_zero );
    // Skip the loop if we already found the end of the string...
    while (m_zero == 0) {
        // ...
    }
    // ...
    return _mm_popcnt_u64( m_plus ) + res - _mm_popcnt_u64( m_minus );
This code screams for SIMD! If you can change the prototype to take an explicit length, you could easily read and process 16 bytes at a time (the compares will give you values you can just add and subtract directly). Heck, even calling strlen() at the function's start to get the explicit length would probably be worth it.
I threw together a quick risc-v vectorized implementation:
    size_t run(char *str) {
        uint8_t *p = (uint8_t*)str;
        long end = 0;
        size_t res = 0, vl;
        while (1) {
            vl = __riscv_vsetvlmax_e8m8();
            vuint8m8_t v = __riscv_vle8ff_v_u8m8(p, &vl, vl);
            end = __riscv_vfirst_m_b1(__riscv_vmseq_vx_u8m8_b1(v, '\0', vl), vl);
            if (end >= 0)
                break;
            res += __riscv_vcpop_m_b1(__riscv_vmseq_vx_u8m8_b1(v, 's', vl), vl);
            res -= __riscv_vcpop_m_b1(__riscv_vmseq_vx_u8m8_b1(v, 'p', vl), vl);
            p += vl;
        }
        vl = __riscv_vsetvl_e8m8(end);
        vuint8m8_t v = __riscv_vle8_v_u8m8(p, vl);
        res += __riscv_vcpop_m_b1(__riscv_vmseq_vx_u8m8_b1(v, 's', vl), vl);
        res -= __riscv_vcpop_m_b1(__riscv_vmseq_vx_u8m8_b1(v, 'p', vl), vl);
        return res;
    }
Here are the results from the above, plus the switch and table C implementations, run on my MangoPi MQ Pro (C906, in-order rv64gc with rvv 0.7.1 and a 128-bit vector length):
switch: 0.19 Bytes/Cycle
tbl: 0.17 Bytes/Cycle
rvv: 1.57 Bytes/Cycle (dips down to 1.35 after ~30 KiB)
Edit: you can go up to 2/1.7 Bytes/Cycle, if you make sure the pointer is page aligned (and vl isn't larger than the page size), see comments
To be fully correct, you'd need the load to be a fault-only-first load (which rvv does have), otherwise that could fail if the null byte was just before the end of allocated memory.
The good question is what do we need C for in general? Of course, we can hand-tailor our code to run best on one particular piece of hardware. And we don't need C for that, it would be the wrong tool. We need assembly (and a decent macro system for some sugar)
But the original goal of C was to make translating system-level code from one platform to another easier. And we're expected to lose efficiency in this operation. It's like instead of writing a poem in Hindi and translating it into Urdu, we write one in Esperanto and then translate it to whatever language we want automatically. You don't get two brilliant poems, you only get two poor translations, but you get them fast. That's what C is for.
Rearranging branches (and perhaps blocks too?) will definitely be done if you are building using FDO, because without FDO (or PGO) the compiler has no idea how likely each branch is to be taken. Cmov can also be enabled by FDO in some cases.
However, whether or not using cmov is effective compared to a regular test/jump is highly dependent on how predictable the branch is, with cmov typically performing better when the branch is very unpredictable. Since they got a 6x speedup with cmov, I assume that their test input (which isn't described in the post, and is also not in their GitHub repo) consists of random strings consisting almost entirely of s and p characters. There's nothing wrong with this, but it does make the post seem a little misleading to me, as their clever speedup is mostly about exploiting an unmentioned property of the data that is highly specific to their benchmark.
> I assume that their test input (which isn't described in the post, and is also not in their GitHub repo) consists of random strings consisting almost entirely of s and p characters.
test code is here: https://github.com/414owen/blog-code/blob/master/02-the-same... it randomly selects between 's' or 'p'. The characters can't be anything other than 's', 'p', or the terminating null. Knowing that particular fact about our input gives us this ...clever... optimization:
    int run_switches(const char* s) {
        int result = 0;
        while (*s)
            result += (1 | *s++) - 'r';
        return result;
    }
which compiles to:
    run_switches:
            movzx   eax, BYTE PTR [rdi]
            xor     edx, edx
            test    al, al
            je      .L1
    .L3:
            or      eax, 1
            inc     rdi
            movsx   eax, al
            lea     edx, [rdx-114+rax]
            movzx   eax, BYTE PTR [rdi]
            test    al, al
            jne     .L3
    .L1:
            mov     eax, edx
            ret
This is too clever by half, of course, but it perfectly illustrates your point about exploiting properties of the data.
> because without FDO (or PGO) the compiler has no idea how likely each branch is to be taken
So, the maximum number of times you can hit '\0' is once per string, because then the function returns, but you can hit the other characters many times, which seems to be information a compiler has access to without PGO.
PGO does help, of course, and on my machine gives me 2.80s, which is better than the code at the end of the `Rearranging blocks` section :)
> I assume that their test input (which isn't described in the post, and is also not in their GitHub repo)
Side note: There's a part two to this post (linked at the bottom) where I make the C code as fast as I possibly can, and it beats all the assembly in this post.
I never said writing assembly is (necessarily) a good idea, I just find optimizing it, and deciphering compiler output, an interesting challenge, and a good learning opportunity.
I think I managed to improve on both this post, and its sequel, at the cost of specializing the function for the case of a string made only of 's' and 'p'.
The benchmark only tests strings made of 's' and 'p', so I think it is fair.
The idea is as follows. We want to increase `res` by one when the next character is `s`. Naively, we might try something like this:
    res += (c - 'r'); // is `res += 1` when c == 's'
This doesn't work, as `'p' - 'r' == -2`, and we'd need it to be -1.
But `'p' - 'r'`, when viewed as an unsigned subtraction, underflows, setting the carry flag.
It turns out x64 has an instruction (adc) that adds two registers _plus_ the carry flag.
Therefore we can replace two `cmp, cmov` with one `sub, adc`:
    run_switches:
            xor     eax, eax              # res = 0
    loop:
            movsx   ecx, byte ptr [rdi]
            test    ecx, ecx
            je      ret
            inc     rdi
            sub     ecx, 'r'
            adc     eax, ecx              # Magic happens here
            jmp     loop
    ret:
            ret
Benchmarks are as follows (`bench-x64-8` is the asm above):
    Summary
      '01-six-times-faster-than-c/bench-x64-8 1000 1' ran
        1.08 ± 0.00 times faster than '02-the-same-speed-as-c/bench-c-4-clang 1000 1'
        1.66 ± 0.00 times faster than '01-six-times-faster-than-c/bench-x64-7 1000 1'
Of course, one could improve things further using SWAR/SIMD...
Even simpler: just sum all elements of the array. Then at the end subtract 'p'*len from the sum, then divide by ('s'-'p') to get the s count. The 'p' count is len minus the 's' count.
The initial sum is easily vectorized as well.
If I've not made any mistakes it should work. Only issue is possible overflow on the running sum.
Can't be bothered to benchmark it though:).
edit: I'd missed that the original decrements on 'p'. So the final result is s_count - p_count.
    int run_switches(const char *buf) {
        size_t len = strlen(buf);
        int res = 0;
        for (size_t i = 0; i < len; ++i) {
            res += (buf[i] == 's') - (buf[i] == 'p');
        }
        return res;
    }
strlen() should be implemented in a pretty fast way, and after the buffer size is known, the compiler can autovectorize the inner loop, which does happen in practice: https://gcc.godbolt.org/z/qYfadPYoq
A while back, I wrote a UTF-8 decoder in Common Lisp, targeting SBCL (it already has one built in, this was an exercise). Pretty much all of the optimization win (after the obvious low-hanging fruit) was structuring the code so that the compiler would generate cmov* instructions rather than branches.
What's some examples of the code changes that you made? And did you just do repeated disassemblies of the functions to see that it was using the correct instructions, or did you do some benchmarking to show your changes were actual improvements?
Branches are prone to be faster than conditional moves if they are correctly predicted, because they do not increase the critical path length. And utf-8 decoders are commonly run on all-ascii input. What were you benchmarking on?
First, before optimizing you should consider correctness and security.
input should be const and the return value should be ssize_t (so you don't have numeric overflow on 64-bit).
Second, consider this replacement function:
    ssize_t test(const char *input) {
        ssize_t res = 0;
        size_t l = strlen(input);
        size_t i;
        for (i = 0; i < l; ++i) {
            res += (input[i] == 's') - (input[i] == 'p');
        }
        return res;
    }
The timings are (using gcc -O3 -march=native): your function 640 cycles, mine 128 cycles. How can that be? I'm reading the memory twice! I have one call to strlen in there, and memory is slow. Shouldn't this be much slower?
No. strlen is a hack that uses vector instructions even though it may technically read beyond the string length. It makes sure not to cross page boundaries so it will not cause adverse reactions, but valgrind needs a suppression exception to not complain about it.
If you know the length beforehand, the compiler can vectorize and unroll the loop, which it happens to do here. To great effect, if I may say so.
The art of writing fast code is usually to get out of the way of the compiler, which will do a perfectly fine job if you let it.
If you really wanted to, you could get rid of the strlen by folding your logic into what strlen does. That would make the C code much less readable and not actually help that much. My test string is "abcdefghijklmnopqrstuvxyz", so it's all in the L1 cache.
I'm not a compiler expert but if it's a "very simple loop" is it still too complex for the compiler to make good machine code? Did they use a bad compiler on purpose? Or are computers just not yet fast enough to do a good job with very simple loops in practical compilers?
I had an old compilers professor say something like this once. “If you think you can do something better than the C compiler, I promise you you can’t.”
    int run_switches(char *input) {
        int res = 0;
        while (true) {
            char c = *input++;
            if (c == '\0') return res;
            // Here's the trick:
            res += (c == 's') - (c == 'p');
        }
    }
This gives a 3.7x speedup compared to loop-1.c. The lower line count is also nice.
Nice. The way I read the cmov version, it's more or less this, except the trick line goes:
    res += (c == 's') ? 1 : (c == 'p') ? -1 : 0;
I haven't done C in decades so I don't trust myself to performance test this but I'm curious how it compares. Pretty disappointed that TFA didn't go back and try that in C.
Any guide on how a person who uses Python or JavaScript can learn such things? I mean knowing which assembly code would be better, which algorithm makes better usage of processor etc.? :)
Also, how is such optimization carried out in a large scale software? Like, do you tweak the generated assembly code manually? (Sorry I'm a very very very beginner to low-level code)
You do this by first learning C (or similar languages) and then compilers, and maybe also operating systems. What you're seeing in this blog is the equivalent of at least one or two years of university-level education, so it's not like there is a single book or tutorial you could use to get up to speed, especially if you have no previous experience in the area. And building a better compiler optimisation in general is a PhD-thesis-level task. But it's also not necessary if you want to design user applications on today's hardware.
It's not a useful architecture, but it teaches the thought process really well, and you end up discovering a lot of optimization naturally.
For this article, I'm measuring every step to see what the performance implications of the changes are, which, along with some educated guesses and some googling/reading other articles, was enough for me to figure out what was going on.
In part two (https://owen.cafe/posts/the-same-speed-as-c/) especially, I didn't know what was going on with the benchmarks for a long time. Eventually I got lucky and made a change, which led to a hypothesis, which led to more tests, which led to a conclusion.
You learn by doing. Compiler Explorer [0] is fantastic for this sort of thing.
Typically you would do this sort of optimisation after profiling and then on a per-function level.

[0] godbolt.org
I think it's straightforward to optimize to a point it's maybe about 10x faster than the "optimized" version. The answer is of course SIMD vectorization.
I experimented with different optimizations and ended up with a 128x speedup. The improvement mainly comes from manual SIMD intrinsics, but you can go a long way just by making the code more auto-vectorization friendly, as some other comments have mentioned. See: https://ipthomas.com/blog/2023/07/n-times-faster-than-c-wher...
Back-of-the-envelope approach that should eliminate most branching:
    int table[256] = {0};

    void init() {
        table['s'] = 1;
        table['p'] = -1;
    }

    int run_switches(char *input, int size) {
        int res = 0;
        while (size-- >= 0) res += table[input[size]];
        return res;
    }
Would it be possible to write a code profiler and compiler that work together to optimize code based on real-world data? The profiler would output data that would feed back into the compiler, telling it which branches were selected most often, which would recompile optimizing for the profile. Would this even work? Has it already been done?
I see other people have done minor rewrites, but the post does mention reordering branches, so the obvious question is whether there was any attempt to use PGO, which is an obvious first step in optimization.
A very instructional post. I wish more people had such a level of mastery of GPU assembly and its effects, and would post such treatments on outsmarting NVIDIA's (or AMD's) optimizers.
Close to nobody works on Forth compilers nowadays, and the set of those compilers that are optimising or even fast is very small.
People say that Forth isn't very optimisable for our register machines, but I reckon you can get pretty good results with some clever stack analysis. It's actually possible to determine arity statically if you don't have multiple-arity words, which are very rare. That allows you to pass arguments by register.
Anyway, I'm not even close to an expert so don't take what I said as facts.
I made a variant that is (on my Apple m1 machine) 20x faster than the naive C version in the blog by branchlessly processing the string word-by-word:
    int run_switches(const char* input) {
        int res = 0;

        // Align to word boundary.
        while ((uintptr_t) input % sizeof(size_t)) {
            char c = *input++;
            res += c == 's';
            res -= c == 'p';
            if (c == 0) return res;
        }

        // Process word-by-word.
        const size_t ONES = ((size_t) -1) / 255;   // 0x...01010101
        const size_t HIGH_BITS = ONES << 7;        // 0x...80808080
        const size_t SMASK = ONES * (size_t) 's';  // 0x...73737373
        const size_t PMASK = ONES * (size_t) 'p';  // 0x...70707070
        size_t s_accum = 0;
        size_t p_accum = 0;
        int iters = 0;
        while (1) {
            // Load word and check for zero byte.
            // (w - ONES) & ~w has the top bit set in each byte where that byte is zero.
            size_t w;
            memcpy(&w, input, sizeof(size_t));
            if ((w - ONES) & ~w & HIGH_BITS) break;
            input += sizeof(size_t);

            // We reuse the same trick as before, but XORing with SMASK/PMASK first to get
            // exactly the high bits set where a byte is 's' or 'p'.
            size_t s_high_bits = ((w ^ SMASK) - ONES) & ~(w ^ SMASK) & HIGH_BITS;
            size_t p_high_bits = ((w ^ PMASK) - ONES) & ~(w ^ PMASK) & HIGH_BITS;

            // Shift down and accumulate.
            s_accum += s_high_bits >> 7;
            p_accum += p_high_bits >> 7;
            if (++iters >= 255 / sizeof(size_t)) {
                // To prevent overflow in our byte-wise accumulators we must flush
                // them every so often. We use a trick by noting that 2^8 = 1 (mod 255)
                // and thus a + 2^8 b + 2^16 c + ... = a + b + c (mod 255).
                res += s_accum % 255;
                res -= p_accum % 255;
                iters = s_accum = p_accum = 0;
            }
        }
        res += s_accum % 255;
        res -= p_accum % 255;

        // Process tail.
        while (1) {
            char c = *input++;
            res += c == 's';
            res -= c == 'p';
            if (c == 0) break;
        }
        return res;
    }
Fun fact: the above is still 1.6x slower (on my machine) than the naive two-pass algorithm that gets autovectorized by clang:
    int run_switches(const char* input) {
        size_t len = strlen(input);
        int res = 0;
        for (size_t i = 0; i < len; ++i) {
            char c = input[i];
            res += c == 's';
            res -= c == 'p';
        }
        return res;
    }
I assume the M1's SIMD registers are wider/more numerous than just the couple of size_t registers used for the loading/masking/accumulating inner loop in your run_switches().
You can speed up the code by unrolling your inner loop a few times (try 4x or 8x). It does mean that your overflow-prevention flush limit is lowered (to a multiple of the unroll factor) and runs a few more times, but the speedup offsets the increased bookkeeping.
A version I played with showed increased speed by saving the in-progress accumulation in an array and then doing the final accumulation after the main loop is done. But that may be due to the CPU arch/compiler I'm using.
Almost the same as my SWAR version - which is what you're doing.
But aren't you reading off the end of the buffer in your memcpy(&w...)? Say with an empty input string whose start address is aligned to sizeof(size_t) bytes?
I just passed in the string length since the caller had that info, otherwise you'd scan the whole string again looking for the zero terminator.
If I read it correctly, your implementation might read beyond the end of the buffer, and if it crosses a page boundary into an unmapped page, it will segfault. That's one of the many evils of null terminated strings.
https://owen.cafe/posts/the-same-speed-as-c/
And as others have pointed out, you can tweak the input, then vectorize the algo, if you want to go that route.
I considered this a pedagogical exercise and I sincerely hope nobody will start dropping down to assembly without a very good reason to.
haberman|2 years ago
This statement is true if the jumps are unpredictable. If the jumps are predictable, then jumps will be faster.
Linus had a whole rant about this back in the day, arguing that cmov is not useful if branches are predictable: https://yarchive.net/comp/linux/cmov.html
nwallin|2 years ago
https://godbolt.org/z/rde51zMd8
414owen|2 years ago
It achieves 3.88GiB/s
shusaku|2 years ago
/* DON’T REFACTOR THIS FOR READABILITY IT WILL SLOW DOWN */
jonny_eh|2 years ago
Shouldn't "not" keep track of string length?
Const-me|2 years ago
Here’s C++ version which for long buffers exceeds 24 GB/second on my computer: https://gist.github.com/Const-me/3ade77faad47f0fbb0538965ae7... That’s 61x speedup compared to the original version, without any assembly, based on AVX2 intrinsics.
gavinray|2 years ago
https://en.cppreference.com/w/cpp/experimental/simd
okaleniuk|2 years ago
However, on other processors, this might not be the case. https://wordsandbuttons.online/using_logical_operators_for_l...
It's described under `Benchmarking setup`, and is in the repository here: https://github.com/414owen/blog-code/blob/master/01-six-time...
gavinray|2 years ago
Kept me reading into the follow-up article.
Also, I love the UI of this blog.
414owen|2 years ago
But taking the length of the string as a parameter is not, because that changes the problem statement (making the solution vectorizable)
Also note that you'll try to read element -1 of the input. You probably want to change the `>=` to a `>`
failuser|2 years ago
Such a bloated ISA as x86 might actually handle predicate support, but who will try such a radical change?
gpderetta|2 years ago
Also, the original 32-bit ARM instruction set had extensive predication.