I discovered a really cool ARM64 trick today. One thing about x86 that I've found useful on so many occasions is the PCMPEQB + PMOVMSKB + BSF trick that lets me scan the bytes of a string 10x faster. I couldn't find any information on Google for doing PMOVMSKB with ARM, so I've been studying ARM's "Optimized Routines" codebase where I stumbled upon the answer in their strnlen() implementation. It turns out the trick is to use `shrn dst.8b, src.8h, 4` which turns a 128-bit mask into a 64-bit mask. You can then get the string offset index with fmov, rbit, clz and finally shift by 2.
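The nibble layout is the neat part. Here's a scalar C model of it (my own sketch with made-up helper names, not the Optimized Routines code): cmeq yields 0x00/0xFF per byte, and `shrn dst.8b, src.8h, #4` shifts each 16-bit lane right by 4 and narrows it to 8 bits, collapsing the 128-bit compare result into a 64-bit mask with one nibble per input byte.

```c
#include <stdint.h>

/* Scalar model of cmeq + shrn #4: each of the 8 16-bit lanes holds two
 * compare bytes (0x00 or 0xFF); shifting right by 4 and narrowing keeps
 * the top nibble of the low byte and the bottom nibble of the high byte. */
static uint64_t shrn4_movemask(const uint8_t bytes[16], uint8_t needle) {
    uint64_t mask = 0;
    for (int lane = 0; lane < 8; lane++) {
        uint16_t lo = (bytes[2 * lane]     == needle) ? 0xFF : 0;
        uint16_t hi = (bytes[2 * lane + 1] == needle) ? 0xFF : 0;
        uint16_t pair = (uint16_t)(lo | (hi << 8));  /* little-endian lane */
        mask |= (uint64_t)((pair >> 4) & 0xFF) << (8 * lane);
    }
    return mask;
}

/* fmov moves the mask to a GPR; rbit+clz is a count-trailing-zeros, and
 * the final shift right by 2 converts a nibble position to a byte index. */
static int first_match(const uint8_t bytes[16], uint8_t needle) {
    uint64_t mask = shrn4_movemask(bytes, needle);
    return mask ? (__builtin_ctzll(mask) >> 2) : -1;
}
```

On real hardware the loop is of course a single `shrn`; the model just shows why byte i ends up owning mask bits 4i..4i+3.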
In my experience using a 512 wide movemask (to uint64_t) is the fastest on both x86 and arm64. (Edit: just to clarify, I meant the fastest for iteration; things like SwissMap are better off using a 128 wide movemask)
With rvv you don't really want to go from a vector mask to a general purpose non vector register, because the vector length may vary. But I found it really useful that vector masks are always packed into v0. So even with LMUL=8, you can just do a vmseq, switch to LMUL=1 and use vfirst & vmsif & vmandn to iterate through all indices. (Alternatively vfirst & vmsof & vmclr would also work, I'm not sure which one would be faster)
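For a packed 64-bit mask those three mask ops reduce to familiar scalar bit tricks; a C model (my own sketch, not RVV code) of the iteration:

```c
#include <stdint.h>

/* Scalar model of the RVV idiom: vfirst finds the lowest set mask bit,
 * then vmsif + vmandn clear everything up to and including it, which on
 * a packed 64-bit mask is just the classic "m &= m - 1" step. */
int collect_indices(uint64_t mask, int *out) {
    int n = 0;
    while (mask) {
        out[n++] = __builtin_ctzll(mask); /* vfirst */
        mask &= mask - 1;                 /* vmsif + vmandn */
    }
    return n;
}
```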
I am very surprised that this is presented as something new. From the very beginning of ARM, all instructions have had a condition attached to them. Contrary to the article, it has absolutely nothing to do with making the processor more CISCy, but is instead one of its most RISCy aspects.
All 32-bit ARM opcodes had predication, but when ARM went 64-bit, they wanted to recover the encoding space for 32 instead of 16 registers, and removed predication from most instructions. When they did this, they looked at all the 32-bit ARM binaries they could find, and counted which instructions were actually used with predicates, and added the top 5 of those as separate instructions.
Yes, I had similar thoughts when I started reading, but I think only ARM32 has predication. (Thumb-2 has the prefix-style IT instruction, I think, but it doesn't devote part of the encoding space to predication bits like ARM32 does.)
As I understand it they didn't carry predication across from ARM32 to ARM64 for various performance reasons (if you want to be able to re-order instructions, or even aggressively pipeline them, you don't want them depending on the result of the immediately-prior instruction).
Predication everywhere (i.e. orthogonal to the rest of the instruction set, and not special-cased) is certainly more RISC than CISC - but having removed it in general, bringing it back for a few specific instructions is arguably CISCy.
I thought this was interesting, although of course I agree with many commenters' take that the lack of reference to the "old-school" ARM where everything was conditional is odd.
I got curious about how RISC-V handles this, but only curious enough to find [1] and not dig any further. That answer is from a year ago, so perhaps there have been changes.
"cmov" and several more interesting instructions in the draft RISC-V Bitmanip proposal were dropped before it reached 1.0 though.
There is a new proposal: Zicond, but it is quite crude, with two instructions. The "czero.eqz" instruction does:
rd = (rs2 == 0) ? 0 : rs1;
And the other, "czero.nez", tests for "rs2 != 0". Both are supposed to produce an operand for another instruction, where a zero operand makes it a no-op: for conditional add, sub, xor, etc.
Conditional move, however, takes three instructions: two czero results, of which one is always zero, or'ed together.
Otherwise, the intention was that bigger RISC-V cores would detect a conditional branch over a single instruction in the decoder and perform macro-op fusion into a conditional instruction.
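A C model of those semantics and of the three-instruction conditional move (a sketch of the semantics, not code from the spec):

```c
#include <stdint.h>

/* Zicond primitives: zero the result unless the condition register
 * is zero (eqz) or nonzero (nez). */
static uint64_t czero_eqz(uint64_t rs1, uint64_t rs2) {
    return rs2 == 0 ? 0 : rs1;
}
static uint64_t czero_nez(uint64_t rs1, uint64_t rs2) {
    return rs2 != 0 ? 0 : rs1;
}

/* Conditional move in three instructions: one side is always zero,
 * so or-ing the two results picks the live value. */
static uint64_t cmov(uint64_t cond, uint64_t a, uint64_t b) {
    return czero_eqz(a, cond) | czero_nez(b, cond);
}
```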
The while loop in the third paragraph is easier to read in assembly than in the original C++, which either says something about how well chosen the instruction set is, or about how bad some of C++ is.
Nothing to do with C++ - it's plain C code as a matter of fact, but that's not important at all. The code employs low-level, intrinsic knowledge of the CPU microarchitecture (x86-64) and the compiler's codegen (clang) to pack as many instructions per cycle as possible, so that the resulting (de)compression speed is improved. You cannot write such a piece of code so that it looks "beautiful" to an average Joe.
It’s weirdly written, maybe to mimic conditional machine instructions. It’s also unusual in that it seems to assume that each input array contains each number only once, as it outputs numbers contained in both input arrays only once, but only under that prior assumption.
Wouldn't this be the ideal instruction for implementing multi-word arithmetic? If the carry flag is set from the previous (lower order) addition, increase the next word up by one and continue adding.
And of course ARM32 had conditional execution for all instructions. These appear to be the variants that were useful enough to keep around when the general feature was removed from AArch64.
ARM has both add-with-carry and add-without-carry instructions, a separate increment is not necessary. (I don't know much about AArch64, only ancient ARM2/3, but I expect they left this in).
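Right - with an adds/adcs chain the carry flows through the whole multi-word addition. A portable C model of what the carry flag does (the function name is mine):

```c
#include <stdint.h>

/* Multi-word addition, dst = a + b over n 64-bit limbs, propagating
 * the carry the way an adds/adcs chain would. */
void multiword_add(uint64_t *dst, const uint64_t *a,
                   const uint64_t *b, int n) {
    unsigned carry = 0;
    for (int i = 0; i < n; i++) {
        uint64_t s = a[i] + b[i];
        unsigned c1 = s < a[i];      /* overflow in a[i] + b[i]      */
        dst[i] = s + carry;
        carry = c1 | (dst[i] < s);   /* overflow adding the carry-in */
    }
}
```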
ARM used to have the beautiful UMAAL, a single instruction that would multiply two registers, accumulate two other values into the result, and store the double-word result back into those registers.
This is the inner loop of multiplication and was very nice to use, but died in the AArch64 transition.
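Its semantics, modeled in C; the neat property is that the double accumulate can never overflow the 64-bit result:

```c
#include <stdint.h>

/* ARM32 UMAAL: {hi:lo} = rn * rm + lo + hi.
 * Maximum value is (2^32-1)^2 + 2*(2^32-1) = 2^64-1, so the result
 * always fits in the 64-bit register pair. */
static void umaal(uint32_t *lo, uint32_t *hi, uint32_t rn, uint32_t rm) {
    uint64_t r = (uint64_t)rn * rm + *lo + *hi;
    *lo = (uint32_t)r;
    *hi = (uint32_t)(r >> 32);
}
```

That no-overflow property is exactly why it works as the inner step of schoolbook bignum multiplication.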
You have to be careful with turning control dependencies into data dependencies. It can be very hard to understand or predict how a CPU will behave.
If you are testing quite predictable things, you almost always want to use branch prediction and not predicated/conditional instructions.
If something is totally unpredictable, say a binary search looking up random elements in a well-balanced heap or tree, each comparison is very unpredictable. A conditional select would work best there.
You could do your tree walks entirely without branch misses if that first line was a select... But it turns out that is not true. Or it's not necessarily true: depending on a few (not uncommon) factors, it can be worse to use a select there.
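The original snippet isn't shown here, but the shape of loop being discussed is something like this (my reconstruction; the node layout and names are made up):

```c
#include <stddef.h>

/* Descending a binary search tree where the direction pick is a data
 * dependency (compilers typically emit csel/cmov for the index
 * expression) rather than a predicted branch. */
typedef struct Node { int key; struct Node *child[2]; } Node;

static Node *find(Node *n, int key) {
    while (n != NULL && n->key != key)
        n = n->child[key > n->key];  /* select, not branch, per level */
    return n;
}
```

Whether the select version actually wins depends on the factors mentioned above, e.g. how predictable the comparisons really are and whether the select serializes loads that prediction would have overlapped.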
How does software these days target all the different CPUs with different instructions?
If I download, say, debian-11.7.0-amd64-netinst.iso - does it somehow dynamically adapt to all the different AMD and Intel CPUs and use the instructions available on the user's machine?
Software compiled to be "portable" uses a reduced subset. You actually have to bully GCC into using the full CPU instruction set with -march=native (you can also put another target CPU arch there).
In short, distributed binaries tend to use "least common denominator" instructions.
I believe one of the touted pros of Gentoo, where everything is compiled locally, is that all the software uses the CPU to its fullest potential.
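There's also a middle ground used by things like glibc: ship several code paths in one binary and pick at runtime. A minimal sketch (the function names are made up; `__builtin_cpu_supports` is a real GCC/Clang builtin on x86):

```c
/* Two hypothetical implementations; in real code the "fancy" one would
 * use AVX2 intrinsics while the baseline sticks to SSE2-era code. */
static int add_baseline(int a, int b) { return a + b; }
static int add_fancy(int a, int b)    { return a + b; }

static int (*add_impl)(int, int) = add_baseline;

/* Resolve once at startup; all later calls go through the pointer.
 * glibc does the same thing more elegantly with ifunc resolvers. */
static void init_dispatch(void) {
#if defined(__x86_64__) || defined(__i386__)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        add_impl = add_fancy;
#endif
}
```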
There are several uarch levels defined for x86-64 which include newer instructions than the baseline. Some distros are starting to move to those higher levels; notably RHEL 9 targets x86-64-v2.
For a while, submissions to the iOS app store could include bitcode, which was LLVM's intermediate byte code. I don't know if they ever did, but Apple could generate architecture-optimized binaries for their various CPU models. They deprecated that last year, though.
.Net ahead-of-time compilation (that is, compiling the .net / clr VM byte code into something your CPU can run directly) could (but apparently doesn't?) do CPU-specific optimizations. The JIT compiler, however does do some CPU-specific optimizations.
Compiler flags. You turn on/off compiler optimisations for target architectures that are aware of all the instruction-set specific hardware level optimisations.
It looks like the reason this apparently weird instruction exists is that AArch64 has a zero register, meaning you can use csinc with two zero register operands to represent cond ? 1 : 0.
Given that AArch64 has/had no 16-bit instruction support, it probably made sense to provide a generalization of a setcond instruction to make use of the encoding space of 32-bit instructions, and that's one of the most obvious (the other ones being cond ? imm : 0 or cond ? imm : reg).
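To make that concrete, the csinc semantics in C (my paraphrase of the architecture pseudocode); note that the `cset rd, cond` alias actually encodes `csinc rd, xzr, xzr` with the inverted condition:

```c
#include <stdint.h>

/* AArch64 csinc: rd = cond ? rn : rm + 1. */
static uint64_t csinc(int cond, uint64_t rn, uint64_t rm) {
    return cond ? rn : rm + 1;
}

/* cset is csinc with both sources wired to the zero register and the
 * condition inverted, yielding rd = cond ? 1 : 0. */
static uint64_t cset(int cond) {
    return csinc(!cond, 0, 0);
}
```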
I wonder how long it will take for all the software to mature enough to fully use the performance of today's hardware. I mean all the optimizations in language compilers, OSes and such. 50 years? 1 year after the first AGI coder?
Although ARM is marketed as RISC, it does have a lot of CISC-like features. I suspect the designers knew that with fixed-size instructions, they had to pack as much as they could into them to increase code density.
I think it is very much in the RISC philosophy to have fewer, more powerful, but still simple, instructions which can be combined with operands in complex ways to do a lot of different things.
Another example of this are all the combinations with the hard-coded zero register. For instance, the `cmp` "instruction" in A64 (and many other RISC ISAs) is actually an alias to the `subs` (subtract and set status flags) instruction with the zero-register as destination.
The idea of the zero register was so potent that modern CISC x86 processors actually have a physical zero register internally, which old x86 instructions are translated to use.
The ARM has lots of instructions, each fairly simple. Compare this to an architecture where a single instruction can ① compute the address of its operands in main memory, ② read them, ③ carry out its main operation and eventually ④ write the result to main memory, with most of those steps optional and depending on the arguments supplied.
danlark | 2 years ago:
You can read about it in https://community.arm.com/arm-community-blogs/b/infrastructu...
[1]: https://stackoverflow.com/a/72341794/28169
https://github.com/riscv/riscv-zicond/blob/main/zicondops.ad...
franky47 | 2 years ago:
https://en.wikipedia.org/wiki/Sinc_function
aleden | 2 years ago:
see https://github.com/bminor/glibc/blob/glibc-2.31/sysdeps/x86_...
zokier | 2 years ago:
You'll find lots of discussions happening around this topic, for example: https://www.phoronix.com/news/Arch-Linux-x86-64-v3-Port-RFC
kramerger | 2 years ago:
I think that is the reason GCC will not use it, although it may if you set the target CPU with -mcpu=
exabrial | 2 years ago:
10/10 on the website. Clean simple design and doesn't download 4,124 javascript libraries for the purpose of displaying static content.
msla | 2 years ago:
Anyway, here's John Mashey, who helped design the MIPS, on RISC v CISC:
https://yarchive.net/comp/risc_definition.html
terrelln | 2 years ago:
Looking forward to your future posts!