I discovered a really cool ARM64 trick today. One thing about x86 that I've found useful on so many occasions is the PCMPEQB + PMOVMSKB + BSF trick that lets me scan the bytes of a string 10x faster. I couldn't find any information on Google for doing PMOVMSKB with ARM, so I've been studying ARM's "Optimized Routines" codebase where I stumbled upon the answer in their strnlen() implementation. It turns out the trick is to use `shrn dst.8b, src.8h, 4` which turns a 128-bit mask into a 64-bit mask. You can then get the string offset index with fmov, rbit, clz and finally shift by 2.
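The nibble layout is the neat part. Here's a scalar C model of it (my own sketch with made-up helper names, not the Optimized Routines code): cmeq yields 0x00/0xFF per byte, and `shrn dst.8b, src.8h, #4` shifts each 16-bit lane right by 4 and narrows it to 8 bits, collapsing the 128-bit compare result into a 64-bit mask with one nibble per input byte.

```c
#include <stdint.h>

/* Scalar model of cmeq + shrn #4: each of the 8 16-bit lanes holds two
 * compare bytes (0x00 or 0xFF); shifting right by 4 and narrowing keeps
 * the top nibble of the low byte and the bottom nibble of the high byte. */
static uint64_t shrn4_movemask(const uint8_t bytes[16], uint8_t needle) {
    uint64_t mask = 0;
    for (int lane = 0; lane < 8; lane++) {
        uint16_t lo = (bytes[2 * lane]     == needle) ? 0xFF : 0;
        uint16_t hi = (bytes[2 * lane + 1] == needle) ? 0xFF : 0;
        uint16_t pair = (uint16_t)(lo | (hi << 8));  /* little-endian lane */
        mask |= (uint64_t)((pair >> 4) & 0xFF) << (8 * lane);
    }
    return mask;
}

/* fmov moves the mask to a GPR; rbit+clz is a count-trailing-zeros, and
 * the final shift right by 2 converts a nibble position to a byte index. */
static int first_match(const uint8_t bytes[16], uint8_t needle) {
    uint64_t mask = shrn4_movemask(bytes, needle);
    return mask ? (__builtin_ctzll(mask) >> 2) : -1;
}
```

On real hardware the loop is of course a single `shrn`; the model just shows why byte i ends up owning mask bits 4i..4i+3.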
In my experience using a 512 wide movemask (to uint64_t) is the fastest on both x86 and arm64. (Edit: just to clarify, I meant the fastest for iteration; things like SwissMap are better off using a 128 wide movemask)
With rvv you don't really want to go from a vector mask to a general purpose non vector register, because the vector length may vary. But I found it really useful that vector masks are always packed into v0. So even with LMUL=8, you can just do a vmseq, switch to LMUL=1 and use vfirst & vmsif & vmandn to iterate through all indices. (Alternatively vfirst & vmsof & vmclr would also work, I'm not sure which one would be faster)
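For a packed 64-bit mask those three mask ops reduce to familiar scalar bit tricks; a C model (my own sketch, not RVV code) of the iteration:

```c
#include <stdint.h>

/* Scalar model of the RVV idiom: vfirst finds the lowest set mask bit,
 * then vmsif + vmandn clear everything up to and including it, which on
 * a packed 64-bit mask is just the classic "m &= m - 1" step. */
int collect_indices(uint64_t mask, int *out) {
    int n = 0;
    while (mask) {
        out[n++] = __builtin_ctzll(mask); /* vfirst */
        mask &= mask - 1;                 /* vmsif + vmandn */
    }
    return n;
}
```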
I am very surprised that this is presented as something new. From the very beginning of ARM, all instructions have had a condition attached to them. Contrary to the article, it has absolutely nothing to do with making the processor more CISCy, but is instead one of its most RISCy aspects.
All 32-bit ARM opcodes had predication, but when ARM went 64-bit, they wanted to recover the encoding space for 32 instead of 16 registers, and removed predication from most instructions. When they did this, they looked at all the 32-bit ARM binaries they could find, and counted which instructions were actually used with predicates, and added the top 5 of those as separate instructions.
Yes, I had similar thoughts when I started reading, but I think only ARM32 has predication. (Thumb-2 has the prefix-style IT instruction, I think, but it doesn't devote part of the encoding space to predication bits like ARM32 does.)
As I understand it they didn't carry predication across from ARM32 to ARM64 for various performance reasons (if you want to be able to re-order instructions, or even aggressively pipeline them, you don't want them depending on the result of the immediately-prior instruction).
Predication everywhere (i.e. orthogonal to the rest of the instruction set, and not special-cased) is certainly more RISC than CISC - but having removed it in general, bringing it back for a few specific instructions is arguably CISCy.
I thought this was interesting, although of course I agree with many commenters' take that the lack of reference to the "old-school" ARM where everything was conditional is odd.
I got curious about how RISC-V handles this, but only curious enough to find [1] and not dig any further. That answer is from a year ago, so perhaps there have been changes.
"cmov" and several more interesting instructions in the draft RISC-V Bitmanip proposal were dropped before it reached 1.0 though.
There is a new proposal: Zicond, but it is quite crude, with two instructions. The "czero.eqz" instruction does:
rd = (rs2 == 0) ? 0 : rs1;
And the other, "czero.nez", tests for "rs2 != 0". Both are supposed to produce an operand for another instruction, where a zero operand makes it a no-op: for conditional add, sub, xor, etc.
Conditional move, however, takes three instructions: two czero results, of which one is always zero, or'ed together.
Otherwise, the intention was that bigger RISC-V cores would detect a conditional branch over a single instruction in the decoder and perform macro-op fusion into a conditional instruction.
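A C model of those semantics and of the three-instruction conditional move (a sketch of the semantics, not code from the spec):

```c
#include <stdint.h>

/* Zicond primitives: zero the result unless the condition register
 * is zero (eqz) or nonzero (nez). */
static uint64_t czero_eqz(uint64_t rs1, uint64_t rs2) {
    return rs2 == 0 ? 0 : rs1;
}
static uint64_t czero_nez(uint64_t rs1, uint64_t rs2) {
    return rs2 != 0 ? 0 : rs1;
}

/* Conditional move in three instructions: one side is always zero,
 * so or-ing the two results picks the live value. */
static uint64_t cmov(uint64_t cond, uint64_t a, uint64_t b) {
    return czero_eqz(a, cond) | czero_nez(b, cond);
}
```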
The while loop in the third paragraph is easier to read in assembly than in the original C++, which either says something about how well chosen the instruction set is, or about how bad some of C++ is.
Nothing to do with C++ - it's plain C code as a matter of fact, but that's not important at all. The code employs low-level, intrinsic knowledge of the CPU microarchitecture (x86-64) and the compiler's codegen (clang) to pack as many instructions per cycle as possible, so that the resulting (de)compression speed is improved. You cannot write such a piece of code so that it looks "beautiful" to an average Joe.
It’s weirdly written, maybe to mimic conditional machine instructions. It’s also unusual in that it seems to assume that each input array contains each number only once, as it outputs numbers contained in both input arrays only once, but only under that prior assumption.
Wouldn't this be the ideal instruction for implementing multi-word arithmetic? If the carry flag is set from the previous (lower order) addition, increase the next word up by one and continue adding.
And of course ARM32 had conditional execution for all instructions. These appear to be the variants that were useful enough to keep around when the general feature was removed from AArch64.
ARM has both add-with-carry and add-without-carry instructions, a separate increment is not necessary. (I don't know much about AArch64, only ancient ARM2/3, but I expect they left this in).
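Right - with an adds/adcs chain the carry flows through the whole multi-word addition. A portable C model of what the carry flag does (the function name is mine):

```c
#include <stdint.h>

/* Multi-word addition, dst = a + b over n 64-bit limbs, propagating
 * the carry the way an adds/adcs chain would. */
void multiword_add(uint64_t *dst, const uint64_t *a,
                   const uint64_t *b, int n) {
    unsigned carry = 0;
    for (int i = 0; i < n; i++) {
        uint64_t s = a[i] + b[i];
        unsigned c1 = s < a[i];      /* overflow in a[i] + b[i]      */
        dst[i] = s + carry;
        carry = c1 | (dst[i] < s);   /* overflow adding the carry-in */
    }
}
```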
ARM used to have the beautiful UMAAL, a single instruction that would multiply two registers, accumulate two other values into the result, and store the double-word result back into those registers.
This is the inner loop of multiplication and was very nice to use, but died in the AArch64 transition.
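Its semantics, modeled in C; the neat property is that the double accumulate can never overflow the 64-bit result:

```c
#include <stdint.h>

/* ARM32 UMAAL: {hi:lo} = rn * rm + lo + hi.
 * Maximum value is (2^32-1)^2 + 2*(2^32-1) = 2^64-1, so the result
 * always fits in the 64-bit register pair. */
static void umaal(uint32_t *lo, uint32_t *hi, uint32_t rn, uint32_t rm) {
    uint64_t r = (uint64_t)rn * rm + *lo + *hi;
    *lo = (uint32_t)r;
    *hi = (uint32_t)(r >> 32);
}
```

That no-overflow property is exactly why it works as the inner step of schoolbook bignum multiplication.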
You have to be careful with turning control dependencies into data dependencies. It can be very hard to understand or predict how a CPU will behave.
If you are testing quite predictable things, you almost always want to use branch prediction and not predicated/conditional instructions.
If something is totally unpredictable, say a binary search looking up random elements in a well-balanced heap or tree, each comparison is very unpredictable. A conditional select would work best there.
You could do your tree walks entirely without branch misses if that first line was a select... But it turns out that is not true. Or it's not necessarily true: depending on a few (not uncommon) factors, it can be worse to use a select there.
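The original snippet isn't shown here, but the shape of loop being discussed is something like this (my reconstruction; the node layout and names are made up):

```c
#include <stddef.h>

/* Descending a binary search tree where the direction pick is a data
 * dependency (compilers typically emit csel/cmov for the index
 * expression) rather than a predicted branch. */
typedef struct Node { int key; struct Node *child[2]; } Node;

static Node *find(Node *n, int key) {
    while (n != NULL && n->key != key)
        n = n->child[key > n->key];  /* select, not branch, per level */
    return n;
}
```

Whether the select version actually wins depends on the factors mentioned above, e.g. how predictable the comparisons really are and whether the select serializes loads that prediction would have overlapped.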
How does software these days target all the different CPUs with different instructions?
If I download, say, debian-11.7.0-amd64-netinst.iso - does it somehow dynamically adapt to all the different AMD and Intel CPUs and use the instructions available on the user's machine?
Software compiled to be "portable" uses a reduced subset. You actually have to bully GCC into using the full CPU instruction set with -march=native (you can also put another target CPU arch there).
In short, distributed binaries tend to use "least common denominator" instructions.
I believe one of the touted pros of Gentoo, where everything is compiled locally, is that all the software uses the CPU to its fullest potential.
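There's also a middle ground used by things like glibc: ship several code paths in one binary and pick at runtime. A minimal sketch (the function names are made up; `__builtin_cpu_supports` is a real GCC/Clang builtin on x86):

```c
/* Two hypothetical implementations; in real code the "fancy" one would
 * use AVX2 intrinsics while the baseline sticks to SSE2-era code. */
static int add_baseline(int a, int b) { return a + b; }
static int add_fancy(int a, int b)    { return a + b; }

static int (*add_impl)(int, int) = add_baseline;

/* Resolve once at startup; all later calls go through the pointer.
 * glibc does the same thing more elegantly with ifunc resolvers. */
static void init_dispatch(void) {
#if defined(__x86_64__) || defined(__i386__)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        add_impl = add_fancy;
#endif
}
```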
There are several uarch levels defined for x86-64 which include newer instructions than the baseline. Some distros are starting to move to those higher levels; notably RHEL 9 targets x86-64-v2.
For a while, submissions to the iOS app store could include bitcode, which was LLVM's intermediate byte code. I don't know if they ever did, but Apple could generate architecture-optimized binaries for their various CPU models. They deprecated that last year, though.
.Net ahead-of-time compilation (that is, compiling the .net / clr VM byte code into something your CPU can run directly) could (but apparently doesn't?) do CPU-specific optimizations. The JIT compiler, however does do some CPU-specific optimizations.
Compiler flags. You turn on/off compiler optimisations for target architectures that are aware of all the instruction-set specific hardware level optimisations.
It looks like the reason this apparently weird instruction exists is that AArch64 has a zero register, meaning you can use csinc with two zero register operands to represent cond ? 1 : 0.
Given that AArch64 has/had no 16-bit instruction support, it probably made sense to provide a generalization of a setcond instruction to make use of the encoding space of 32-bit instructions, and that's one of the most obvious (the other ones being cond ? imm : 0 or cond ? imm : reg).
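To make that concrete, the csinc semantics in C (my paraphrase of the architecture pseudocode); note that the `cset rd, cond` alias actually encodes `csinc rd, xzr, xzr` with the inverted condition:

```c
#include <stdint.h>

/* AArch64 csinc: rd = cond ? rn : rm + 1. */
static uint64_t csinc(int cond, uint64_t rn, uint64_t rm) {
    return cond ? rn : rm + 1;
}

/* cset is csinc with both sources wired to the zero register and the
 * condition inverted, yielding rd = cond ? 1 : 0. */
static uint64_t cset(int cond) {
    return csinc(!cond, 0, 0);
}
```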
I wonder how long it will take for all the software to mature enough to fully use the performance of today's hardware. I mean all the optimizations in language compilers, OSes and such. 50 years? 1 year after the first AGI coder?
Although ARM is marketed as RISC, it does have a lot of CISC-like features. I suspect the designers knew that with fixed-size instructions, they had to pack as much as they could into them to increase code density.
I think it is very much in the RISC philosophy to have fewer, more powerful, but still simple, instructions which can be combined with operands in complex ways to do a lot of different things.
Another example of this are all the combinations with the hard-coded zero register. For instance, the `cmp` "instruction" in A64 (and many other RISC ISAs) is actually an alias to the `subs` (subtract and set status flags) instruction with the zero-register as destination.
The idea of the zero register was so potent that modern CISC x86 processors actually have a physical zero register internally, which old x86 instructions are translated to use.
The ARM has lots of instructions, each fairly simple. Compare this to an architecture where a single instruction can ① compute the address of its operands in main memory, ② read them, ③ carry out its main operation and eventually ④ write the result to main memory, with most of those steps optional and depending on the arguments supplied.
danlark | 2 years ago:
You can read about it in https://community.arm.com/arm-community-blogs/b/infrastructu...
[1]: https://stackoverflow.com/a/72341794/28169
https://github.com/riscv/riscv-zicond/blob/main/zicondops.ad...
franky47 | 2 years ago:
https://en.wikipedia.org/wiki/Sinc_function
aleden | 2 years ago:
see https://github.com/bminor/glibc/blob/glibc-2.31/sysdeps/x86_...
zokier | 2 years ago:
You'll find lots of discussions happening around this topic, for example: https://www.phoronix.com/news/Arch-Linux-x86-64-v3-Port-RFC
kramerger | 2 years ago:
I think that is the reason GCC will not use it, although it may if you set the target CPU with -mcpu=
exabrial | 2 years ago:
10/10 on the website. Clean simple design and doesn't download 4,124 javascript libraries for the purpose of displaying static content.
msla | 2 years ago:
Anyway, here's John Mashey, who helped design the MIPS, on RISC v CISC:
https://yarchive.net/comp/risc_definition.html
terrelln | 2 years ago:
Looking forward to your future posts!