top | item 46107200

eb0la | 3 months ago

I remember a lot of code zeroing registers, dating back at least to the IBM PC XT days (before the 80286).

If you decode the instruction, it makes sense to use XOR:

- mov ax, 0 - needs 4 bytes (66 b8 00 00)
- xor ax,ax - needs 3 bytes (66 31 c0)

This extra byte did matter on a machine with less than 1 megabyte of memory.

On 386 processors the same held for the 32-bit registers:

- mov eax,0 - needs 5 bytes (b8 00 00 00 00)
- xor eax,eax - needs 2 bytes (31 c0)

Here Intel made the decision to use only 2 bytes. I bet this helps both the instruction decoder and (of course) saves more memory than the old 8086 instruction.

Sharlin|3 months ago

As the author says, a couple of extra bytes still matter, perhaps more than 20ish years ago. There are vast amounts of RAM, sure, but it's glacially slow, and there's only a few tens of kBs of L1 instruction cache.

Never mind the fact that, as the author also mentions, the xor idiom takes essentially zero cycles to execute because nothing actually happens besides assigning a new pre-zeroed physical register to the logical register name early on in the pipeline, after which the instruction is retired.

umanwizard|3 months ago

> nothing actually happens besides assigning a new pre-zeroed physical register to the logical register name early on in the pipeline, after which the instruction is retired.

This is slightly inaccurate -- instructions retire in order, so it doesn't necessarily retire immediately after it's decoded and the new zeroed register is assigned. It has to sit in the reorder buffer waiting until all the instructions ahead of it are retired as well.

Thus in workloads where reorder buffer size is a bottleneck, it could contribute to that. However I doubt this describes most workloads.
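A toy sketch of that in-order retirement rule, in Python. This is a deliberately simplified model, not how any real core is implemented: the function name `retire_cycles`, the cycle numbers, and the assumption of unlimited retire width are all made up for illustration.

```python
from collections import deque


def retire_cycles(instrs):
    """Toy model of in-order retirement.

    instrs: list of (name, cycle_when_execution_finishes) in program order.
    Returns {name: cycle at which the instruction retires}.
    Assumes unlimited retire width, so the only constraint modelled is
    that an instruction cannot retire before everything ahead of it.
    """
    rob = deque(instrs)          # reorder buffer, oldest entry at the left
    retired = {}
    cycle = 0
    while rob:
        name, done_at = rob.popleft()
        # The head retires once it has finished executing, and never
        # earlier than the instruction that retired just before it.
        cycle = max(cycle, done_at)
        retired[name] = cycle
    return retired


if __name__ == "__main__":
    program = [
        ("load from RAM", 200),   # hypothetical cache miss
        ("xor eax, eax", 1),      # "done" at rename, almost immediately
        ("add ebx, eax", 2),
    ]
    for name, cycle in retire_cycles(program).items():
        print(f"{name:15s} retires at cycle {cycle}")
```

In this model the xor finishes at cycle 1 but still occupies its reorder buffer slot until the slow load ahead of it retires.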

cogman10|3 months ago

L1 instruction cache is backed by L2 and L3 caches.

For the AMD 9950, we are talking about 1280 KB of L1 in total (80 KB per core), 16 MB of L2 in total (1 MB per core), and 64 MB of L3 (shared, 128 MB if you have the X3D version).

I won't say it doesn't matter, but it doesn't matter as much as it once did. CPU caches have gotten huge while the instructions remain the same size.

The more important part, at this point, is that it's idiomatic. That means hardware designers are much more likely to put in specialty logic to make sure it's fast. It's a common enough operation to deserve its own special cases. You can fit a lot of 8-byte instructions into 1280 KB of memory. And as it turns out, it's pretty common for applications to spend a lot of their time in small chunks of instructions. The slow part of a lot of code will be that `for loop` with the 30 AVX instructions doing magic. That's why you'll often see compilers burn `NOP` instructions to align a loop. That's to avoid splitting a cache line.
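To make the alignment arithmetic concrete, here is a rough Python sketch. The 64-byte line size, the addresses, and the helper names `nop_padding` and `cache_lines_touched` are assumptions for illustration; real compilers often align to 16 or 32 bytes and apply their own heuristics.

```python
def nop_padding(address, alignment=64):
    """Bytes of padding needed to move `address` up to the next boundary."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (-address) % alignment


def cache_lines_touched(start, length, line_size=64):
    """Number of cache lines covered by `length` bytes starting at `start`."""
    first_line = start // line_size
    last_line = (start + length - 1) // line_size
    return last_line - first_line + 1


if __name__ == "__main__":
    loop_start, loop_size = 0x1030, 60   # hypothetical loop body
    pad = nop_padding(loop_start)
    print("unaligned:", cache_lines_touched(loop_start, loop_size), "lines")
    print("padding  :", pad, "bytes of NOP")
    print("aligned  :", cache_lines_touched(loop_start + pad, loop_size), "lines")
```

With these made-up numbers, 16 bytes of padding let the 60-byte loop sit in one line instead of straddling two.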

vardump|3 months ago

> - mov ax, 0 - needs 4 bytes (66 b8 00 00) - xor ax,ax - needs 3 bytes (66 31 c0)

You don't need operand size prefix 0x66 when running 16 bit code in Real Mode. So "mov ax, 0" is 3 bytes and "xor ax, ax" is just 2 bytes.
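A small Python sketch that tabulates the sizes in both modes. The byte values are the ones quoted in this thread (with the 0x66 prefix dropped for 16-bit code); the table layout and the `bytes_saved` helper are just illustrative names.

```python
# Machine code for each zeroing variant, keyed by (instruction, mode).
# "32-bit" means a 32-bit code segment, where 16-bit operands need the
# 0x66 operand size prefix; "16-bit" means 16-bit code such as Real Mode.
ENCODINGS = {
    ("mov ax, 0", "16-bit"): bytes.fromhex("b8 00 00"),
    ("xor ax, ax", "16-bit"): bytes.fromhex("31 c0"),
    ("mov ax, 0", "32-bit"): bytes.fromhex("66 b8 00 00"),
    ("xor ax, ax", "32-bit"): bytes.fromhex("66 31 c0"),
    ("mov eax, 0", "32-bit"): bytes.fromhex("b8 00 00 00 00"),
    ("xor eax, eax", "32-bit"): bytes.fromhex("31 c0"),
}


def bytes_saved(mov, xor, mode):
    """How many bytes the xor form saves over the mov form in `mode`."""
    return len(ENCODINGS[(mov, mode)]) - len(ENCODINGS[(xor, mode)])


if __name__ == "__main__":
    for (instr, mode), code in ENCODINGS.items():
        print(f"{mode:7s} {instr:13s} {len(code)} bytes  {code.hex(' ')}")
```

So the xor form saves one byte for ax in either mode, and three bytes for eax.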

eb0la|3 months ago

My fault: I just compiled the instruction with an assembler instead of looking up the actual instruction from documentation.

It makes much more sense: resetting ax and bx (xor ax,ax ; xor bx,bx) will be 4 octets, DWORD aligned, and a bit faster to fetch by the x86 than the 3-octet version I wrote before.

Someone|3 months ago

> If you decode the instruction, it makes sense to use XOR:

> - mov ax, 0 - needs 4 bytes (66 b8 00 00) - xor ax,ax - needs 3 bytes (66 31 c0)

Except, apparently, on the Pentium Pro, according to this comment: https://randomascii.wordpress.com/2012/12/29/the-surprising-..., which says:

“But there was at least one out-of-order design that did not recognize xor reg, reg as a special case: the Pentium Pro. The Intel Optimization manuals for the Pentium Pro recommended “mov” to zero a register.”

qingcharles|3 months ago

That's weird, I looked it up earlier and found the P6 (Pentium Pro) was the first to actually make the xor optimization into a zero clock operation.

https://fanael.github.io/archives/topic-microarchitecture-ar...

Anarch157a|3 months ago

I don't know enough of the 8086 so I don't know if this works the same, but on the Z80 (which means it was probably true for the 8080 too), XOR A would also clear pretty much all bits on the flag register, meaning the flags would be in a known state before doing something that could affect them.
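For what it's worth, here is a minimal Python sketch of the documented Z80 flag results for XOR, as far as I recall them: S and Z follow the result, P/V holds parity, and H, N and C are reset. The undocumented bits 3 and 5 are ignored, and the function name `z80_xor` is just for illustration.

```python
# Bit positions in the Z80 F register (documented flags only).
FLAG_S = 0x80   # sign
FLAG_Z = 0x40   # zero
FLAG_H = 0x10   # half carry
FLAG_PV = 0x04  # parity / overflow
FLAG_N = 0x02   # add / subtract
FLAG_C = 0x01   # carry


def z80_xor(a, operand):
    """Return (result, flags) for XOR of the accumulator with `operand`."""
    result = (a ^ operand) & 0xFF
    flags = 0                      # H, N and C are always reset by XOR
    if result & 0x80:
        flags |= FLAG_S
    if result == 0:
        flags |= FLAG_Z
    if bin(result).count("1") % 2 == 0:
        flags |= FLAG_PV           # set when the result has even parity
    return result, flags


if __name__ == "__main__":
    for a in (0x00, 0x5A, 0xFF):
        result, flags = z80_xor(a, a)   # XOR A: operand is A itself
        print(f"A={a:#04x} -> A={result:#04x}, F={flags:#04x}")
```

So strictly speaking XOR A does not clear everything: Z and P/V end up set and the rest reset, but the point stands that the flags land in the same known state whatever A held before. The 8086 is similar for xor ax,ax: CF and OF are cleared and ZF is set, with AF left undefined.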

vanderZwan|3 months ago

Which I guess is the same reason why modern Intel CPU pipelines can rely on it for pipelining.

RHSeeger|3 months ago

> the IBM PC XT days (before the 80286)

Fun fact - the IBM PC XT also came in a 286 model (the XT 286).

eb0la|3 months ago

You're right. I forgot that!

chasd00|3 months ago

> - mov ax, 0 - needs 4 bytes (66 b8 00 00) - xor ax,ax - needs 3 bytes (66 31 c0)

iirc doesn't word alignment matter? I have no idea if this is how the IBM PC XT was aligned but if you had 4 byte words then it doesn't matter if you save a byte with xor because you wouldn't be able to use it for anything else anyway. again, iirc.

Narishma|2 months ago

No, the 8088 used in the PC has a 2 byte word size. More importantly, it only has an 8-bit data bus, so alignment didn't really matter because it fetched instructions one byte at a time.
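A back-of-the-envelope Python sketch of why alignment is irrelevant on an 8-bit bus. It only counts bus accesses needed to cover an instruction's bytes and ignores the prefetch queue and everything else about real timing; `fetch_bus_cycles` and the addresses are hypothetical.

```python
def fetch_bus_cycles(start, length, bus_bytes):
    """Bus accesses needed to read `length` bytes starting at `start`.

    bus_bytes=1 models an 8-bit bus (8088), bus_bytes=2 a 16-bit bus
    (8086) that can only read naturally aligned words.
    """
    first_unit = start // bus_bytes
    last_unit = (start + length - 1) // bus_bytes
    return last_unit - first_unit + 1


if __name__ == "__main__":
    for start in (0x100, 0x101):
        for length in (2, 3):
            c8 = fetch_bus_cycles(start, length, 1)
            c16 = fetch_bus_cycles(start, length, 2)
            print(f"start={start:#x} len={length}: 8-bit bus {c8}, 16-bit bus {c16}")
```

On the 8-bit bus the count is simply the instruction length, whatever the start address; only the 16-bit bus cares whether a 2-byte instruction starts on an odd address.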