philh | 4 years ago

Author in the old thread (https://news.ycombinator.com/item?id=19262249) says

> An x86-64 CPU has sixteen integer registers, but 100-200 physical integer registers. Every time an instruction writes to, say, RAX the renamer chooses an available physical register and does the write to it, recording the fact that RAX is now physical-register #137. This allows the breaking of dependency chains, thus allowing execution parallelism.

I'm curious why they have so many more physical registers than... logical? registers. I have a couple of guesses:

* Physical registers are physically cheaper to add than logical registers.

* Adding logical registers breaks backwards compatibility, or at best means you get no speedup on things written (/compiled) for fewer logical registers. Adding physical registers lets you improve performance without recompiling.

* Adding logical registers increases complexity for people writing assembly and/or compilers. Adding physical registers moves that complexity to people designing CPUs.

Are some of these correct? Other reasons I'm missing?

ridiculous_fish|4 years ago

Don't think of %eax as a real register. Think of it as a tag in a compressed dataflow graph. The compression is performed by the compiler's register allocator, and the decompression is performed by the CPU's register renaming.

A compiler's IR is a directed graph, where nodes are instructions and tagged by an assigned register. It would be pleasant to assign each node a distinct register, but then machine instructions would be unacceptably large. So the compiler's register allocator compresses the graph, by finding nodes that do not interfere and assigning them the same register.

The CPU's register renamer then reinflates this graph, by inspecting the dataflow between instructions. If two instructions share a register tag, but the second instruction has no dependence on the first, then they may be assigned different physical registers.

`xor eax, eax` has no dependence on any instruction, so it can be specially recognized as allocating a new physical register. In this way of thinking, `xor eax, eax` doesn't zero anything, but is like malloc: it produces a fresh place to read/write, that doesn't alias anything else.
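A toy sketch of this "malloc" view in Python (the class and all names are invented for illustration, not how any real CPU is implemented): every write allocates a fresh physical register from a free list, and the xor-zero idiom is just an allocation with no read dependencies.

```python
# Toy register renamer: architectural names are tags; every write
# allocates a fresh physical register from a free list.
class Renamer:
    def __init__(self, num_phys=8):
        self.free = list(range(num_phys))   # available physical registers
        self.rat = {}                       # arch name -> current phys reg

    def rename_write(self, arch):
        """A write to `arch` gets a brand-new physical register."""
        phys = self.free.pop(0)
        self.rat[arch] = phys
        return phys

    def rename_read(self, arch):
        return self.rat[arch]

    def xor_zero(self, arch):
        """`xor eax, eax` reads nothing: it 'mallocs' a fresh, known-zero
        physical register, so it depends on no earlier instruction."""
        return self.rename_write(arch)

r = Renamer()
p1 = r.rename_write("eax")   # mov eax, 1
p2 = r.xor_zero("eax")       # xor eax, eax -> fresh storage, no dep on p1
assert p1 != p2              # same architectural name, different storage
```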

brucedawson|4 years ago

Adding more logical registers is compatibility breaking. And, since you have to encode the register specifier in the instruction, it means larger instructions (hence the compatibility break), which makes fetching and decoding instructions slower.

And, regardless of how many logical registers you have you need to have more physical registers. These are needed for out-of-order (OOO) execution and speculative execution. An OOO super-scalar speculative CPU can have hundreds of instructions in flight and these instructions all need physical registers to work on. If you don't have excess physical registers you can't do OOO or speculative execution.

CalChris|4 years ago

Adding more logical registers doesn't have to break application compatibility. Intel added x87, MMX ... extensions with their new register sets all without breaking compatibility. They even doubled the integer register set in x86_64 with the REX prefix. New programs could use these features and their register sets without existing programs being broken.

What register renaming allows is to increase the performance of both new and existing programs, which is no mean feat. It allows the CPU scheduler to search for more out-of-order parallelism rather than relying on the compiler to find in-order parallelism.

This binary compatibility doesn't seem very important now ("don't break userspace" excepted), but it was then. Compatibility made IBM and Intel hundreds of billions of dollars.

CalChris|4 years ago

> I'm curious why they have so many more physical registers than... logical? registers

Register renaming allows instructions to be executed out-of-order [1] which allows for more instruction throughput.

This goes back to 1967 and to the IBM 360/91 with its 4 floating point registers. That's not many registers but Moore's law was making more transistors available. The problem was how to use these transistors to get more throughput from existing programs without changing the ISA and (potentially) breaking compatibility.

The solution was Tomasulo's algorithm [2] which allowed (few) architectural registers to be renamed to (many) physical registers.

  original         renamed            reordered
  mov RAX, 1       mov PHYS1, 1       mov PHYS1, 1; mov PHYS2, [RCX]
  add RBX, RAX     add RBX, PHYS1     add RBX, PHYS1
  mov RAX, [RCX]   mov PHYS2, [RCX]

The first and third instructions can be executed at the same time on independent functional units, because after renaming they write to different physical registers. The third is out-of-order with respect to the second.
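The renaming step in that table can be sketched in a few lines of Python (a simplification of Tomasulo's algorithm; the dict-based rename table and register numbering are invented for illustration):

```python
def rename(instrs, num_phys=16):
    """instrs: list of (dest, [sources]) using architectural names.
    Returns the same list with physical register numbers substituted."""
    free = iter(range(num_phys))
    rat = {}                           # architectural -> physical

    def phys_for(arch):                # live-in registers get a mapping
        if arch not in rat:            # on first use
            rat[arch] = next(free)
        return rat[arch]

    out = []
    for dest, srcs in instrs:
        reads = [phys_for(s) for s in srcs]  # read current mappings first
        rat[dest] = next(free)               # every write gets a fresh reg
        out.append((rat[dest], reads))
    return out

prog = [("RAX", []),              # mov RAX, 1
        ("RBX", ["RBX", "RAX"]),  # add RBX, RAX
        ("RAX", ["RCX"])]         # mov RAX, [RCX]

renamed = rename(prog)
# The two writes to RAX now target different physical registers, and the
# third instruction reads none of the earlier results, so it can issue
# alongside the first.
assert renamed[0][0] != renamed[2][0]
assert renamed[0][0] not in renamed[2][1]
```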

[1] https://inst.eecs.berkeley.edu/~cs152/sp20/lectures/L10-Comp...

[2] https://en.wikipedia.org/wiki/Tomasulo_algorithm

cogman10|4 years ago

I've not seen this in other responses, but one answer to your question is that every logical register adds overhead to context switching by the operating system.

The OS has to store and load all registers whenever it decides to switch which thread is processing. 100 more logical registers means 100 more locations the OS has to keep track of.

This is part of the reason why new SIMD instruction sets need OS support before you can start using them.
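Back-of-the-envelope numbers make the point (these are illustrative sizes only; real OS save areas vary and include more than registers):

```python
# How much register state must be saved per context switch.
GPR_BYTES = 8                     # one 64-bit integer register
x86_64_gprs = 16 * GPR_BYTES      # 16 GPRs -> 128 bytes
hypothetical = 200 * GPR_BYTES    # 200 architectural regs -> 1600 bytes

# SIMD state dwarfs both, which is why the OS must explicitly support
# (and save/restore) new SIMD register files before programs can use them:
avx512_zmm = 32 * 64              # 32 ZMM registers x 64 bytes = 2048 bytes

print(x86_64_gprs, hypothetical, avx512_zmm)   # 128 1600 2048
```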

kps|4 years ago

> Adding logical registers breaks backwards compatibility

This, plus adding logical registers increases instruction size and therefore decreases the number of instructions that can be fetched with a given memory bandwidth.

enragedcacti|4 years ago

It can definitely be beneficial to add logical registers. When AMD designed x86-64 they doubled the number of general-purpose logical registers from 8 to 16. As other commenters have said, unless you are already making breaking changes, increasing the number of logical registers is probably not worth it.

msla|4 years ago

Having more physical registers than logical means the CPU can do optimizations, opcodes don't have to be as big (it takes more bits to encode more registers), compatibility with older binary code is maintained, and CPUs at different price points can have different numbers of physical registers while all being able to run the same binaries.

(I don't know if any manufacturer actually does that last thing, however.)

guerrilla|4 years ago

> * Adding logical registers breaks backwards compatibility, or at best means you get no speedup on things written (/compiled) for fewer logical registers. Adding physical registers lets you improve performance without recompiling.

> * Adding logical registers increases complexity for people writing assembly and/or compilers. Adding physical registers moves that complexity to people designing CPUs.

These two points are the same thing: compatibility, and compatibility is what Intel and AMD have lived on from day one with x86. It's why we still live with this really weird instruction set with all of its historical oddities. Certain features of real mode weren't removed until long into the 64-bit era. Adding things isn't any better: if you wanted to add more registers, you'd have to change the instruction encoding, and the encoding space is finite (an x86 instruction is limited to 15 bytes). That would be rather disruptive.

tenebrisalietum|4 years ago

I think your second point hits it and is the primary benefit to hiding the microarchitecture layer - it can be improved and existing code will benefit from it.

Basically Intel is saying if you had 200 GPRs, you couldn't do better at using the free ones than the CPU scheduler/decoder.

> Adding *architecturally visible* registers increases complexity for people writing assembly and/or compilers.

More registers just makes your code less likely to have to shuffle values to and from RAM - which is where values go when you run out of registers.

It's always faster for a CPU to access registers within itself than have to talk over a bus to a memory. Even when RAM was the same speed as CPUs (8-bit era) you would still save a cycle or two.

brucedawson|4 years ago

Having more logical/architectural registers is great except for a few costs:

1) More bits to encode register numbers in instructions. Doubling the number of logical registers costs another two or three bits, depending on how many registers are referenced in an instruction.

2) Logical registers have to be saved on context switches

3) Logical registers have to be saved around function calls. Either the caller or the callee has to save (or not use) registers, and most functions are both callers and callees. That is, if you are not a leaf-node function then every register you use you have to first save to the stack, or else assume that the functions you call will trash it. Thus, more registers have diminishing returns.

4) No matter how many logical registers you have you _always_ want to have an order of magnitude more physical registers, because otherwise you can't implement OOO or speculative execution.

Point #4 is probably the most critical because I think what people are really asking is why are there more physical than logical registers, and OOO/speculative execution is the answer.
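Point 1 is easy to quantify. A quick calculation, assuming a hypothetical three-operand encoding (dest, src1, src2) purely for illustration:

```python
import math

# Bits spent just naming registers in a 3-operand instruction,
# as a function of how many architectural registers are exposed.
def reg_bits(n_regs, operands=3):
    return operands * math.ceil(math.log2(n_regs))

print(reg_bits(16))    # 16 regs (x86-64-style): 3 x 4 = 12 bits
print(reg_bits(32))    # doubling to 32 regs:    3 x 5 = 15 bits
print(reg_bits(200))   # exposing ~200 physical: 3 x 8 = 24 bits
```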

captainmuon|4 years ago

I wonder how things like register renaming (or pipelining) are implemented. It would seem difficult even in a high level language, but they do it inside the processor. Is this in microcode that runs on the "actual" processor? Or is it in hardware? Do they write the algorithm in a language like VHDL or Verilog?

monocasa|4 years ago

Renaming isn't really done in microcode. Microcode is just another source for ops that get renamed. All of the renaming happens in hardware, and boils down to a handful of tables inside the processor.

comex|4 years ago

Some cores are open source and you can see for yourself.

Rename logic from BOOM, a RISC-V core written in a DSL embedded in Scala:

https://github.com/riscv-boom/riscv-boom/blob/1ef2bc6f6c98e5...

From RSD, a core designed for FPGAs written in SystemVerilog:

https://github.com/rsd-devel/rsd/blob/master/Processor/Src/R...

And then there's Alibaba's recently open-sourced XuanTie C910, which contains this Verilog… which is completely unreadable. Seems like it was produced by some kind of code generator that they didn't open-source?

https://github.com/T-head-Semi/openc910/blob/d4a3b947ec9bb8f...

dnautics|4 years ago

All of the above, I believe.

dexen|4 years ago

Beyond the reasons & limitations explained in sibling posts, having a large number of (logical / instruction-level) registers also inflates instruction size, which diminishes instruction density and thus lowers performance - so there is a trade-off between instruction density and a large number of registers. Hear me out.

The CPU has limited memory bandwidth; the larger the instruction size, the more bytes need to be loaded from memory to execute an instruction. Same with cache size - the more space an instruction takes, the fewer instructions can be cached. Lastly, there's the complexity & latency of the instruction decoder. These potential performance losses are averted by keeping instructions short and the instruction set "dense".

Any instruction that refers to a register needs a certain number of bits in the operand portion to indicate which specific register(s) to use [1][2][3]. For example, in the case of 8-register x86 the operand generally uses 3 bits just to indicate which register to use. In the case of 16-register x86_64, it takes 4 bits. If we wanted to address all ~200 physical registers, that would require a full 8 bits in the instruction just to indicate one register. Certain instructions - data transfer, algebra & bitwise operations, comparisons, etc. - naturally use two or more registers, so multiply that accordingly.

Since using this many registers gives only diminishing returns in performance (and also requires very heavy lifting on the compiler's part [4]), the trade-off chosen is that the compiler uses a small, architecture-defined number of registers, and at runtime the processor is able to speed up code by using the spare physical registers for instruction-level execution parallelism.

[Edit]

There's one more common circumstance where a large number of registers is undesirable: a change of execution context (thread switch; process switch; interrupt). Typically all architecturally-visible registers are saved to memory on a change of context, and a new set is loaded for the new context. The more registers there are, the more work there is to do. Since the spare physical registers are managed directly by the CPU and serve more as a cache than as directly addressed registers, they don't need to be stored to memory.

[1] Aside from certain specialized instructions that implicitly use a particular register; for example in x86 many instructions implicitly use the FLAGS register, and DIV/IDIV integer division implicitly uses the AX and DX registers.

[2] Aside from certain instruction prefixes that influence which register is used; for example in x86 that would be segment register overrides.

[3] Aside from certain architectures where registers were organized in a "file" and available only through a "window" - i.e., an implicit context, an implicit register addressing base; instruction operands referred to registers relative to the current window, and the whole window could be shifted by specialized instructions, typically on function entry/exit and the like. This was more-or-less the whole set of hardware registers being exposed at the architecture level, albeit in a somewhat constrained / instruction-dense way.

[4] Deciding which registers to use and which values to spill to memory is non-trivial work for the compiler, and the complexity grows super-linearly with the number of registers.

foobiekr|4 years ago

the ISA defines logical registers

the implementation defines physical registers

any implementation is permissible as long as it conforms to the ISA