(no title)
philh | 4 years ago
> An x86-64 CPU has sixteen integer registers, but 100-200 physical integer registers. Every time an instruction writes to, say, RAX the renamer chooses an available physical register and does the write to it, recording the fact that RAX is now physical-register #137. This allows the breaking of dependency chains, thus allowing execution parallelism.
I'm curious why they have so many more physical registers than... logical? registers. I have a couple of guesses:
* Physical registers are physically cheaper to add than logical registers.
* Adding logical registers breaks backwards compatibility, or at best means you get no speedup on things written (/compiled) for fewer logical registers. Adding physical registers lets you improve performance without recompiling.
* Adding logical registers increases complexity for people writing assembly and/or compilers. Adding physical registers moves that complexity to people designing CPUs.
Are some of these correct? Other reasons I'm missing?
ridiculous_fish|4 years ago
A compiler's IR is a directed graph, where nodes are instructions, each tagged with an assigned register. It would be pleasant to assign each node a distinct register, but then machine instructions would be unacceptably large. So the compiler's register allocator compresses the graph by finding nodes that do not interfere and assigning them the same register.
The CPU's register renamer then reinflates this graph, by inspecting the dataflow between instructions. If two instructions share a register tag, but the second instruction has no dependence on the first, then they may be assigned different physical registers.
`xor eax, eax` has no dependence on any instruction, so it can be specially recognized as allocating a new physical register. In this way of thinking, `xor eax, eax` doesn't zero anything, but is like malloc: it produces a fresh place to read/write, that doesn't alias anything else.
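The malloc analogy can be made concrete with a toy sketch (hypothetical code, nothing like how a real renamer is built in hardware): a table maps architectural names to physical registers, and every write allocates a fresh one.

```python
# Toy sketch of register renaming (illustrative only, not a real design).
# Architectural register names are just tags; each write gets a fresh
# physical register, so later writes to the same name don't alias earlier ones.

class Renamer:
    def __init__(self, num_physical):
        self.free = list(range(num_physical))  # free physical registers
        self.table = {}                        # arch name -> physical reg

    def read(self, arch):
        # A read uses whatever physical register the name currently maps to.
        return self.table[arch]

    def write(self, arch):
        # A write allocates a *fresh* physical register, like malloc:
        # it cannot alias any value still in flight.
        phys = self.free.pop(0)
        self.table[arch] = phys
        return phys

r = Renamer(num_physical=8)
p1 = r.write("eax")          # e.g. xor eax, eax -> fresh register
p2 = r.write("eax")          # a later, independent write to eax
assert p1 != p2              # the two writes no longer share storage
assert r.read("eax") == p2   # reads see the most recent mapping
```

A real core also recycles physical registers when instructions retire and the old mapping is provably dead; that bookkeeping is omitted here.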
brucedawson|4 years ago
And, regardless of how many logical registers you have, you need more physical registers. These are needed for out-of-order (OOO) execution and speculative execution: an OOO, superscalar, speculative CPU can have hundreds of instructions in flight, and those instructions all need physical registers to work on. If you don't have excess physical registers, you can't do OOO or speculative execution.
CalChris|4 years ago
Register renaming increases the performance of both new and existing programs, which is no mean feat. It allows the CPU scheduler to search for more out-of-order parallelism at runtime rather than relying on the compiler to find in-order parallelism.
Binary compatibility may not seem very important now ("don't break userspace" excepted), but it was then. Compatibility made IBM and Intel hundreds of billions of dollars.
CalChris|4 years ago
Register renaming allows instructions to be executed out-of-order [1] which allows for more instruction throughput.
This goes back to 1967 and the IBM 360/91 with its 4 floating point registers. That's not many registers, but Moore's law was making more transistors available. The problem was how to use those transistors to get more throughput from existing programs without changing the ISA and (potentially) breaking compatibility.
The solution was Tomasulo's algorithm [2] which allowed (few) architectural registers to be renamed to (many) physical registers.
Consider, say, a sequence like `f0 = f2 / f4`, `f6 = f0 + f8`, `f8 = f10 * f14` (an illustrative example). The first and third instructions can be executed at the same time on independent functional units; the third is out-of-order with respect to the second.

[1] https://inst.eecs.berkeley.edu/~cs152/sp20/lectures/L10-Comp...
[2] https://en.wikipedia.org/wiki/Tomasulo_algorithm
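The effect of renaming on a hypothetical three-instruction sequence (a classic-style example, not taken from the lecture notes above) can be checked with a small script: before renaming, reusing a register name creates a write-after-read ordering constraint; after renaming, only true read-after-write dependences remain.

```python
# Toy illustration of renaming removing a WAR hazard (hypothetical sequence).
# Each instruction is (destination, [sources]).

prog = [
    ("f0", ["f2", "f4"]),    # i0: f0 = f2 / f4   (slow divide)
    ("f6", ["f0", "f8"]),    # i1: f6 = f0 + f8   (true dependence on i0)
    ("f8", ["f10", "f14"]),  # i2: f8 = f10 * f14 (only a WAR hazard on f8)
]

def renamed(prog):
    # Rewrite each destination to a fresh "physical" register and
    # redirect later reads to the most recent mapping.
    table, out = {}, []
    for n, (dest, srcs) in enumerate(prog):
        srcs = [table.get(s, s) for s in srcs]
        table[dest] = f"p{n}"
        out.append((table[dest], srcs))
    return out

def hazards(prog):
    # Ordering constraints if register names are reused: RAW, WAR, WAW.
    out = []
    for i, (di, si) in enumerate(prog):
        for j, (dj, sj) in enumerate(prog[:i]):
            if dj in si or di in sj or di == dj:
                out.append((i, j))
    return out

assert hazards(prog) == [(1, 0), (2, 1)]   # i2 stuck behind i1 (WAR on f8)
assert hazards(renamed(prog)) == [(1, 0)]  # after renaming, i2 is free to go
```

With the WAR edge gone, a scheduler can issue i2 alongside i0 instead of waiting for i1 to read the old f8.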
cogman10|4 years ago
The OS has to save and restore all registers whenever it switches which thread is running. 100 more logical registers means 100 more locations the OS has to keep track of.
This is part of the reason why new SIMD instruction sets need OS support before you can start using them.
kps|4 years ago
This, plus adding logical registers increases instruction size and therefore decreases the number of instructions that can be fetched with a given memory bandwidth.
guerrilla|4 years ago
> * Adding logical registers increases complexity for people writing assembly and/or compilers. Adding physical registers moves that complexity to people designing CPUs.
These two points come down to the same thing: compatibility, and compatibility is what Intel and AMD have lived on from day one with x86. It's why we still live with this really weird instruction set and all of its historical oddities. Certain features of real mode weren't removed until long into the 64-bit era. Adding things isn't any better: if you wanted to add more registers, you'd have to change the instruction encoding, and the instruction space is finite (an x86 instruction is actually limited to 15 bytes). That would be rather disruptive.
tenebrisalietum|4 years ago
Basically, Intel is saying that even if you had 200 GPRs, you couldn't do a better job of using the free ones than the CPU's scheduler/decoder does.
> Adding *architecturally visible* registers increases complexity for people writing assembly and/or compilers.
More registers just make your code less likely to have to shuffle stuff to and from RAM - which is where values go when you run out of registers.
It's always faster for a CPU to access registers within itself than to talk over a bus to memory. Even when RAM ran at the same speed as the CPU (in the 8-bit era), you would still save a cycle or two.
brucedawson|4 years ago
1) More bits to encode register numbers in instructions. Doubling the number of logical registers costs another two or three bits per instruction, depending on how many registers an instruction references.
2) Logical registers have to be saved on context switches
3) Logical registers have to be saved around function calls. Either the caller or the callee has to save (or not use) registers, and most functions are both callers and callees. That is, if you are not a leaf function, then every register you use either has to be saved to the stack first, or you have to assume that the functions you call will trash it. Thus, more registers have diminishing returns.
4) No matter how many logical registers you have you _always_ want to have an order of magnitude more physical registers, because otherwise you can't implement OOO or speculative execution.
Point #4 is probably the most critical because I think what people are really asking is why are there more physical than logical registers, and OOO/speculative execution is the answer.
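Point #4 can be put in back-of-the-envelope terms (a simplified model, not a real design rule): every in-flight instruction that produces a register result holds a physical register from rename until retirement, on top of the committed architectural state that must stay mapped.

```python
# Rough sizing sketch (illustrative assumptions, not measured data):
# physical registers needed ~= architectural registers that must stay
# mapped + one register per in-flight instruction that writes a result.

def physical_regs_needed(arch_regs, window, frac_writes=0.75):
    # frac_writes: assumed share of in-flight instructions producing a
    # register result -- an illustrative guess, not a measured number.
    return arch_regs + int(window * frac_writes)

# With 16 architectural registers and a ~200-instruction window, you land
# in the low hundreds -- the same ballpark as the 100-200 quoted above.
n = physical_regs_needed(arch_regs=16, window=200)
assert 100 <= n <= 200
```

Which is why the physical register file scales with the OOO window, not with the ISA.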
brucedawson|4 years ago
https://en.wikipedia.org/wiki/Register_renaming
comex|4 years ago
Rename logic from BOOM, a RISC-V core written in a DSL embedded in Scala:
https://github.com/riscv-boom/riscv-boom/blob/1ef2bc6f6c98e5...
From RSD, a core designed for FPGAs written in SystemVerilog:
https://github.com/rsd-devel/rsd/blob/master/Processor/Src/R...
And then there's Alibaba's recently open-sourced XuanTie C910, which contains this Verilog… which is completely unreadable. Seems like it was produced by some kind of code generator that they didn't open-source?
https://github.com/T-head-Semi/openc910/blob/d4a3b947ec9bb8f...
dexen|4 years ago
The CPU has limited memory bandwidth: the larger the instruction size, the more bytes need to be loaded from memory to execute the instruction. The same goes for cache size - the more space an instruction takes, the fewer instructions can be cached. Lastly, there's the complexity and latency of the instruction decoder. This potential performance loss is averted by keeping instructions short and the instruction set "dense".
Any instruction that refers to a register needs a certain number of bits in the operand portion to indicate which specific register(s) to use [1][2][3]. For example, in the case of 8-register x86, an operand generally uses 3 bits just to indicate the register. In the case of 16-register x86_64, it takes 4 bits. If we wanted to address all 200 physical registers, that would require a whole 8 bits in the instruction just to indicate the register. Certain instructions - data transfers, arithmetic and bitwise operations, comparisons, etc. - naturally use two or more registers, so multiply accordingly.
Since using this many registers gives only diminishing returns in performance (and also requires very heavy lifting on the compiler's part [4]), the trade-off selected is that the compiler uses a small, architecture-defined set of registers, and at runtime the processor speeds up some code by using the spare registers for instruction-level execution parallelism.
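The operand-field arithmetic is easy to verify: naming one of N registers takes ceil(log2(N)) bits.

```python
import math

# Bits needed in an operand field to name one of n_regs registers.
def reg_field_bits(n_regs):
    return math.ceil(math.log2(n_regs))

assert reg_field_bits(8) == 3     # classic 8-register x86
assert reg_field_bits(16) == 4    # x86_64's 16 GPRs
assert reg_field_bits(200) == 8   # naming all ~200 physical registers
# A two-register instruction would then spend 16 bits on operands alone:
assert 2 * reg_field_bits(200) == 16
```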
[Edit]
There's one more common circumstance where a large number of registers is undesirable: a change of execution context (thread switch, process switch, interrupt). Typically all architecturally-visible registers are saved to memory on a change of context and a new set is loaded for the new context. The more registers there are, the more work there is to do. Since the hardware registers are managed directly by the CPU and serve more as a cache than as directly accessed registers, they don't need to be stored to memory.
[1] Aside from certain specialized instructions that implicitly use a particular register; for example, in x86 many instructions implicitly use the FLAGS register, and DIV/IDIV integer division implicitly uses the AX and DX registers.
[2] Aside from certain instruction prefixes that influence which register is used; for example, in x86 that would be the segment register overrides.
[3] Aside from certain architectures where registers were organized in a "file" and available only through a "window" - i.e., an implicit context or register-addressing base; instruction operands referred to registers relative to the current window, and the whole window could be shifted by specialized instructions, typically on function entry/exit and the like. This exposed more or less the whole set of hardware registers at the architecture level, but in a somewhat constrained, instruction-dense way.
[4] Working out which registers to use, which to spill to memory, etc. is non-trivial work for the compiler, and the complexity grows super-linearly with the number of registers.
foobiekr|4 years ago
The implementation defines physical registers.
Any implementation is permissible as long as it conforms to the ISA.