item 18230383

Modern Microprocessors – A 90-Minute Guide (2001-2016)

443 points | projectileboy | 7 years ago | lighterra.com

87 comments

[+] ridiculous_fish|7 years ago|reply
This is no doubt obvious to hardware folks, but one enlightening moment is when I came to understand register renaming.

Previously I had the (wrong) idea that rdi, rsi, etc corresponded to physical bits. Register renaming involved some exotic notion where these registers might be saved and restored.

Now I understand that rdi, rsi, etc. are nothing but compressed node identifiers in an interference graph. Ideally we'd have infinite registers: each instruction would read from previous registers and output to a new register. And so we would never reuse any register, and there'd be no data hazards.

Alas we have finite bits in our ISA, so we must re-use registers. Register renaming is the technique of reconstructing an approximate infinite-register interference graph from its compressed form, and then re-mapping that onto a finite (but larger) set of physical registers.

Mostly "register renaming" is a bad name. "Dynamic physical register allocation" is better.

[+] cgrand-net|7 years ago|reply
"Now I understand that rdi, rsi, etc. are nothing but compressed node identifiers in an interference graph. Ideally we'd have infinite registers: each instruction would read from previous registers and output to a new register. And so we would never reuse any register, and there'd be no data hazards"

I'm under the impression that we don't see the causality flowing in the same direction.

To me we first had a limited set of registers (imposed by the ISA), then to get better perf through out of order execution, cpus had to infer a deps graph and use register renaming.

Ironically all this silicon is spent to recover information that was known to the compiler (eg through SSA) and lost during codegen (register allocation).

[+] rayiner|7 years ago|reply
You’re really reconstructing a data dependence graph, not an interference graph. In an interference graph, there is an edge between nodes that are alive at the same time. That information remains implicit in the OOO machine (because a physical register stays allocated to a value through retirement). In a data dependence graph there are edges from operations to their operands. You need to reconstruct the data dependence links explicitly: during renaming you need to map the input operands to the correct renamed physical registers. This is where “renaming” comes from. You “rename” the architectural register operands in each instruction with the corresponding physical register or tag: https://www.d.umn.edu/~gshute/arch/register-renaming.xhtml. After renaming, the data dependence graph will be explicit in the reorder buffer.
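The bookkeeping described above can be sketched as a toy model (hypothetical register names, and a pretend-infinite free list; real hardware recycles physical registers at retirement): each destination gets a fresh physical register, and each source operand is rewritten through the current architectural-to-physical map, which makes the true dependences explicit and eliminates the false (WAW/WAR) ones.

```python
import itertools

# Toy rename stage: fresh physical register per destination,
# sources looked up in the current map. Names are made up.
phys = (f"p{i}" for i in itertools.count())  # pretend-infinite free list
rename_map = {}                              # architectural -> physical

def rename(dest, srcs):
    renamed_srcs = [rename_map.get(s, s) for s in srcs]  # follow the map
    rename_map[dest] = next(phys)            # fresh register kills WAW/WAR
    return rename_map[dest], renamed_srcs

# Two uses of rax that are really independent chains:
i0 = rename("rax", ["rbx"])   # rax = f(rbx)
i1 = rename("rcx", ["rax"])   # rcx = g(rax)  -- true (RAW) dependence on i0
i2 = rename("rax", ["rdx"])   # rax = h(rdx)  -- reuses rax, but gets a fresh
                              #    physical register, so no false dependence
```

After renaming, i1 reads the physical register produced by i0, while i2's reuse of rax carries no link back to the first chain at all.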
[+] AtlasBarfed|7 years ago|reply
As was pointed out in another thread (which one, I don't recall), almost all aspects of assembly programming in x86 aren't really low level anymore. It's just another abstracted high-level language, although one that still mirrors/is tied to the fundamental idea of an x86 machine.

As detailed in the article, x86 opcodes are translated/compiled on the fly to micro-ops.

The registers are renamed as needed to squeeze out false data dependencies.

Memory locations are actually fulfilled by any of three to five levels of cache, RAM, or swap.

Cores vary in their issue width, and can share the same instruction pipelines via hyperthreading.

[+] zaarn|7 years ago|reply
While not quite infinite, what you want sounds awfully close to a belt machine:

Each instruction reads data from positions on the belt and then puts a few values at the start of the belt. This causes values on the end to "fall off".

The MILL CPU is an example of this architecture.
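A minimal sketch of the belt idea, assuming an 8-position belt and heavily simplified semantics (the real Mill's belt behavior is per-cycle and considerably more subtle): instructions address operands by belt position, results are pushed to the front, and the oldest values fall off the end, so no register is ever named or reused.

```python
from collections import deque

BELT_LEN = 8  # assumed belt length for this sketch

def belt_push(belt, *values):
    for v in values:
        belt.appendleft(v)   # new results appear at position 0
    while len(belt) > BELT_LEN:
        belt.pop()           # oldest values "fall off" the end

belt = deque([0] * BELT_LEN)
belt_push(belt, 3)                   # a producer drops 3 at position 0
belt_push(belt, 4)                   # now 4 is at position 0, 3 at position 1
belt_push(belt, belt[0] + belt[1])   # "add b0, b1" pushes 7 onto the belt
```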

[+] fosk|7 years ago|reply
> Ideally we'd have infinite registers: each instruction would read from previous registers and output to a new register. And so we would never reuse any register, and there'd be no data hazards.

Crazy thought but would it be possible to build a decentralized and distributed CPU on top of the blockchain?

Edit: to the downvotes, care to explain? I am genuinely curious to know the answer, and curiosity should be a driving factor in this community

[+] jhallenworld|7 years ago|reply
There are some missing I/O things involving DMA.

In the old days, DMA from (say) a PCI device would go directly to and from DRAM. This incurs a high latency if the CPU needs to access this data.

Network processors found a simple solution: DMA goes to the cache, not the DRAM. This reduces the I/O latency to the processor and simplifies I/O coherency. I know Cavium's NPs rely on this.

Intel picked this up for server and desktop processors once both the memory controller and PCIe were integrated on the same die. They called it DDIO:

https://www.intel.com/content/www/us/en/io/data-direct-i-o-t...

You can support 100 G Ethernet with Intel Xeon processors these days due to this.

Another story is how DMA in the x86 world is cache coherent (no need to use uncached memory or to flush before starting an I/O operation, which I have to do on ARM). This is awesome from a device driver writer's point of view and is the result of having to support old operating systems from the pre-cache days.

I think the future will involve better control of how cache is shared. For example, if you know a program is going to access a lot of memory but does not need to keep it around for long, it will, as a side effect, evict useful data from the cache. Better would be to declare that a thread should only be able to use some fraction of the cache, so that it does not interfere with other threads so much.

[+] dragontamer|7 years ago|reply
> Another story is how DMA in the x86 world is cache coherent (no need to use uncached memory or to flush before starting an I/O operation, which I have to do on ARM). This is awesome from a device driver writer's point of view and is the result of having to support old operating systems from the pre-cache days.

Nitpick: You mean "Sequentially consistent".

ARM is cache coherent, but NOT sequentially consistent. x86 is almost sequentially consistent (only a few obscure instructions here and there violate it).

[+] gpderetta|7 years ago|reply
> I think the future will involve better control of how cache is shared.

More fine grained cache control is definitely in the near future. x86 CPUs already allow partitioning L3 cache regions to different cores as desired. AMD CPUs provide the CLZERO instruction to quickly acquire a cacheline in exclusive mode and drop its content without waiting for any remotely modified data, which is great to implement message queues.

[+] rayiner|7 years ago|reply
Small nit: Intel desktop and workstation processors (Core i5/i7, Xeon E3, Xeon W) do not have DDIO.
[+] signa11|7 years ago|reply
indeed. dpdk uses this (and other techniques) to achieve 10gbps line-rate (at least) packet forwarding on minimally sized packets per (x86) core.
[+] ecuzzillo|7 years ago|reply
So, in @rygorous's excellent Twitch streams about CPU architecture (first one here: https://www.youtube.com/watch?v=oDrorJar0kM), he said that it was basically a myth that x86 architectures dynamically decoded into internal RISC instructions. I am thus a little skeptical of the article in general, since I don't know enough myself to verify each thing.
[+] abainbridge|7 years ago|reply
x86 really does decode CISC into RISC-like instructions. They're called micro-ops. Some of the instruction cache stores these translated instructions. People research the details of this. See https://www.agner.org/optimize/blog/read.php?i=142&v=t

The article looked about right to me.

I didn't watch the (3 hour!) video you linked to. Can you give the time offset where the myth you refer to is explained?

[+] baybal2|7 years ago|reply
I'd say a good thing to add would be that the lion's share of progress in the last 5 years has been around cache architectures.

Everything described in the article, like superscalar execution and OoO, had been squeezed to its practical maximum by around the early Core 2 Duo era, with all later advances mostly coming without qualitative architectural improvements.

In that regard, Apple's recent chips got quite far. They reached near-desktop-level performance without super complex predictors, on-chip op reordering, or gigantic pipelines.

Yes, their latest chip has quite a sizeable pipeline, and total on-chip cache comparable to low-end server CPUs, but their distinction is that they managed to improve cache usage efficiency immensely. A big cache wouldn't do much for performance if you have to flush it frequently. In fact, the average cache flush frequency is what determines where diminishing returns start with regard to cache size.

[+] gpderetta|7 years ago|reply
Apple CPUs are quite sophisticated wide and deep OoO "brainiac" designs with state-of-the-art branch predictors.

There is nothing simple about them. The only reason they don't reach desktop-level performance is that the architecture has been optimized for a lower frequency target for power consumption.

A desktop-optimized design would probably be slightly narrower (so that decoding is feasible with a smaller time budget) and possibly deeper, to accommodate the higher memory latency. Having said that, the last generation is not very far from reasonable desktop frequencies and might work as-is.

[+] meuk|7 years ago|reply
Interesting. Is the improvement in performance mostly due to increases in cache size, count, and speed? Did fine-tuning the cache parameters (number of caches, cache size, cache line size) help, or are there more fundamental architectural improvements? Do you have links to more information?
[+] graycat|7 years ago|reply
IBM was doing essentially all that stuff before 1990 or so except possibly for multiple threads per processor core. So, there was pipelining, branch prediction, speculative execution, vector instructions, etc.

Then I was in an AI group at the Watson Lab, and two guys down the hall had some special hardware attached to the processors and were collecting and analyzing performance data based on those design features.

[+] oblio|7 years ago|reply
IBM was doing all sorts of amazing stuff before the 1990s. They had VMs, containers, etc.

Personally, I'd say that I don't care. They didn't want to make that technology available to the masses, we barely even got the PC architecture because they made several strategic blunders.

If the tech exists but isn't reachable by common folks, in my eyes that's as bad as, if not worse than, it not existing at all.

[+] phendrenad2|7 years ago|reply
Ah, bit rot. Both of the links to "interesting articles" at the bottom of the page are gone ("Designing an Alpha Microprocessor" 404s and the video appears to be gone from "Things CPU Architects Need To Think About"). Anyone know where these might have moved to?

(Anyway, great post!)

[+] shakna|7 years ago|reply
So, trying to hunt these down.

Designing an Alpha Microprocessor first appeared in a magazine called 'Computer', Volume 32, Issue 7, July 1999. It was on pages 27-34, and written by Matt Reilly.

It has a few citations [0]. (And though I owned a lot of them, I don't think I read this particular issue.)

Members can buy it from the IEEE [0]. That appears to be the only recourse.

---

Things CPU Architects Need To Think About has a cover page here [1]. Unfortunately, the video isn't attached. It was part of the Stanford class EE380, which has a YouTube playlist [2]; unfortunately, though a lot of the talks are good, they don't include our video. Even worse, I found a fairly recent comment from another HNer [3] which suggests all online copies are gone. By persisting, I found the original asx via the Wayback Machine [4], which is utterly useless without the server.

Alas, I cannot find any working copy.

[0] https://ieeexplore.ieee.org/document/774915

[1] https://web.stanford.edu/class/ee380/Abstracts/040218.html

[2] https://www.youtube.com/playlist?list=PLoROMvodv4rMWw6rRoeSp...

[3] https://news.ycombinator.com/item?id=15900610

[4] https://web.archive.org/web/20130325010756/http://stanford-o...

[+] rdc12|7 years ago|reply
Looks like IEEE, ACM and ResearchGate all have a copy of Designing an Alpha Microprocessor, but the first two are paywalled and the latter requires you to request the text (possibly paid as well).

Couldn't find anything on the other though, sadly.

[+] penglish1|7 years ago|reply
I could use an overview that includes an update to the Computer Architecture class I took in the early 90's. This is good - for "general purpose" microprocessors.

At that time, nothing at all was said about GPUs - they basically didn't count at all. I don't really recall anything about DSPs either. And FPGAs were considered neat and exotic, but a little useless, particularly compared to their cost and more of a topic for EE majors.

Now I've seen a great update (posted to HN) about how FPGAs are basically... no longer FPGAs, and include discrete microprocessors, GPUs and DSPs... often many (low-powered) of each!

This statement: "The programmable shaders in graphics processors (GPUs) are sometimes VLIW designs, as are many digital signal processors (DSPs),"

is about as far as it goes. Can someone point me to a 90-minute guide that expands on that?

* What about the GPUs and DSPs that are not VLIW designs?

* What is the architecture of some of the more common GPUs and DSPs in general use today? (As the article covers common Intel, AMD and ARM designs.) E.g., differences between current AMD and NVIDIA designs? I don't even know what "common 2018 DSPs" might be!

* How does anything change in FPGAs now, and where is that heading? (The FPGAs-aren't-FPGAs article was a few years old.)

[+] KMag|7 years ago|reply
A few questions I've had for a while:

First, if a reasonably high-performance processor is going to use register renaming anyway, why not make split register files an implementation detail? Tiny embedded processors can do without register renaming and have a single register file. Higher-performance implementations can use split register files dedicated to functional units. Very few pieces of code need both a large number of integer registers and a large number of floating-point registers.

Second, on architectures designed with 4-operand fused multiply-add (FMA4) from the start, and a zero-register (like the Alpha's r31, SPARC's g0, MIPS's r0, etc.), why not make the zero-register instead an identity-element-register that acts as a zero when adding/subtracting and a one when multiplying/dividing? An architecture could optimize an FMA to a simple add, a simple multiply, or simply a move (depending on identity-element-register usage) in the decode stage, or a minimal-area FPGA implementation could just run the full FMA. This avoids using up valuable opcodes for these 3 operations that can just be viewed as special cases of FMA. Move: rA = 1.0 * rC + 0.0. Add: rA = 1.0 * rC + rD. Multiply: rA = rB * rC + 0.0. FMA: rA = rB * rC + rD.
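The proposed decode-stage optimization can be sketched as follows (hypothetical register names; "id" stands for the suggested identity-element register, read as 0.0 by the adder and 1.0 by the multiplier, so move/add/mul all fall out of the single FMA4 form rA = rB * rC + rD):

```python
# Toy decoder for the identity-element-register idea above.
# "id" is the hypothetical register; everything else is a normal operand.
def decode_fma(rB, rC, rD):
    if rB == "id" and rD == "id":
        return ("mov", rC)            # 1.0 * rC + 0.0
    if rB == "id":
        return ("add", rC, rD)        # 1.0 * rC + rD
    if rD == "id":
        return ("mul", rB, rC)        # rB * rC + 0.0
    return ("fma", rB, rC, rD)        # full fused multiply-add
```

As the comment suggests, a high-end core could crack these into cheaper ops at decode, while a minimal implementation could ignore the special cases and always run the full FMA.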

[+] Symmetry|7 years ago|reply
A processor tiny enough that co-locating the integer and floating point computation units closely enough to share a register bank is a good idea will be too small to use register renaming. Having separate clusters with their own banks and their own bypass networks is a really big win.

For the second, if you have a variable-length instruction encoding scheme, adding an extra argument is going to increase i-cache pressure. If not, then you might as well, if you do FMA4; but I think most fixed-encoding ISAs use FMA3.

[+] rayiner|7 years ago|reply
In a three-address machine, separating the integer and floating point registers basically saves you three bits per instruction word compared to a unified register file of the same aggregate size. Also, on a 32-bit machine, you save a few transistors by making the integer rename registers 32 bits instead of all 64 bits to accommodate a double float. (And if you have vectors, it really makes no sense to throw away 128 or 256 or 512 bits to store a 32-bit or 64-bit integer).
[+] deepnotderp|7 years ago|reply
Integer and fp is indeed separate in many modern processors.
[+] twtw|7 years ago|reply
I guess this is a good opportunity...

It irks me a bit that scoreboarding is not considered "out-of-order execution" in modern classification. If I have a long-latency memory read followed by an independent short-latency instruction, the second instruction will execute before the first has finished in a processor with dynamic scheduling via scoreboarding, but this doesn't "count" as OoO. I mostly get it; it just bothers me.

[+] em3rgent0rdr|7 years ago|reply
Scoreboarding to me represents an in-between, because:

1. They stall on the first RAW conflict.

2. They initiate execution in order, although they may complete execution out of order.

I wish there were better nomenclature so people don't get confused, because it clearly doesn't fit into the dichotomy of in-order vs. out-of-order execution.
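The in-order-issue / out-of-order-completion behavior can be sketched with a toy timing model (this is not the actual CDC-style scoreboard algorithm; register names and latencies are made up, and structural/WAW hazards are ignored):

```python
# Toy model: issue strictly in program order, stalling on RAW hazards,
# but let instructions complete whenever their latency elapses.
def scoreboard(program):
    ready_at = {}       # register -> cycle its value becomes available
    completions = []    # (finish_cycle, dest)
    cycle = 0
    for dest, srcs, latency in program:
        # Stall issue until all source operands are ready (RAW hazard).
        cycle = max([cycle] + [ready_at.get(s, 0) for s in srcs])
        ready_at[dest] = cycle + latency
        completions.append((cycle + latency, dest))
        cycle += 1      # one issue per cycle, strictly in order
    return sorted(completions)   # completion order

prog = [
    ("r1", ["mem"], 10),  # long-latency load
    ("r2", ["r3"],  1),   # independent short op: finishes first
    ("r4", ["r1"],  1),   # RAW on r1: issue stalls until the load is done
]
order = [dest for _, dest in scoreboard(prog)]  # r2 completes before r1
```

Running this, the independent short instruction (r2) completes before the load (r1) it was issued after, which is exactly the behavior the parent comments argue sits between in-order and out-of-order execution.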

[+] pkaye|7 years ago|reply
Modern Processor Design by Shen is a great book if you want to read more on this stuff.
[+] nudgeee|7 years ago|reply
Great summary of (recent) modern computer architecture. Fun exercise: try to spot how Spectre-style attacks surface as a result.
[+] M_Bakhtiari|7 years ago|reply
> While the original Pentium, a superscalar x86, was an amazing piece of engineering, it was clear the big problem was the complex and messy x86 instruction set. Complex addressing modes and a minimal number of registers meant few instructions could be executed in parallel due to potential dependencies. For the x86 camp to compete with the RISC architectures, they needed to find a way to "get around" the x86 instruction set.

I've always struggled to understand why they didn't simply retire the x86 instruction set by the early 90s.

The best reason I've been given is the existing body of x86 software, but that's obviously nonsense, as demonstrated by the Transmeta Crusoe and Apple's moves from 68k to PPC to x86.