
Which Machines Do Computer Architects Admire? (2013)

174 points | dhotson | 6 years ago | people.cs.clemson.edu

156 comments

[+] xscott|6 years ago|reply
I'm not a computer architect (so my opinion shouldn't count in this thread), but as someone who did a lot of numerical programming over the years, I really thought Itanium looked super promising. The idea that you can indicate a whole ton of instructions can be run in parallel seemed really scalable for FFTs and linear algebra. Instead of more cores, give me more ALUs. I know "most" software doesn't have enough work between branches to fill up that kind of pipeline, but machine learning and signal processing can certainly use long branchless basic blocks if you can fit them in icache.
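A toy sketch (my own code, not xscott's) of the kind of branchless, ALU-hungry inner loop the comment has in mind: four independent multiplies per iteration that an EPIC compiler could, in principle, schedule side by side on separate units.

```c
#include <stddef.h>

/* Hypothetical example (names are mine): a 4-tap FIR-style inner step
 * written as one long branchless basic block.  The four multiplies are
 * independent of each other, so a compiler for an "explicitly parallel"
 * machine could in principle issue them in the same cycle. */
void fir4(const float *x, const float *h, float *y, size_t n) {
    for (size_t i = 0; i + 4 <= n; i++) {
        /* no branches between these four independent multiplies */
        float t0 = x[i + 0] * h[0];
        float t1 = x[i + 1] * h[1];
        float t2 = x[i + 2] * h[2];
        float t3 = x[i + 3] * h[3];
        y[i] = (t0 + t1) + (t2 + t3);  /* balanced reduction tree */
    }
}
```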

At the time, it seemed (to me at least) that it really only died because the backwards compatibility mode was slow. (I think some of the current perception of Itanium is revisionist history.) It's tough to say what it could've become if AMD64 hadn't eaten its lunch by running precompiled software better. It would've been interesting if Intel and compiler writers could've kept focus on it.

Nowadays, it's obvious GPUs are the winners for horsepower, and it's telling that we're willing to use new languages and strategies to get that win. However, GPU programming really feels like you're locked outside of the box - you shuffle the data back and forth to it. I like to imagine a C-like language (analogous to CUDA) that would pump a lot of instructions to the "Explicitly Parallel" architecture.

Now we're all stuck with the AMD64 ISA for our compatibility processor, and it seems like another example where the computing world isn't as good as it should be.

[+] htfy96|6 years ago|reply
There's no free parallelism™️ though.

> Author: AgnerDate: 2015-12-28 01:46

> Ethan wrote:

> > Agner, what's your opinion on the Itanium instruction set in isolation, assuming a compiler is written and backwards compatibility do not matter?

> The advantage of the Itanium instruction set was of course that decoding was easy. The biggest problem with the Itanium instruction set was indeed that it was almost impossible to write a good compiler for it. It is quite inflexible because the compiler always has to schedule instructions 3 at a time, whether this fits the actual amount of parallelism in the code or not. Branching is messy when all instructions are organized into triplets. The instruction size is fixed at 41 bits and 5 bits are wasted on a template. If you need more bits and make an 82 bit instruction then it has to be paired with a 41 bit instruction.

(https://www.agner.org/optimize/blog/read.php?i=425)

Besides, the memory consistency model of Itanium is also a brain teaser used in interviews as a counterexample to poorly-synchronized solutions.

[+] jcranmer|6 years ago|reply
Itanium is essentially a VLIW architecture and... well, as the bottom of the page mentions, VLIW architectures tend to turn out to be bad ideas in practice.

GPUs showed two things: one, you can relegate kernels to accelerators instead of having to maximize performance in the CPU core; and two, you can convince people to rewrite their code, if the gains are sufficiently compelling.

[+] titzer|6 years ago|reply
15 years ago I thought Itanium was the coolest thing ever. As a compilers student, a software-scheduled superscalar processor was kind of like a wet dream. The only problem is that the dream never materialized, due to a number of reasons.

First, compilers just could never seem to find enough (static) ILP in programs to fill up all the instructions in a VLIW bundle. Integer and pointer-chasing programs are just too full of branches, and loops can't be unrolled enough before register pressure kills you (which, btw, is why Itanium had a stupidly huge register file).
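To illustrate the register-pressure point, here's a toy reduction (my own example) unrolled four ways: each extra accumulator is another live register, and that pressure is what eventually forces spills, one reason Itanium shipped 128 general registers.

```c
/* Sketch: unrolling a reduction to expose static ILP.  The four
 * accumulators s0..s3 are independent chains, but each one is a live
 * register for the whole loop; unroll further and you run out. */
long sum_unrolled4(const long *a, long n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;  /* 4 live accumulators */
    long i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i + 0];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)
        s0 += a[i];                        /* scalar tail */
    return (s0 + s1) + (s2 + s3);
}
```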

Second, it exposes microarchitectural details that can (and maybe should) change quite rapidly. The width of the VLIW is baked into the ISA. Processors these days have 8 or even 10 execution ports; no way one could even have space for that many instructions in a bundle.

Third, all those wasted slots in VLIW words and huge 7-bit register indices take up a lot of space in instruction encodings. That means I-cache problems, fetch bandwidth problems, etc. Fetch bandwidth is one of the big bottlenecks these days, which is why processors now have big u-op caches and loop stream detectors.

Fourth, there are just too many dynamic data dependencies through memory and too many cache misses to statically schedule code. Code in VLIW is scheduled for the best case, which means a cache miss completely stalls out the carefully constructed static schedule. So the processor fundamentally needs to go out of order to find some work (from the future) to do right now, otherwise all those execution units are idle. If you are going out of order with a huge number of execution ports, there is almost no point in bothering with static instruction scheduling at all. (E.g. our advice from Intel in deploying instruction scheduling for TurboFan was to not bother for big cores--it only makes sense on Core and Atom that don't have (as) fancy OOO engines).

There is one exception though, and that is floating point code. There, kernels are so much different from integer/pointer programs that one can do lots of tricks from SIMD to vectors to lots of loop transforms. The code is dense with operations and far easier to parallelize. The Itanium was a real superstar for floating point performance. But even there I think a lot of the scheduling was done by hand with hand-written assembly.

[+] nwallin|6 years ago|reply
> Instead of more cores, give me more ALUs.

It kinda didn't work that way though.

In practice, all of your ALUs, including your extra ones, were waiting on cache fetches or latencies from previous ALU instructions.
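A minimal illustration (my own toy code) of why the extra ALUs sit idle: in a pointer-chasing loop, every iteration waits on the previous load, so no static schedule can fill the spare units with useful work.

```c
/* A serial dependency chain: each iteration's load of p->next must
 * complete before the next iteration can even begin, so the loop runs
 * at memory latency no matter how many ALUs the machine has. */
struct node { long val; struct node *next; };

long chain_sum(const struct node *p) {
    long s = 0;
    while (p) {
        s += p->val;
        p = p->next;  /* pointer chase: serialized on memory */
    }
    return s;
}
```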

Modern x86 CPUs have 2-4 ALUs which are dispatched to in parallel 4-5 instructions wide, and these dispatches are aware of cache fetches and previous latencies in real time. VLIW can't compete here.

VLIW made sense when main memory was as fast as CPU and all instructions shared the same latency. History hasn't been kind to these assumptions. I doubt we'll see another VLIW arch anytime soon.

I accept the idea that x86 is a local minimum, but it's a deep, wide one. Itanium or other VLIW architectures like it were never deep enough to disrupt it.

[+] acomjean|6 years ago|reply
I worked on a project with machines that used the PA-RISC CPUs. The importance of optimized compilers (and math libraries) can't be overstated; they made those machines really shine. My understanding was the Itanium (which basically replaced PA-RISC in HP's Unix machine lineup) never got the compiler support to realize the architecture's strengths, so everyone looked to the safer bet in 64-bit computing.

It's hard to compete with the scale of x86. Like software, I feel the industry tends toward one architecture (the more people use the architecture, the better the compilers, the more users ...). Even Apple abandoned PowerPC chips.

[+] yongjik|6 years ago|reply
I don't think x86 compatibility mattered. When it launched in 2001, it was supposed to replace HP's PA-RISC architecture, which is totally different anyway. Sun and its SPARC processors were very much alive, Google was only three years old, and AWS was five years away - the idea that a massive array of cheap x86 processors would outperform enterprise-class servers simply hadn't occurred to most people yet.

Of course, the joke is that cheap x86 processors did outperform Itanium (and every other architecture, eventually).

[+] gnufx|6 years ago|reply
That seems right on such important sorts of computation, disregarding other factors. On the history, actual HPC numbers for Itanium have appeared in the Infamous Annual Martyn Guest Presentation over the years. An example that came to hand from 15 years ago to compare with bald statements is https://www.researchgate.net/profile/Martyn_Guest/publicatio...

Regarding GPUs, Fujitsu may not agree (for Fugaku and spin-offs) depending on the value of "horsepower" relevant for HPC systems, even if an A64FX doesn't have the peak performance of a V100. They have form from the K Computer, and if they basically did it themselves again, there was presumably "co-design" for the hardware and software which may be relevant here; I haven't seen anything written about that, though.

[+] sansnomme|6 years ago|reply
That C like language is called "Verilog". (Yes I know it's a HDL but the point still stands. FPGAs are commodity these days.)
[+] tachyonbeam|6 years ago|reply
I think one of the most influential designs of recent times has been the DEC Alpha lineage of 64-bit RISC processors[1]. Originally introduced in 1992, with a superscalar design, branch prediction, instruction and data caches, register renaming, speculative execution, etc. My understanding is that when these came out, they were way ahead of any other CPU out there, both in terms of innovative design and performance.

Looking at this chip, it seems to me that almost all the innovations Intel brought to the Pentium lines of CPU over many years were basically reimplementing features pioneered by the DEC Alpha, just over a decade later, and bringing these innovations to consumer-grade CPUs.

[1]: https://en.wikipedia.org/wiki/DEC_Alpha

[+] xscott|6 years ago|reply
I loved working on DEC Alphas. They seemed to me like the best of breed conventional 64 bit machines, and it was sad when we quit buying them because x86 boxes were cheaper.

> it seems to me that almost all the innovations Intel brought to the Pentium lines of CPU over many years were basically reimplementing features pioneered by the DEC Alpha

I can't find a strong source to link, but I thought most of the Alpha team ended up at Intel. If so, that would explain the trickling in of re-implementations.

[+] kragen|6 years ago|reply
Superscalar dates to the CDC 6600, a machine I don't understand very well; branch prediction I think dates to Stretch, but the 70s at latest; instruction and data caches were commonplace in high-performance computers by the 1970s, and as the article points out, the 6600 had I$, and the 360/85 had a cache too (not sure if split). I'm not sure about register renaming and speculative execution, but I'd be surprised if they date from as late as the αXP.
[+] shaklee3|6 years ago|reply
Alpha is also unique in that it's the only architecture that doesn't preserve data dependency ordering.
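For the curious, this is the classic publish/consume pattern where Alpha's relaxation bites. The sketch below (the names are mine) uses C11 atomics; because Alpha does not order even data-dependent loads, the reader needs a real barrier, hence the acquire load, even though `p->val` depends on the load of `p`.

```c
#include <stdatomic.h>

struct msg { int val; };

static _Atomic(struct msg *) slot;  /* null until published */

void publish(struct msg *m, int v) {
    m->val = v;
    /* release store: orders the write of m->val before the pointer */
    atomic_store_explicit(&slot, m, memory_order_release);
}

int consume(void) {
    struct msg *p;
    /* acquire, not relaxed: on Alpha a relaxed load could observe the
     * pointer yet still read a stale m->val, despite the dependency */
    while (!(p = atomic_load_explicit(&slot, memory_order_acquire)))
        ;  /* spin until published */
    return p->val;
}
```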
[+] bitminer|6 years ago|reply
Cray built a massively parallel machine out of a bunch of Alphas. (Cray Research Inc, iirc, not Seymour Cray). The T3 I think.

DEC had bragged about the features of the cpu useful for parallelization. Cray engineering complained about the features missing for parallelization. It was all described in a glossy Cray monthly magazine description of the new machine but I've been unable to locate a copy.

I disliked the Alpha floating point, it was always signalling exceptions for underflow. Otherwise a fine set of machines.

[+] kjs3|6 years ago|reply
AMD actually licensed a good bit of Alpha technology.
[+] bloopernova|6 years ago|reply
I remember, way back in the 90s, working on NT 3.5/4 on some DEC Alphas in Sony Broadcast at Oxford, UK. The sysadmin there was a cool dude who I remember was amazed at how insane network speeds were getting. I think those guys at Oxford were responsible for a very nice recording studio mixer that Sony made.

I remember DEC Alphas absolutely stomped all over the x86 stuff that everyone else was using, but the flexibility and price of the commodity PCs was just too attractive. Pity, really.

[+] gnufx|6 years ago|reply
I don't know if it's relevant for computer architects, but one great thing about Alphas (at least the ones I operated) was their relatively huge memory and cache. They were much admired generally by users processing data. An individual crystallography image might fit in cache, and a typical complete dataset in memory -- not that that stopped the maintainer of the main analysis program retaining the disk-based sort bottleneck originating on PDP-11s...
[+] kragen|6 years ago|reply
There are some really great designers on the list, like Sophie Wilson and Gordon Bell, but the list of admirable machines comes up really short — and missing a lot of really significant and admirable machines.

Maybe these are the machines bad computer architects, like Alpert, admire. Alpert is notable mostly for leading the computer industry's most expensive and embarrassing failure, the Itanic (formally known as the Itanium), despite the presence on his team of many of the world's best CPU designers, who had just come from designing the HP-PA, a niche CPU architecture nevertheless so successful that HP's workstation competitors, such as NeXT, started using it. Earlier in his career he sank the already-struggling 32000, the machine that by rights should have been the 68000. (And maybe if they'd funded GCC it could have been.)

What about the Tera MTA, with its massive hardware multithreading and its packet-switched RAM, which was gorgeous and prefigured significant features of the GPU explosion?

What about the DG Nova, with its bitslice ALU chips and horizontal-microcode instructions? What about the MuP21, with its radical on-chip dual circular stacks?

What about the HP 9100, with its dual stacks and PCB-inductance microcode, where the instruction set was the user interface?

What about the LGP-30, which managed to deliver a usable von Neumann computer with only 113 vacuum tubes (for amplification, inversion, and sequencing)?

What about the 26-bit ARM, with its conditional execution on every instruction, and packing the program status register into the program counter so it automatically gets restored by subroutine return, and, more importantly, interrupt return?

What about Thumb-2 with its unequaled code density?

What about the CM-1? Anyone can see that AVX-512 (or for that matter modern timing-attack-resistant AES implementations!) owe everything to the CM-1.

And the conspicuous omission of the Burroughs 5000 has already been noted by others.

I mean, there are some good designs on the list! But it hardly seems like a very comprehensive list of admirable designs.

[+] sitkack|6 years ago|reply
It sounds like they just went around the room and asked some folks to list off some systems. I don't think a terrible amount of thought was put into this.

I'd add the Tandem NonStop to my personal list. I don't know why I overlooked the LGP-30 [1]; I'll have to find a schematic. 113 vacuum tubes is really impressive. I wonder if there is any overlap between this design and System Hyper Pipelining [2]. Do you know of other architectures that use time multiplexing to reduce part count?

What bit serial computers do you like?

Ahh, it is the Story of Mel computer, awesome.

[1] https://en.wikipedia.org/wiki/LGP-30

[2] https://arxiv.org/abs/1508.07139

[+] AceJohnny2|6 years ago|reply
I can think of many ways you could've phrased this without being downright aggressive.
[+] jcranmer|6 years ago|reply
> What about Thumb-2 with its unequaled code density?

It came out in 2003, and most of the people were queried for their opinions in 2001.

[+] kjs3|6 years ago|reply
It was never intended as a comprehensive list. Best if you actually read the article. It's been floating around for more than a decade.

Alpert is a bad architect...funny.

[+] tlb|6 years ago|reply
It's disappointing that most machines today suck so badly. How did that become the state of the industry, with so many smart people working so hard and nobody liking their latest designs?

The last high-performance design I actually liked was the DEC Alpha. You could write a useful JIT compiler in a couple hundred lines.

I suspect that nVidia's recent GPUs are wonderfully clever inside, but they don't publish their ISA and the drivers are super-clunky. So I can't admire them.

I appreciate the performance of intel Core chips, but there's so much to dislike. The ISA is literally too big to fully document. The kernel needs 1000s of workarounds for CPU weirdnesses. You have to apply security patches to microcode, FFS.

RISC-V would be great if we had fast servers and laptops.

[+] kjs3|6 years ago|reply
What's wrong with Power 8 & 9? What's wrong with ARM64? What was wrong with Sparc64 until Oracle screwed it up (well...register windows...ok). How is RISC-V intrinsically better than those architectures, considering it doesn't exist in a form that performs anywhere near as fast?
[+] oddity|6 years ago|reply
> How did that become the state of the industry, with so many smart people working so hard and nobody likes their latest designs?

The average consumer doesn't buy a fantastically well designed CPU if it doesn't run the software they care about. x86, externally, is horrifically ugly primarily because of backwards compatibility (I've legitimately had a nightmare once from writing an x86 JIT compiler). Internally, I'm almost certain it's an incredible feat of engineering. People who admire architecture aren't a powerful market force, I'm sad to say.

[+] agumonkey|6 years ago|reply
Mass market doesn't value admiration
[+] bane|6 years ago|reply
Surprised nobody picked the Atari 400/800 and Amiga 500 computers (which are the 8-bit and 16-bit spiritual parent/child machines by the same people).

On the other end, pure CPU only machines are kind of interesting as a study in economy, like the ZX Spectrum, a horrible, limited architecture that managed to hit the market at an unreasonably cheap price, make money, and end up with tens of thousands of games.

[+] CalChris|6 years ago|reply
Interesting that the B5000 didn't make this list. Berkeley CS252 has been reading the Design of the B5000 System paper for years. The lecture slides don't criticize it but Computer Organization and Design sorta does:

> The Burroughs B5000 was the commercial fountainhead of this philosophy (High-Level-Language Computer Architectures), but today there is no significant commercial descendant of this 1960s radical.

[+] Aloha|6 years ago|reply
I was also surprised - but I wonder if that's the computer architecture language designers like, not computer architects.
[+] oddity|6 years ago|reply
The list seems biased towards pre-2001, so I’ll toss one in: Cell. I hold that it was so ahead of its time, it dragged game devs, kicking and screaming, into the future ahead of schedule when they were forced to support the PS3 for the extended console cycle. :)

Larrabee was cute, but to this day I still have no idea what their target workload was.

[+] kjs3|6 years ago|reply
Yup. Most of this was culled from a 2001 conference (so small but distinguished sample set), and you really need to read the detail to understand what they were appreciative of. It's not a good/bad thing and probably represented what they were thinking about at the time (e.g. Alpert calls out Multiflow because it influenced a processor he built). Sites even includes a backhand at VAX by calling it the example of what they didn't do in Alpha; damning with faint praise.

I haven't fired up my Cell dev board (Mercury) in a while. Prolly should do that. :-)

[+] erosenbe0|6 years ago|reply
The CDC-6000 and Cray-1 designed by Seymour Cray are the most admired, hands down.

It is also notable that quite a bit of R&D was done in Chippewa Falls, WI, which is just a regular old town in America's Dairyland.

[+] gumby|6 years ago|reply
Surprised the PDP-6/10 didn’t make the list as it was the dominant research architecture for a certain period. Another Gordon Bell jewel.
[+] kjs3|6 years ago|reply
Alas...so little respect for the 36-bitters these days. The PDP-10 especially was hugely influential.
[+] PeterStuer|6 years ago|reply
As far as processors are concerned I loved the Zilog Z80 and the Motorola 68000. Oddly enough I really disliked the MOS 6502 and the Intel 8086.

As total systems I loved the HP 41CX, the Sinclair ZX Spectrum, the Symbolics Lisp Machine and the Apple Mac IIcx (or really just any Mac before the PowerPC debacle).

After that era, I just started home-building x86 machines, and while there was the odd preferred component, it never went beyond the 'A is better than B' stage.

[+] Merrill|6 years ago|reply
> Processor design pitfalls - Designing a high-level ISA to support a specific language or language domain

Is there an equivalent pitfall in designing the ISA to support a specific Virtual Machine?

For example, wouldn't the performance of a server processor when running the Java Virtual Machine be a key factor in determining its commercial success? I've always wondered whether the failure of Itanium wasn't at least partly caused by the shift from binary executables to bytecode with the contemporary success of the Java language. Even when JIT compilers were used, they were probably too simple to take advantage of the VLIW architecture.

[+] pdimitar|6 years ago|reply
I don't feel that's the core reason but you do bring up a good point; some technologies are too good for their time and get swept in the history books due to nobody having a clue how to utilise them properly.

Not sure if that's the exact case for Itanium but your argument fired a neuron. :)

[+] mtreis86|6 years ago|reply
The machines I most admire are mechanical computers like the ones used in WWII-era battleships for targeting their long guns. Those machines performed differentiation and curve matching using cams and gears.
[+] rootbear|6 years ago|reply
There is a fascinating series of videos on YouTube that describe the US Navy analog fire control computers. I had no idea such things existed until I came across those videos.
[+] cptnapalm|6 years ago|reply
I have a small IBM 390 about which I haven't been able to find out much, but I did spot while searching that my 1999 S/390 has a 256-byte cache line. That's 4x the 64-byte line of a 2020 i7.
[+] bshanks|6 years ago|reply
The ones listed by 4 or more people (not including Bell) were:

- CDC-6600 and 7600 - listed by Fisher, Sites, Smith, Worley

- Cray-1 - listed by Hill, Patterson, Sites, Smith, Sohi, Wallach (also Bell, sorta)

- IBM S/360 and S/370 - listed by Alpert, Hill, Patterson, Sites (also Bell)

- MIPS - listed by Alpert, Hill, Patterson, Sohi, Worley

Special mention:

- 6502 - only listed by Wilson, but she was the chief architect of ARM so I think her choice is important to note

- Itanium - mentioned in the top-ranked comment in this HN discussion

- DEC Alpha - mentioned in the second-ranked comment in this HN discussion

[+] cameldrv|6 years ago|reply
Pentium Pro should be on the list. The out of order execution, especially with the micro-op translation was a huge breakthrough.
[+] squarefoot|6 years ago|reply
M68k and Z80 IMO deserved to be in that list much more than x86.
[+] ChuckMcM|6 years ago|reply
I was always partial to the DEC-10 architecture. That said my first exposure to a machine that had been really well thought out was the IBM 360.
[+] dillonmckay|6 years ago|reply
https://en.m.wikipedia.org/wiki/VAX-11

32 bit system from the late 1970s.

[+] bitminer|6 years ago|reply
The VAX was, I think, co-designed alongside VMS. The two together were an innovative design, distinguishing architecture from implementation, a comprehensive isa, a roadmap for the future, etc etc. VAXcluster was amazing integration of both.

I believe the design was influenced by Dijkstra's Structured Programming book but have no evidence.

My epiphany on the issues with the isa came when I discovered that the checksum calculation used by the VMS backup utility was faster when done in a short instruction loop than with the microcoded instruction. MicroVAX II. Microcode was a huge barrier between the speed potential of the electronics and the actual visible isa. Duh!
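For flavor, a plausible shape (my own guess, not the VMS source) for the kind of short checksum loop described above: a simple running byte sum with carry folding, a handful of fast instructions per byte.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical checksum loop: accumulate bytes in a wide register,
 * then fold any carries above 16 bits back into the low half. */
uint16_t add_checksum(const uint8_t *buf, size_t len) {
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += buf[i];
    while (sum >> 16)                      /* fold carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)sum;
}
```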

Cray knew this, but he didn't build product lines, just single point products. Sun built product lines with RISC and ate Digital Equipment's lunch.