hawflakes | 2 years ago
> Due to not having register renaming, VLIW architectures conventionally have a large register file (128 registers in the case of the Itanium). This slows down context switches, further reducing performance. Out-of-order CPUs can cheat by having a comparably small programmer-visible state, with most of the state hidden in the bowels of the processor and consequently not in need of saving or restoring.
Itanium borrowed register windows from SPARC. It was effectively a hardware stack with a minimum of 128 physical registers, referenced in instructions through 7-bit fields (128 architectural registers, 96 of them stacked, iirc). So you could make a function call and the stack would push, and a return would pop. Just like SPARC, except the windows weren't fixed-sized.
That said, the penalty for spilling the RSE (they called this part the Register Stack Engine) for, say, an OS context switch was quite heavy, since you'd have to write the whole RSE state to memory.
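A toy sketch of the idea, with invented sizes (the real RSE tracks dirty/clean partitions and spills individual registers, not whole frames): calls allocate variable-sized frames, the oldest frames spill to a memory backing store when the physical file fills up, and a context switch has to write out everything still resident.

```python
# Toy model of a variable-sized register-window stack (RSE-style).
# Frame sizes and the physical file size here are illustrative only.

class RegisterStack:
    def __init__(self, physical=128):
        self.physical = physical    # size of the physical register file
        self.frames = []            # live frames (list of frame sizes)
        self.backing = []           # frames spilled to memory, oldest first
        self.spills = 0             # how many frames have hit memory

    def in_use(self):
        return sum(self.frames)

    def call(self, frame_size):
        # A call pushes a new frame; spill oldest frames until it fits.
        while self.in_use() + frame_size > self.physical:
            self.backing.append(self.frames.pop(0))
            self.spills += 1
        self.frames.append(frame_size)

    def ret(self):
        # A return pops the current frame and refills a spilled one if room.
        self.frames.pop()
        if self.backing and self.in_use() + self.backing[-1] <= self.physical:
            self.frames.insert(0, self.backing.pop())

    def context_switch_cost(self):
        # On a context switch, everything still in the physical file must
        # be written to memory -- the heavy part described above.
        return self.in_use()
```

Deep call chains are what trigger spills: with a 16-register file and 8-register frames, the third call already forces the oldest frame out to memory.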
It was pretty cool reading about this stuff as a new grad.
> Another engineering issue was that x86 simulation on the Itanium performed quite poorly, giving existing customers no incentive to switch.
As I mentioned in my previous comment, Merced had a tiny corner of the chip devoted to the IVE (Intel Value Engine), a very simple 32-bit x86 core intended mainly for booting the system. The intent (and the docs had sample code for this) was to boot, do some setup of system state, and then jump into IA64 mode, where you would actually get a fast system.
I think they did devote more silicon to x86 support, but I had already served my very short time at HP, and Merced still took 2+ years to tape out.
clausecker | 2 years ago
Thanks, that makes sense. I did not understand the intent of the stop bits correctly. However, it still seems like the design wouldn't scale super well: if you have fewer ports, you want to schedule dependent instructions on the critical path as early as possible, even if other independent (but not latency-critical) instructions could be scheduled earlier, incurring extra stop bits. So while some degree of performance portability is designed into the hardware, the compiler may have a hard time generating code that is scheduled well on both 3-port and possible future 6-port machines.
This reminds me of macro-fusion, where there's a similar contradiction: macro-fusion only triggers if the fusable instructions are issued back to back. But when optimising for a multi-issue in-order design, you usually want to interleave dependency chains (i.e. not issue dependent instructions back to back) such that all the pipelines are kept busy. So unless the pairs that fuse are the same on all of them, it's very hard to generate code that performs well on a variety of microarchitectures.
hawflakes | 2 years ago
The other bit no one mentions is that it was an HP-Intel alliance. HP committed to PA-RISC compatibility with a combination of hardware and software whereas Intel just expected stuff to run.
From the instruction reference guide:

```
Binary compatibility between PA-RISC and IA-64 is handled through dynamic
object code translation. This process is very efficient because there is
such a high degree of correspondence between PA-RISC and IA-64
instructions. HP's performance studies show that on average the dynamic
translator only spends 1-2% of its time in translation, with 98-99% of
the time spent executing native code. The dynamic translator actually
performs optimizations on the translated code to take advantage of
IA-64's wider instructions, and performance features such as predication,
speculation and large register sets.
```
There was some hardware support for 32-bit userspace binaries. See the addp4 instruction.
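A sketch of what addp4 does, as I understand it from the architecture manual (treat the bit placement here as my recollection, not gospel): the 32-bit sum becomes the low half of the result, the upper bits are cleared, and bits 62:61 are copied from bits 31:30 of the second source so that a 32-bit pointer lands in the right 64-bit virtual region.

```python
# Model of IA-64's addp4 ("add pointer") instruction, r1 = addp4 r2, r3.
# Bit positions are from my reading of the manual and may be imprecise.

def addp4(r2, r3):
    s = (r2 + r3) & 0xFFFFFFFF      # 32-bit add, result in low 32 bits
    region = (r3 >> 30) & 0x3       # top two bits of the 32-bit base pointer
    return (region << 61) | s       # bits 60:32 and bit 63 forced to zero
```

For example, adding an offset to a base in the top quarter of the 32-bit space (base bits 31:30 = 0b11) places the result in virtual region 3, i.e. bits 62:61 = 0b11.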
Findecanor | 2 years ago
I've read that the original intention was for the RSE to save its state in the background during spare bus cycles, which would have reduced the amount of data left to save when a context switch happened.
Supposedly, this was not implemented in early models of the Itanium. Was it ever?