top | item 47025382


solarexplorer | 15 days ago

What you describe sounds counter-intuitive. And the paper you cite seems to suggest an ISA extension to increase the number of architected (!) registers. That is something very different. It makes the most sense in VLIW architectures, like the ones described in the paper. Architectures like x86 do hardware register renaming (or similar techniques; there are several) to exploit as much instruction-level parallelism as possible. That is why I find your claim hard to believe. VLIW architectures traditionally provide huge register sets and make less use of transparent register renaming: that part is either explicit in the ISA or left entirely to the compiler. These are very different animals than our good old x86...

noelwelsh | 15 days ago

I'm not sure we're talking about the same paper. Here's the one I'm referring to:

https://smlnj.org/compiler-notes/k32.ps

E.g. "Our strategy is to pre-allocate a small set of memory locations that will be treated as registers and managed by the register allocator."

There are more recent publications on "compiler controlled memory" that mostly seem to focus on GPUs and embedded devices.

BeeOnRope | 13 days ago

Relevant section:

> Compiler controlled memory: There is a mechanism in the processor where frequently accessed memory locations can be as fast as registers. In Figure 2, if the address of u is the same as x, then the last load μ-op is a nop. The internal value in register r25 is forwarded to register r28, by a process called write-buffer feedforwarding. That is to say, provided the store is pending or the value to be stored is in the write-buffer, then loading from a memory location is as fast as accessing external registers.

I think it oversells the benefit. Store forwarding is a thing, but it does not erase the cost of the load or store, certainly not on the last ~20 years of chips, and I don't think on the PII (the target of the paper) either.

The load and store still effectively occur in terms of port usage, so the usual throughput limits, etc., apply. There is a latency benefit of a few cycles. Perhaps the L1 cache access itself is also omitted, which could help with bank conflicts, though on later uarches there were few to none of those, so you're left with perhaps a small power benefit.