top | item 10247074

jensnockert | 10 years ago

Well… there are other issues than just making code smaller. The idea behind a fixed-length instruction size is to make the decoder simpler.

Decoding enough instructions to feed a wide-issue machine is really hard on x86 and can require loads of power due to the ISA, while with fixed or semi-fixed size instructions (like Thumb) it is much easier.
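A rough sketch of why that is, in Python. With a fixed width every instruction start is known up front, so a wide decoder can look at all of them at once. With variable lengths, each start depends on the length of the instruction before it. The toy encoding here (`toy_length`, where the low two bits of the first byte give a length of 1 to 4 bytes) is made up for illustration and is not x86 or Thumb.

```python
def fixed_boundaries(code_len, width=4):
    # Every start offset is i * width; none depends on an earlier instruction,
    # so all of them could be computed (and decoded) in parallel.
    return list(range(0, code_len, width))


def variable_boundaries(code, length_of):
    # Each start is only known once the previous instruction's length has
    # been determined, so this scan is inherently sequential.
    starts = []
    pc = 0
    while pc < len(code):
        starts.append(pc)
        pc += length_of(code, pc)
    return starts


def toy_length(code, pc):
    # Hypothetical encoding: low two bits of the first byte encode length 1..4.
    return (code[pc] & 0b11) + 1


code = bytes([0x00, 0x01, 0xAA, 0x03, 0x10, 0x20, 0x30, 0x02, 0x11, 0x22])
print(fixed_boundaries(len(code)))
print(variable_boundaries(code, toy_length))
```

Real x86 decoders work around the serial dependency with tricks such as speculatively decoding at many byte offsets, which is roughly where the extra power goes.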

You can design ISAs that are made for wide issue, cheap decoding, and compact encoding at the same time, but unfortunately that requires asking questions that just weren't on the table for the MIPS/ARM/x86 designers. Out-of-order superscalar processors just hadn't been invented yet.

userbinator|10 years ago

The problem is that a simpler decoder doesn't compensate for the extra instruction cache needed to achieve the same hit rates/levels of performance, and that is bad for power efficiency since L1 cache needs to run at full core speed and in modern CPUs there's vastly more transistor area in the cache than the decoder. The increased memory traffic from lower hit rates also doesn't help. This article shows that effect quite clearly:

http://www.extremetech.com/extreme/188396-the-final-isa-show...

The x86s have 32K of L1 icache, the ARMs 32K or 16K, and the MIPS Loongson has 64K. Also, the Loongson does not support MIPS16 whereas the ARMs all support Thumb. If you look at the total energy consumed, the MIPS is noticeably worse than x86 or ARM:

http://www.extremetech.com/wp-content/uploads/2014/08/Averag...

In fact, the cache takes so much power that Intel engineers have found it profitable to turn off parts of the cache when in low-power modes; this feature is called Dynamic Cache Sizing and appears in the later Atom series.

adwn|10 years ago

> that is bad for power efficiency since L1 cache needs to run at full core speed and in modern CPUs there's vastly more transistor area in the cache than the decoder

It's not that simple. Dynamic power depends on the toggle rate of the flip-flops and the electrical capacitance of the fan-out wires and gates, not on the number of transistors. In a cache, very few storage elements change their state in every cycle, while the decoder performs a lot of work in every cycle.
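To put the point above in formula form: the usual first-order model for dynamic power is P = alpha * C * V^2 * f, where alpha is the activity (toggle) factor and C the switched capacitance. A small Python sketch follows; every number in it is invented purely for illustration and is not a measurement of any real cache or decoder.

```python
def dynamic_power(alpha, capacitance_f, vdd, freq_hz):
    # First-order CMOS dynamic power: activity factor * switched capacitance
    # * supply voltage squared * clock frequency. Result is in watts.
    return alpha * capacitance_f * vdd ** 2 * freq_hz


# Illustrative, made-up figures: a big structure that rarely toggles (cache)
# versus a small one that toggles a lot every cycle (decoder).
cache_w = dynamic_power(alpha=0.01, capacitance_f=10e-9, vdd=1.0, freq_hz=2e9)
decoder_w = dynamic_power(alpha=0.3, capacitance_f=1e-9, vdd=1.0, freq_hz=2e9)

print(f"cache:   {cache_w:.2f} W")
print(f"decoder: {decoder_w:.2f} W")
```

With these assumed inputs the smaller block burns more dynamic power despite having a tenth of the capacitance, which is the shape of the argument. Leakage is a separate term and does scale with transistor count.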

jensnockert|10 years ago

It's even more complicated than that, since the caches don't have to hold encoded instructions; they can store decoded instructions instead, and a few of the caches on a modern x86 CPU actually do that. For example, there's a loop cache after the decoders, so that small loops never have to be decoded more than once.
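A minimal Python sketch of the decode-once idea, assuming a cache keyed by instruction address. `DecodedOpCache` and `toy_decode` are hypothetical names for illustration; real hardware structures are organised quite differently (fixed capacity, eviction, tagging by fetch window and so on).

```python
class DecodedOpCache:
    """Keeps decoded instructions so a hot loop skips the decoder."""

    def __init__(self, decode):
        self._decode = decode
        self._ops = {}          # address -> decoded form
        self.decode_calls = 0   # how often the (expensive) decoder ran

    def fetch(self, pc, raw):
        # Only run the decoder on a miss; hits reuse the stored result.
        if pc not in self._ops:
            self.decode_calls += 1
            self._ops[pc] = self._decode(raw)
        return self._ops[pc]


def toy_decode(raw):
    # Stand-in for a real decoder; just wraps the raw encoding.
    return ("op", raw)


loop_body = {0: 0x10, 4: 0x20, 8: 0x30}  # address -> raw encoding
cache = DecodedOpCache(toy_decode)
for _ in range(100):                      # run the loop 100 times
    for pc, raw in loop_body.items():
        cache.fetch(pc, raw)

print(cache.decode_calls)
```

The loop executes 300 instructions in total, but the decoder only runs once per distinct address.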

pcwalton|10 years ago

> The problem is that a simpler decoder doesn't compensate for the extra instruction cache needed to achieve the same hit rates/levels of performance

Except this isn't true for x86-64, because x86-64 instructions are just as large as ARM instructions in practice.

rdc12|10 years ago

And the MIPS is based on a 90nm process vs the 32nm of the Sandy Bridge they tested. While that is relevant to what you can buy, it says nothing about the intrinsic properties of the design.

Intel has had a massive advantage in fabrication for a long time.