Decoding one x86 instruction per cycle is easy. That was solved about 40 years ago.

The problem is that a superscalar CPU needs to decode multiple x86 instructions per cycle. I think the latest Intel big-core pipeline can do (IIRC) 6 instructions per cycle, so to keep the pipeline full the decoder MUST be able to decode 6 per cycle too.

On ARM, decoding multiple instructions per cycle is easy. The M1 does (IIRC) 8 per cycle easily, because the instruction length is fixed: the first decoder starts at PC, the second starts at PC+4, and so on. But x86 instructions are variable length, so after the first decoder decodes the instruction at IP, where does the second decoder start decoding? It can't know until the length of the first instruction is known.
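To make the dependency concrete, here is a minimal sketch in Python. It is a toy model, not real ARM or x86 encoding: `toy_length` is a made-up rule (low 4 bits of the first byte give the instruction length), and the function names are hypothetical. The point is only that fixed-length start addresses can each be computed independently, while variable-length start addresses form a serial chain.

```python
# Toy model only: this is NOT real x86 or ARM instruction encoding.
FIXED_LEN = 4  # bytes per instruction in the fixed-length case


def fixed_starts(pc, n):
    """Decoder i starts at pc + 4*i; no decoder waits on another."""
    return [pc + FIXED_LEN * i for i in range(n)]


def toy_length(code, offset):
    """Hypothetical encoding: low 4 bits of the first byte give the length."""
    return (code[offset] & 0x0F) or 1  # treat 0 as length 1


def variable_starts(code, ip, n):
    """Each start depends on the lengths of all earlier instructions."""
    starts = []
    offset = ip
    for _ in range(n):
        if offset >= len(code):
            break
        starts.append(offset)
        offset += toy_length(code, offset)  # serial dependency
    return starts


if __name__ == "__main__":
    print(fixed_starts(0x1000, 4))
    code = bytes([0x03, 0, 0, 0x01, 0x05, 0, 0, 0, 0, 0x02, 0])
    print(variable_starts(code, 0, 4))
```

In `fixed_starts` every element could be computed by a separate decoder in the same cycle. In `variable_starts` the loop cannot be split up that way without extra tricks, which is what the rest of the thread is about.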


kijiki|1 year ago

It isn't quite that bad. The decoders write stop bits back into the L1 instruction cache (L1I) to demarcate where the instructions end. Since those bits aren't indexed in the cache and don't affect associativity, they don't really cost much: a handful of 6T SRAM cells per cache line.
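A rough sketch of the stop-bit idea, in Python and under loud assumptions: the length rule `toy_length` is invented (it is not x86), one bit is kept per byte of a cache line, and instructions that spill past the end of the line are simply left unmarked. Real hardware differs in the details; this only shows why a second pass over the same line no longer needs length decoding.

```python
# Toy model only: the length rule below is hypothetical, not x86.
def toy_length(line, offset):
    """Hypothetical encoding: low 4 bits of the first byte give the length."""
    return (line[offset] & 0x0F) or 1


def predecode_stop_bits(line):
    """Slow path, done once: mark the last byte of each instruction."""
    stop = [False] * len(line)
    offset = 0
    while offset < len(line):
        end = offset + toy_length(line, offset) - 1
        if end >= len(line):
            break  # instruction crosses the line boundary; left unmarked here
        stop[end] = True
        offset = end + 1
    return stop


def starts_from_stop_bits(stop):
    """Fast path: byte 0 starts an instruction, and so does every byte
    that directly follows a stop bit. Each position only inspects its
    predecessor's bit, so all positions can be checked independently."""
    return [0] + [i + 1 for i, s in enumerate(stop[:-1]) if s]


if __name__ == "__main__":
    line = bytes([0x03, 0, 0, 0x01, 0x05, 0, 0, 0, 0, 0x02, 0])
    bits = predecode_stop_bits(line)
    print(starts_from_stop_bits(bits))
```

The sketch assumes the line begins on an instruction boundary; a fetch that enters mid-line would take its first start from the branch target instead.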

jart|1 year ago

I would have assumed it just decodes the x86 into a 32-bit ARM-like internal ISA, similar to how a JIT works in software. x86 decoding is extremely costly in software if you build an interpreter. Probably around 30%, and that's assuming you have a cache. But with JIT code morphing in Blink, decoding cost drops to essentially nothing. As best as I understand it, all x86 microprocessors since the NexGen Nx586 have worked this way too. Once you're code morphing the frontend user-facing ISA, a much bigger problem rears its ugly head, which is the 4096-byte page size. That's something Apple really harped on with their M1 design, which increased it to 16 KB. It matters since morphed code can't be connected across page boundaries.
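To illustrate the page-size point, a minimal Python sketch. It assumes, as stated above, that a translator only links two morphed blocks directly when both guest addresses sit on the same page; `can_chain` is a hypothetical name, not an API from Blink or any other project.

```python
def can_chain(src, dst, page_size):
    """True if both addresses fall on the same page, so (under the
    assumption above) the two translated blocks may be linked directly."""
    return src // page_size == dst // page_size


if __name__ == "__main__":
    src, dst = 0x0FF0, 0x1010  # a short jump of 0x20 bytes
    print(can_chain(src, dst, 4096))   # 4096-byte pages
    print(can_chain(src, dst, 16384))  # 16 KB pages
```

With 4096-byte pages the two addresses land on different pages even though they are only 0x20 bytes apart; with 16 KB pages they share a page, so fewer jumps hit the boundary.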