avianes | 3 years ago

> Um... wat? No CPU tries to decode 99 bytes of memory in a cycle

Actually, no x86 processor decodes 8 instructions in parallel. This example illustrates how the number of possible start offsets grows when each instruction can have any of 15 lengths (1 to 15 bytes).
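To make the scaling concrete, here is a small Python sketch (the function names are illustrative, not taken from any real decoder) that counts where the k-th instruction of a fetch window could begin if each instruction may be anywhere from 1 to 15 bytes long. It ignores every other encoding constraint and only counts possibilities. For the 8th instruction (k = 7) it gives 99 candidate offsets (bytes 7 through 105), which appears to be where the "99" in the quoted comment comes from, and 15^7 possible length combinations for the instructions before it.

```python
# Sketch: where can the k-th instruction of a fetch window start if every
# instruction may be 1 to 15 bytes long? Illustrative counting only; a real
# decoder does not enumerate these sets.
MIN_LEN, MAX_LEN = 1, 15


def possible_starts(k: int) -> set[int]:
    """Byte offsets at which instruction k (0-based) might begin."""
    starts = {0}  # instruction 0 always begins at offset 0
    for _ in range(k):
        # Each preceding instruction shifts the start by 1..15 bytes.
        starts = {s + n for s in starts for n in range(MIN_LEN, MAX_LEN + 1)}
    return starts


def length_sequences(k: int) -> int:
    """Number of distinct length combinations for the first k instructions."""
    return (MAX_LEN - MIN_LEN + 1) ** k


for k in range(8):
    s = possible_starts(k)
    # k, earliest start, latest start, number of candidate offsets, combinations
    print(k, min(s), max(s), len(s), length_sequences(k))
# last line printed: 7 7 105 99 170859375
```

The number of candidate offsets grows linearly (14k + 1), but which one is correct depends on all k preceding lengths, and that dependency is what makes wide parallel decode hard.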

> So you decode 32 instructions starting at each byte you've fetched

No, you don't do that; it consumes too much power.

> But the combinatorics you're citing seem ridiculous, I don't understand that at all.

What I'm trying to explain is that decoding 8 x86 instructions in parallel is hardly feasible, while decoding 8 (or more) instructions per cycle from a RISC architecture is never a problem.
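A minimal sketch of the difference, assuming a 4-byte fixed-width RISC encoding and a made-up variable-length encoding. `toy_length` and the other names here are hypothetical; the toy rule is not how x86 lengths are actually determined.

```python
from typing import Callable


def nth_insn_fixed(code: bytes, i: int, width: int = 4) -> bytes:
    # Fixed-width encoding: instruction i starts at i * width, known without
    # looking at any other instruction, so N decoders can each grab their slot.
    return code[i * width:(i + 1) * width]


def toy_length(code: bytes, pos: int) -> int:
    # HYPOTHETICAL toy variable-length rule (not real x86): the first byte
    # alone determines a length of 1..15 bytes.
    return code[pos] % 15 + 1


def nth_insn_variable(code: bytes, i: int,
                      length_at: Callable[[bytes, int], int]) -> bytes:
    # Variable-length encoding: to find instruction i we must first resolve
    # the lengths of instructions 0..i-1, one after another (a serial chain).
    pos = 0
    for _ in range(i):
        pos += length_at(code, pos)
    return code[pos:pos + length_at(code, pos)]


stream = bytes([2, 9, 9, 0, 4, 9, 9, 9, 9, 9])
print(list(nth_insn_fixed(bytes(range(16)), 2)))       # [8, 9, 10, 11]
print(list(nth_insn_variable(stream, 2, toy_length)))  # [4, 9, 9, 9, 9]
```

With the fixed encoding every decoder slot is independent; with the variable one, the third instruction cannot be located until the lengths of the first two are known.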

ajross | 3 years ago

> No you don't do that, it's too power consuming.

Uh... yes you do? How else do you think it works? I'm not saying there's no opportunity for optimization (e.g. you only do this for main memory fetches and not uOp execution, pipeline it such that the full decode only happens a stage after length decisions, etc...), I'm saying that it isn't remotely an intractable power problem. Just draw it out: check the gates required for a 64->128 Dadda multiplier or 256 bit SIMD operation and compare with what you'd need here. It's noise.
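Here is a minimal sketch of the "decode at every fetched byte, then pick the real boundaries" idea, under a simplifying assumption: `toy_length` is a hypothetical 1 to 15 byte length rule, not real x86 decoding, and the two Python stages only stand in for what would be parallel hardware. It also counts how many of the speculative results end up being used, which is what the power argument in this thread is about.

```python
from typing import Callable


def toy_length(code: bytes, pos: int) -> int:
    # HYPOTHETICAL toy encoding (not real x86): length 1..15 from the first byte.
    return code[pos] % 15 + 1


def decode_every_offset(code: bytes,
                        length_at: Callable[[bytes, int], int]) -> list[int]:
    # Stage 1: treat every byte as a potential instruction start. Each entry is
    # independent of the others, so in hardware these could run in parallel.
    return [length_at(code, pos) for pos in range(len(code))]


def pick_real_starts(lengths: list[int]) -> list[int]:
    # Stage 2: follow the chain of true boundaries through the precomputed
    # table. Still serial, but each step is a lookup rather than a decode.
    starts, pos = [], 0
    while pos < len(lengths):
        starts.append(pos)
        pos += lengths[pos]
    return starts


code = bytes([2, 9, 9, 0, 4, 9, 9, 9, 9, 9])
lengths = decode_every_offset(code, toy_length)
starts = pick_real_starts(lengths)
print(starts)                                                  # [0, 3, 4, 9]
print(f"{len(starts)} of {len(lengths)} speculative decodes used")  # 4 of 10
```

The trade-off is visible even in this toy: the serial chain shrinks to cheap lookups, but most of the per-offset work is thrown away.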

And your citation of "8 instructions in parallel" seems suspicious. Did I just get trolled into an Apple vs. x86 flame war?

avianes | 3 years ago

> Uh... yes you do? How else do you think it works?

No, I literally explained it in my first answer. The part about "1590 decoders" is irrelevant, since I misunderstood your message (I thought you were talking about using 16 decoders to decode the 16 instruction lengths of a single instruction).

But the rest, about instruction-length decoding, is how it's actually done.

> I'm saying that it isn't remotely an intractable power problem.

I mean, obviously, if you ignore all the power-consumption issues of running 32 decoders in parallel and using only 5 of their 32 results, then yes, there's no problem.

But in reality, yes, decoding many x86 instructions in parallel is a problem.

> Just draw it out: check the gates required for a 64->128 Dadda multiplier or 256 bit SIMD operation and compare with what you'd need here. It's noise.

Yes, multipliers consume a lot of energy, but I don't see how that is an argument for building an inefficient decoder. Also, a multiplier's power consumption depends on transistor switching activity, and you can expect the MSBs of the operands not to change much. In a decoder, the switching activity will be high.
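As a rough illustration of the switching-activity argument: dynamic power scales roughly as α·C·V²·f, where α is how often nodes toggle. The Python sketch below counts per-bit flips between consecutive operand values. The operand stream (a 0..99 loop counter on a 64-bit input) is a hypothetical example, and flips on an input bus are only a crude proxy for activity inside a real multiplier.

```python
def per_bit_toggles(values: list[int], width: int = 64) -> list[int]:
    """Count, for each bit position, how often it flips between consecutive values.

    Crude proxy for switching activity (the alpha in P_dyn ~ alpha*C*V^2*f).
    """
    counts = [0] * width
    for a, b in zip(values, values[1:]):
        diff = a ^ b  # set bits mark positions that changed
        for bit in range(width):
            counts[bit] += (diff >> bit) & 1
    return counts


# HYPOTHETICAL operand stream: a loop counter fed to a 64-bit multiplier input.
operands = list(range(100))
counts = per_bit_toggles(operands)
print(counts[:8])        # [99, 49, 24, 12, 6, 3, 1, 0]
print(sum(counts[7:]))   # 0: bits 7..63 never toggle
```

Only the low bits switch here; the upper 57 bits stay quiet, which is the "MSBs don't change much" effect described above.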

> And your citation of "8 instructions in parallel" seems suspicious. Did I just get trolled into an Apple vs. x86 flame war?

Not a troll, nor a flame war. I don't use Apple products, mainly because I disagree with Apple's practices. But choosing a RISC ISA does let them decode a lot of instructions in parallel with little energy and complexity.

I chose 8 because it is the widest decode that mainstream processors currently offer. You might argue that 8 RISC instructions are not comparable with 8 CISC instructions, but even with, say, 4 CISC instructions, the decoder will still consume more energy.