top | item 40669606

ribit | 1 year ago

> Yes, that's my understanding, and that's why I claim it's different from "classical" SIMD

Understood, yes, that makes sense. Of course, other architectures can make other optimizations, such as selecting warps that are more likely to have their data ready, but Nvidia's implementation does sound like a very smart approach.

> And let's say you have 2 warps with complementary masking; with Nvidia's SIMT uarch it naturally issues both warps simultaneously, and they can be executed in the same cycle on different ALUs/cores

That is indeed a powerful technique.
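The idea of co-issuing complementary warps can be sketched with a toy scheduler model. This is purely illustrative (a greedy pairing heuristic, not Nvidia's actual issue logic), but it shows why non-overlapping lane masks let two warps share one set of SIMD lanes in a single cycle:

```python
# Toy model (not Nvidia's actual scheduler): two warps whose active-lane
# masks don't overlap can be co-issued onto one set of SIMD lanes.

WARP_SIZE = 8  # small for illustration; real warps are 32 lanes


def can_coissue(mask_a, mask_b):
    """Two warps can share the lanes in one cycle if no lane is
    claimed by both, i.e. their active masks don't overlap."""
    return (mask_a & mask_b) == 0


def issue(warps):
    """Greedy scheduler: pair up warps with non-overlapping masks.
    Returns the number of issue cycles consumed."""
    cycles = 0
    pending = list(warps)
    while pending:
        mask_a = pending.pop(0)
        # look for a partner warp whose mask doesn't collide
        partner = next((m for m in pending if can_coissue(mask_a, m)), None)
        if partner is not None:
            pending.remove(partner)
        cycles += 1  # one cycle covers the pair (or the lone warp)
    return cycles


# Divergence often yields exactly this pattern: one warp's active lanes
# are the complement of another's.
print(issue([0b01010101, 0b10101010]))  # complementary: co-issued in 1 cycle
print(issue([0b11110000, 0b00111100]))  # overlapping: 2 separate cycles
```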

> It's not obvious what "superscalar" would mean in an SIMT context. For me, a superscalar core is a core that can extract instruction parallelism from sequential code (associated with a single thread) and therefore dispatch/issue/execute more than 1 instruction per cycle per thread.

Yes, I meant executing multiple instructions from the same warp/thread concurrently, depending on the execution granularity, of course. Executing instructions from different warps in the same block is slightly different, since the warps don't need to be at the same execution state. Applying CPU terminology, a warp is more like a "CPU thread". It does seem like Nvidia has indeed moved quite far in the SIMT direction, and their threads/lanes can have independent program state. So I think I can see the validity of your argument that Nvidia can remap SIMD ALUs on the fly to suitable threads in order to achieve high hardware utilization.
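A hypothetical sketch of that remapping idea: if threads carry independent program state, threads from different warps that happen to sit at the same PC could be packed together to fill SIMD groups. Real hardware (e.g. Volta-style independent thread scheduling) is far more constrained than this; the code only illustrates the utilization argument:

```python
from collections import defaultdict

SIMD_WIDTH = 4  # illustrative lane-group width, not a real parameter


def pack_by_pc(threads):
    """threads: list of (thread_id, pc) pairs. Returns SIMD issue
    groups, each a (pc, thread_ids) tuple of threads sharing a PC."""
    by_pc = defaultdict(list)
    for tid, pc in threads:
        by_pc[pc].append(tid)
    groups = []
    for pc, tids in sorted(by_pc.items()):
        # split each PC's threads into SIMD_WIDTH-sized groups
        for i in range(0, len(tids), SIMD_WIDTH):
            groups.append((pc, tids[i:i + SIMD_WIDTH]))
    return groups


# Two half-diverged warps of 4 threads each: per-warp execution would
# need 4 partially filled issues; packing by PC needs only 2 full ones.
threads = [(t, 100 if t % 2 else 200) for t in range(8)]
for pc, group in pack_by_pc(threads):
    print(pc, group)
```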

> In the Nvidia case, a "register-file cache" is a cache placed between the register file and the operand collector. And it makes sense in their case, since the register file has variable latency (depending on collisions) and because it saves SRAM read power.

Got it, thanks!
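For anyone else following along, the power-saving intuition behind a register-file cache can be sketched with a toy model. The class name, sizes, and LRU policy here are made up for illustration, not Nvidia's actual design; the point is just that back-to-back instructions often reread the same registers, so a small cache in front of the SRAM absorbs those reads:

```python
from collections import OrderedDict

# Toy register-file cache: recently read operands are kept in a small
# structure in front of the register-file SRAM, so repeated reads of the
# same register skip the SRAM access (and its read power).


class RegFileCache:
    def __init__(self, capacity=4):
        self.cache = OrderedDict()  # register name -> value, in LRU order
        self.capacity = capacity
        self.sram_reads = 0

    def read(self, reg, regfile):
        if reg in self.cache:
            self.cache.move_to_end(reg)  # hit: no SRAM access
        else:
            self.sram_reads += 1         # miss: pay one SRAM read
            self.cache[reg] = regfile[reg]
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # evict least recently used
        return self.cache[reg]


regfile = {f"r{i}": i for i in range(16)}
rfc = RegFileCache()
# r1 is reused by three consecutive instructions: only its first read
# touches the SRAM.
for operands in [("r1", "r2"), ("r1", "r3"), ("r1", "r4")]:
    for r in operands:
        rfc.read(r, regfile)
print(rfc.sram_reads)  # 4 SRAM reads instead of 6
```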

P.S. By the way, wanted to thank you for this very interesting conversation. I learned a lot.
