item 40653031

avianes | 1 year ago

> GPUs are massively parallel devices; they need to keep the scheduler and ALU logic as simple and compact as possible

The simplest hardware implementation is not always the most compact or the most efficient. This is a misconception; example below.

> SIMT is just SIMD with some additional capabilities for control flow ..

In the Nvidia uarch, it is not. The key part of the Nvidia uarch is the "operand-collector" and the emulation of a multi-ported register-file using banked (single- or dual-port) SRAM. In a classical SIMD uarch, you just retrieve the full vector from the register-file and execute each lane in parallel. In the Nvidia uarch, by contrast, each ALU has an "operand-collector" that tracks and collects the operands of multiple in-flight operations. This enables reading from the register-file in an asynchronous fashion (by "asynchronous" here I mean not all in the same cycle) without introducing any stall.

When a warp is selected, the instruction is decoded, an entry is allocated in the operand-collector of each ALU involved, and the list of registers to read is sent to the register-file. The register-file dispatches the register reads to the proper SRAM banks (probably with some queuing when read collisions occur). All operand-collectors then wait independently for their operands to arrive from the register-file; when an operand-collector entry has received all the required operands, the entry is marked as ready and can be selected by the ALU for execution.
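As a rough sketch of the flow just described (a toy model, not Nvidia's actual design; `CollectorEntry`, `NUM_BANKS`, and the reg-mod-bank mapping are all assumptions for illustration):

```python
from collections import deque

NUM_BANKS = 4  # assumed number of single-ported SRAM banks

class CollectorEntry:
    """One in-flight operation waiting in an ALU's operand-collector."""
    def __init__(self, op, src_regs):
        self.op = op
        self.pending = set(src_regs)  # source registers not yet received

    @property
    def ready(self):
        return not self.pending  # all operands collected

def collect_and_dispatch(entries):
    """Each cycle, every bank services one read; colliding reads queue.
    An entry dispatches as soon as its last operand arrives, regardless
    of the other entries. Returns (cycles, dispatch order)."""
    queues = [deque() for _ in range(NUM_BANKS)]
    for e in entries:
        for r in sorted(e.pending):
            queues[r % NUM_BANKS].append((e, r))  # bank = reg mod NUM_BANKS
    done, order, cycles = set(), [], 0
    while len(done) < len(entries):
        cycles += 1
        for q in queues:               # one read per bank per cycle
            if q:
                e, r = q.popleft()
                e.pending.discard(r)   # operand delivered to the collector
        for e in entries:
            if e.ready and id(e) not in done:
                done.add(id(e))
                order.append(e.op)     # ready entry goes to its ALU
    return cycles, order

# r0 and r4 land in the same bank, so "fma" waits an extra cycle,
# while "add" (one operand, no conflict) dispatches first:
cycles, order = collect_and_dispatch(
    [CollectorEntry("fma", [0, 4]), CollectorEntry("add", [1])])
print(cycles, order)  # -> 2 ['add', 'fma']
```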

That's why (or one of the reasons why) you need to sync your threads in the SIMT programming model but not in a SIMD programming model.

Obviously you can emulate a SIMT uarch using a SIMD uarch, but I think that misses the whole point of the SIMT uarch.

Nvidia does all of this because it allows designing a more compact register-file (memories with a high number of ports are costly), and probably because it helps to make better use of the available compute resources with masked operations.
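To see why multi-ported memories are costly, here is a back-of-the-envelope calculation (hedged: the quadratic port-area model is a common rule of thumb, not a vendor figure):

```python
def rel_bitcell_area(ports):
    """Rule of thumb: SRAM bit-cell area grows roughly quadratically
    with port count, since both wordlines and bitlines are replicated
    per port."""
    return ports ** 2

# One true 4-port register-file vs. four single-port banks holding the
# same total bits: the banked version needs ~1/16 the area per bit.
ratio = rel_bitcell_area(4) / rel_bitcell_area(1)
print(ratio)  # -> 16.0
```

The banked design pays for this compactness with occasional read collisions, which is exactly what the operand-collectors hide.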

ribit|1 year ago

In an operand-collector architecture the threads are still executed in lockstep. I don't think this makes the basic architecture less "SIMD-y". Operand collectors are a smart way to avoid multi-ported register files, which enables a more compact implementation. Different vendors use different approaches to achieve a similar result: Nvidia uses operand collectors, Apple uses explicit cache control flags, etc.

> This enable to read from the register-file in an asynchronous fashion (by "asynchronous" here I mean not all at the same cycle) without introducing any stall.

You can still get stalls if an EU is available in a given cycle but not all operands have been collected yet. The way I understand the published patents, operand collectors are a data gateway to the SIMD units. The instructions are already scheduled at this point, and the job of the collector is to signal whether the data is ready. Do modern Nvidia implementations actually reorder instructions based on feedback from operand collectors?

> That why (or 1 of the reason) you need to sync your threads in the SIMT programing model and not in an SIMD programming model.

It is my understanding that you need to synchronize threads when accessing shared memory. Not only can different threads execute on different SIMDs, but threads on the same SIMD can also access shared memory over multiple cycles on some architectures. I do not see how thread synchronization relates to operand collectors.

avianes|1 year ago

> In an operand-collector architecture the threads are still executed in lockstep.
>
> [...]
>
> It is my understanding that you need to synchronize threads when accessing shared memory.

Not sure what you mean by lockstep here. When an operand-collector entry is ready, it is dispatched for execution as soon as possible (write arbitration aside), even if other operand-collector entries from the same warp are not ready yet (so not really what I would call "thread lock-step"). But it's possible that Nvidia enforces that all threads from a warp complete before sending the next warp instruction (I would call that something like "instruction lock-step"); this can simplify data-dependency hazard checks. But that's an implementation detail, it's not required by the SIMT scheme.

And yes, it's hard to expose de-synchronization without memory operations, so you only need to sync for memory operations. (The load/store units also have operand-collectors.)

> You can still get stalls if an EU is available in a given cycle but not all operands have been collected yet

That's true, but you have multiple operand-collector entries to minimize the probability that no entry is ready. I should have said "to minimize bubbles".
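A quick illustration of why more entries shrink the bubble probability (assumed numbers, and it treats entry readiness as independent, which real hardware would not guarantee):

```python
# If each in-flight entry is ready in a given cycle with probability p,
# the chance the ALU finds *no* ready entry is (1 - p) ** k for k entries.
p = 0.5
bubble_prob = {k: (1 - p) ** k for k in (1, 2, 4)}
print(bubble_prob)  # -> {1: 0.5, 2: 0.25, 4: 0.0625}
```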

> The way I understand the published patents is that operand collectors are a data gateway to the SIMD units. The instructions are already scheduled at this point and the job of the collector is to signal whether the data is ready. Do modern Nvidia implementations actually reorder instructions based on feedback from operand collectors?

Calling an EU a "SIMD unit" in a SIMT uarch adds a lot of ambiguity, so I'm not sure I understand your point correctly. But yes, the (warp) instruction is already scheduled; the (ALU) operations, however, are re-scheduled by the operand-collector and its dispatch logic. In the Nvidia patents they mention, for example, the possibility of dispatching operations in an order that prevents write collisions.