top | item 38217219

(no title)

reroute22 | 2 years ago

"128 * the-number-of-cores" of threads can make progress truly in parallel (at the same time).

24,576 threads (or however many, I didn't validate the number and it depends on the occupancy, which depends on thread resource usage, like registers => depends on the shader program code) is how many threads can be executed concurrently (as opposed to in parallel), as in, how many of them can simultaneously reside on the GPU. A subset of those at any time are actually executed in parallel, the rest are idle.

You can think of this situation as follows using an analogy with a CPU and an OS:

1. 128 * the-number-of-cores is the number of CPU cores(*1)

2. 24,576 threads is the number of threads in the system that the OS is switching between

Major differences with the GPU:

3. On a CPU context switch (getting a thread off the core, waking up a different thread, restoring the context, and proceeding) takes about 2,000 cycles. On a GPU _from the analogy_ that kind of thread switching takes ~1-10 cycles depending on the exact GPU design and various other details.

4. In CPU/OS world the context switching and scheduling on the OS side is done mostly in software, as the OS is indeed software. In GPU's case the scheduler and all the switching is implemented as fixed function hardware finely permeating the GPU design.

5. In CPU/OS world those 2,000 cycles per context switch is so much larger than a roundtrip to DRAM while executing a load instruction that happened to miss in all caches - which is about 400-800 cycles or so depending on the design - that OS never switches threads to hide latencies of loads, it's pointless. As far as performance is concerned (as opposed to maintaining the illusion of parallel execution of all programs on the computer), the thread switching is used to hide the latency of IO - non-volatile storage access, network access, user input, etc. (which takes millions of cycles or more - so it makes sense).

In the GPU world the switching is so fast, that the hardware scheduler absolutely does switch from thread to thread to hide latencies of loads (even the ones hitting in half of the caches, if that happens), in fact, hiding these latencies and thus keeping ALUs fed is the whole point of this basic design of pretty much all programmable GPUs that there ever were.

6. In real world CPU/OS, the threads that aren't running at the time reside (their local variables, etc) in the memory hierarchy, technically, some of it ends up in caches, but ultimately, the bulk of it on a loaded system is in system DRAM. On a GPU, or I suppose by now we have to say, on a traditional GPU, these resident threads (their local variables, etc) reside in on-chip SRAM that is a part of the GPU cores (not even in a chunk on a side, but close to execution units in many small chunks, one per core). While the amount of DRAM (CPU/OS) is a) huge, gigabytes, and b) easily configurable, the amount of thread state the GPU scheduler is shuffling around is measured typically in hundreds of KBs per GPU Core (so on the order of about "a few MBs" per GPU), and the equally sized SRAM storing this state is completely hardwired in the silicon design of the GPU and not configurable at all.

Hope that helps!

footnotes (*1) a better analogy would be not "number of CPU cores", but "number-of-CPU-cores * SMT(HT) * number-of-lanes-in-AVX-registers", where number-of-lanes-in-AVX-registers is basically "AVX-register-width / 32" for FP32 processing which (the latter) yields about ~8 give or take 2x depending on the processor model. Whether to include SMT(HT) multiplier (2) in this analogy is also murky, there is an argument to be made for yes, and an argument to be made for no, and depends on the exact GPU design in question.

discuss

xoranth|2 years ago

So, in NVIDIA parlance, my Skylake laptop would have 128 "cuda cores"?

128 = 4 (physical cores) * 2 (hyperthreading) * 8 (AVX2 f32 lanes) * 2 (floating point ports per core)

reroute22|2 years ago

Sorta, yeah!

Also, your "128 cuda cores" of Skylake variety run at higher frequencies and work off of much bigger caches, so they are faster (in serial manner)...

...until they are slower, because GPU's latency hiding mechanism (with occupancy) hides load latencies very well, while CPU just stalls the pipeline on every cache miss for ungodly amounts of time...

...until they are faster again when the shader program uses a lot of registers and GPU occupancy drops to the floor and latency hiding stops hiding that well.

But core counts - yes, more or less.

Const-me|2 years ago

Not quite. These floating-point EUs are shared between both threads of the physical core. I would rather say 64 CUDA cores.