item 28728138

Gentle introduction to GPUs inner workings

447 points | ingve | 4 years ago | vksegfault.github.io

55 comments

[+] floatboth|4 years ago|reply
> Mesa 3D - driver library that provides open source driver (mostly copy of AMD-VLK)

Wrong, wrong, very wrong. No copy here. Mesa's RADV was developed completely independently of AMD; in fact, it predates the public release of AMDVLK. It's also possibly the best Vulkan implementation out there. Valve invested heavily in the ACO compiler backend, so it compiles shaders both very well and very quickly.

[+] MayeulC|4 years ago|reply
Not to mention that there is much more to Mesa than RADV: Gallium3D and the state trackers, RadeonSI, and a few other drivers (Apple M1, Qualcomm Adreno, Broadcom, Mali), to name only these.

The status of some drivers is tracked here: https://mesamatrix.net/

[+] MayeulC|4 years ago|reply
Looks like they fixed the mistake, and improved the description a lot.

The description could still be improved, and I started commenting on how it could, but I realize this would warrant its own blog post.

[+] mhh__|4 years ago|reply
(Thank you valve!)
[+] ww520|4 years ago|reply
Parallelism with the computing units in the GPU permeates the entire computing model. For the longest time I didn't get how partial derivative instructions like dFdx/dFdy/ddx/ddy work. None of the docs helped. These instructions take in a number and return its partial derivative; it's just a generic number, nothing to do with graphics, geometry, or any particular function. The number could have been the dollar amount of a mortgage, and its partial derivative is returned.

It turns out these functions are tied to how the GPU architecture runs the computing units in parallel. The computing units are arranged to run in parallel in a geometric grid according to the input data model. They run the same program in LOCK STEP (well, at least in lock step upon arriving at the dFdx/dFdy instructions). They also know their neighbors in the grid. When a computing unit encounters a dFdx instruction, it reaches across and grabs its neighbors' input values to their dFdx instructions. All the neighbors arrive at dFdx at the same time with their input values ready. With the neighbors' numbers and its own number as the midpoint, it can compute the partial derivative as a gradient slope.
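The lock-step neighbor-exchange described above can be sketched on the CPU. A minimal toy model, not real shader or driver code: real GPUs typically compute derivatives over a 2x2 pixel "quad" using forward differences, and the exact difference used varies by hardware and by the coarse/fine instruction variants.

```python
# Toy model of a 2x2 "quad" of SIMT lanes computing dFdx/dFdy.
# All four lanes reach the derivative instruction in lock step,
# so each lane can read its neighbors' input values.

def quad_derivatives(f00, f10, f01, f11):
    """f00=F(x,y), f10=F(x+1,y), f01=F(x,y+1), f11=F(x+1,y+1).
    Returns (dFdx, dFdy), shared by the whole quad."""
    dfdx = f10 - f00   # forward difference across the horizontal neighbors
    dfdy = f01 - f00   # forward difference across the vertical neighbors
    return dfdx, dfdy

# F(x, y) = 3x + 5y sampled on a quad anchored at (0, 0):
print(quad_derivatives(0, 3, 5, 8))   # -> (3, 5)
```

Note that the value fed in really is "just a number" per lane, as the parent says: the function F only exists implicitly, as whatever each lane happened to compute before reaching the instruction.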

[+] bla3|4 years ago|reply
That sounds like a useful mental model. But it can't be quite right, can it? There aren't enough cores to do _all_ pixels in parallel, so how is that handled? Does it render tiles and compute all edges several times for this?
[+] bigdict|4 years ago|reply
So the function F is implicitly defined to have value F(x,y) = v, where x and y are coordinates of the core, and v is the input value to the dFdx/dFdy instruction? Then the output of the instruction running on the x,y core (take dFdx for example) is supposedly equal to (F(x+1,y)-F(x-1,y))/2?
[+] anonymous532|4 years ago|reply
I don't sign in often, thank you for this amazing reveal.
[+] dragontamer|4 years ago|reply
For another gentle introduction to GPU architecture, I like "The Graphics Codex", specifically the chapter on Parallel Architectures: https://graphicscodex.courses.nvidia.com/app.html?page=_rn_p...
[+] M277|4 years ago|reply
Thanks a lot, always enjoy your posts on here and r/hardware! Do you have a hardcore introduction with even more detail / perhaps even with examples of implementations? :)

I find white papers quite good (although I admit there are many things I don't understand yet and constantly have to look up), but even these sometimes feel a bit general.

[+] tppiotrowski|4 years ago|reply
I know there are a lot of JavaScript developers on this forum. If you want to get into GPU programming, I highly recommend the gpu.js [1] library as a jumping-off point. It's amazing how powerful computers are and how we squander most of our cycles.

[1] https://gpu.rocks/#/

Disclaimer: I have one un-merged PR in the gpu.js repo

[+] tenaciousDaniel|4 years ago|reply
Thanks! I’m a JS dev who happens to be very interested in getting into graphics.
[+] sanketsarang|4 years ago|reply
On the same basis, it would also help if you could provide a comparison between the GPUs commonly used for ML: Tesla K80, P100, T4, V100, and A100. How has the architecture evolved to make the A100 significantly faster? Is it just the 80GB RAM, or is there more to it from an architecture standpoint?
[+] einpoklum|4 years ago|reply
> How has the architecture evolved to make the A100 significantly faster?

Oh, very much so. By way more than an order of magnitude. For a deeper read, have a look at the "architecture white papers" for Kepler, Pascal, Volta/Turing, and Ampere:

https://duckduckgo.com/?t=ffab&q=NVIDIA+architecture+white+p...

or check out the archive of NVIDIA's parallel4all blog ... hmm, that's weird, it seems like they've retired it. They used to have really good blog posts explaining what's new in each architecture.

You could also have a look here:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index....

for the table of various numeric sizes and limits which change with different architectures. But that's not a very useful resource in and of itself.

[+] kkielhofner|4 years ago|reply
As a starter, the T4 is heavily optimized for low power consumption on inference tasks. IIRC it doesn't even require additional power beyond what the PCIe bus can provide, but it's basically useless for training, unlike the others.
[+] touisteur|4 years ago|reply
One day I'll get my hands on both an A40 and an A100 and maybe get an answer to the question: does the 5120-bit memory bus help that much? The A100 has fewer CUDA cores and around 1/4 more tensor cores, but seems to be the preferred 'compute' and 'ai training' option all around. What gives?
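The bus-width question can at least be sized up with back-of-the-envelope arithmetic: bandwidth is bus width times per-pin data rate. The ~2.43 Gbit/s effective HBM2 rate below is an assumption chosen to match the A100's published ~1.6 TB/s figure, and the GDDR6 numbers are merely typical, not for any specific card.

```python
# Back-of-the-envelope memory bandwidth: bus width (bits) * per-pin rate.
def bandwidth_gb_s(bus_width_bits, gbit_per_pin):
    return bus_width_bits * gbit_per_pin / 8  # bits -> bytes

# A100: 5120-bit HBM2 bus at roughly 2.43 Gbit/s effective per pin
print(round(bandwidth_gb_s(5120, 2.43)))   # -> 1555 (GB/s)

# A typical GDDR6 card: 384-bit bus at 14 Gbit/s per pin
print(round(bandwidth_gb_s(384, 14)))      # -> 672 (GB/s)
```

So a very wide, relatively slow-per-pin HBM bus can more than double the bandwidth of a narrow, fast GDDR one, which matters a lot for bandwidth-bound training workloads regardless of core counts.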
[+] wpietri|4 years ago|reply
Have folks seen good Linux tools for actually monitoring/profiling the GPU's inner workings? Soon I'll need to scale running ML models. For CPUs, I have a whole bag of tricks for examining and monitoring performance. But for GPUs, I feel like a caveman.
[+] zetazzed|4 years ago|reply
For NVIDIA GPUs, Nsight systems is wildly detailed and has both GUI and CLI options: https://developer.nvidia.com/nsight-systems

For DL specifically, this article covers a couple of options that actually plug into the framework: https://developer.nvidia.com/blog/profiling-and-optimizing-d...

nvidia-smi is the core tool most folks use for quick "top"-like output, but there is also an htop equivalent: https://github.com/shunk031/nvhtop

A lot of other tools are built on top of the low-level NVML library (https://developer.nvidia.com/nvidia-management-library-nvml). There are also Python NVML bindings if you need to write your own monitoring tools.

[+] h2odragon|4 years ago|reply
Very nice. This is gentle like movie dinosaurs are "cheeky lizards". I'd hate to see the "Turkish prison BDSM porn" version.

I'm looking at graphics code, again, from a "I know enough C to shoot myself in the foot and want to draw a circle on the screen" perspective. It's hilarious how much "stack" there is in all the ways of doing that; I look at some of this shit and want to go back to Xlib for its simple grace.

[+] Jasper_|4 years ago|reply
2D graphics is very different and mostly doesn't require the GPU's assistance. If you want to plot a circle on an image and then display that to the screen, you don't require any of this stack. If you want a high-level library that will draw a circle for you, you can use something like Skia or Cairo which will wrap this for you into a C API.

GPUs solve problems of much larger scale, and so the stack has evolved over time to meet the needs of those applications. All this power and corresponding complexity has been introduced for a reason, I assure you.
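The "plot a circle on an image" case really is a few lines of plain CPU code against a pixel buffer, with no GPU stack involved. A sketch using the classic midpoint circle algorithm; the list-of-lists "image" here is just a stand-in for whatever surface you would eventually blit to the screen.

```python
# Plot a circle outline into a plain in-memory pixel buffer on the CPU:
# no GPU, no driver stack. Midpoint (Bresenham) circle algorithm.

def draw_circle(width, height, cx, cy, r):
    buf = [[0] * width for _ in range(height)]  # 0 = background, 1 = circle
    x, y, err = r, 0, 1 - r
    while x >= y:
        # Plot the eight symmetric octant points.
        for dx, dy in ((x, y), (y, x), (-y, x), (-x, y),
                       (-x, -y), (-y, -x), (y, -x), (x, -y)):
            px, py = cx + dx, cy + dy
            if 0 <= px < width and 0 <= py < height:
                buf[py][px] = 1
        y += 1
        if err < 0:
            err += 2 * y + 1
        else:
            x -= 1
            err += 2 * (y - x) + 1
    return buf

# Render a small circle as ASCII art.
img = draw_circle(16, 16, 8, 8, 5)
print("\n".join("".join(".#"[p] for p in row) for row in img))
```

Libraries like Skia or Cairo do essentially this (with antialiasing and much more) behind a C API, which is why they are the usual answer for 2D work.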

[+] devit|4 years ago|reply
That's very verbose but not super clear.

A better explanation is that the main problems of processor design are that memory reads take 100-1000 times as long as an arithmetic operation, and that hardware is faster when operations run in parallel.

CPUs handle those issues by having large memory caches and lots of circuitry to execute instructions "out of order", i.e. to run other instructions that don't depend on the memory read result or on the results of other operations. This is great for running sequential code as fast as possible, but quite inefficient overall.

GPUs instead handle the memory problem by switching to another thread in hardware, and the parallelism problem mainly by using SIMD (with masking and scatter/gather memory accesses, so the lanes look like multiple threads). This works well if you are doing mostly the same operations on thousands or millions of values, which is exactly what graphics rendering and GPGPU are.

Then there are also DSPs, which solve the memory access issue by having only a small amount of on-chip memory plus either explicit DMA or memory reads that give a delayed result, and the parallelism issue by being VLIW.

And finally there are the in-order/microcontroller CPUs, which simply don't care about performance and do the cheapest and simplest thing.
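The GPU latency-hiding trick above can be quantified with a little arithmetic: if a memory read stalls for L cycles and each thread does C cycles of compute per read, you need roughly (L + C) / C resident threads to keep the ALUs busy. A toy occupancy model; the cycle counts are illustrative, not measured from any real chip.

```python
# How many hardware threads does it take to hide memory latency?
# Each thread computes for `compute_cycles`, then stalls for
# `mem_latency` cycles on a read; the scheduler needs enough other
# threads to fill the stall with useful work.

def threads_to_hide_latency(mem_latency, compute_cycles):
    # One thread occupies the ALU for compute_cycles out of every
    # (compute_cycles + mem_latency) cycles; take the ceiling.
    return -(-(mem_latency + compute_cycles) // compute_cycles)

# Illustrative numbers: ~400-cycle DRAM read, 8 cycles of math per read.
print(threads_to_hide_latency(400, 8))    # -> 51

# With 100 cycles of math per read, far fewer threads suffice.
print(threads_to_hide_latency(400, 100))  # -> 5
```

This is why GPUs keep tens of thread groups resident per core while a CPU core runs one or two: the GPU buys latency tolerance with thread count instead of caches and out-of-order machinery.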

[+] pixelpoet|4 years ago|reply
This article says that AMD GPUs are vector in nature, but I think that stopped being the case with GCN; before that they had some weird vector stuff with 5 elements or something.
[+] dragontamer|4 years ago|reply
Wrong. GPUs are definitely vector (GCN in particular being 64 x 32-bit wide vectors).

What you're confusing is "Terascale" (aka: the 6xxx series from the 00s), which was VLIW _AND_ SIMD. Which was... a little bit too much going on and very difficult to optimize for. GCN made things way easier and more general purpose.

Terascale was theoretically more GFLOPs than the first GCN processors (after all, VLIW unlocks a good amount of performance), but actually utilizing all those VLIW units per clock tick was a nightmare. It's hard enough to write good SIMD code as it is.

[+] sidewinder128|4 years ago|reply
Awesome read, I get a better sense of how GPUs work now. Thank you to the author!
[+] ai_ja_nai|4 years ago|reply
Not gentle at all :) Requires familiarity.

Also, what about Mesa 3D being a copy of AMD-VLK? It predates it.

[+] jimmyvalmer|4 years ago|reply
> GTX version of Turing architecture (1660, 1650) has no Tensor cores, instead it has freely available FP16 units!

It's always a bad sign when an author's exclamation has all the surprise factor of a tax code. As a previous poster said, "gentle" is relative.