item 37967126

What every developer should know about GPU computing

465 points | Anon84 | 2 years ago | codeconfessions.substack.com

176 comments

[+] dang|2 years ago|reply
Someone emailed to complain about this:

https://twitter.com/abhi9u/status/1715753871564476597

That is against HN's rules. In fact, it's the one thing that's important enough to be in both the site guidelines and FAQ. HN users feel extremely strongly about this.

Q: Can I ask people to upvote my submission?

A: No. Users should vote for a story because they personally find it intellectually interesting, not because someone has content to promote. We penalize or ban submissions, accounts, and sites that break this rule, so please don't.

https://news.ycombinator.com/newsfaq.html

Don't solicit upvotes, comments, or submissions. Users should vote and comment when they run across something they personally find interesting—not for promotion.

https://news.ycombinator.com/newsguidelines.html

[+] abhi9u|2 years ago|reply
Sorry, I was not aware of this rule. That said, it was submitted by someone I don't know.

But I won't do it again now that I know.

[+] 01100011|2 years ago|reply
> Copying Data from Host to Device

Surprised there's no mention of async copies here. If you want to get the most out of the GPU, you don't want it idle while copying data between the host and the GPU. Many frameworks provide a mechanism to schedule async copies, which can execute alongside async work submission.

The post is sort of a GPU 101, but there's a whole world of tricks and techniques beyond that once you start doing real-world GPU programming, where you want to squeeze as much as possible out of the expensive GPU. Profiling tools help a lot here because, as with much optimization work these days, there are hidden cliffs and non-linearities all over that you have to be aware of.
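The pattern described here is essentially double buffering: start the copy of chunk i+1 while computing on chunk i. Below is a minimal CPU-side sketch of the idea, using a worker thread as a stand-in for a copy engine. The chunking and helper names are illustrative; real GPU code would use something like CUDA streams with cudaMemcpyAsync instead.

```python
import concurrent.futures
import time

def copy_chunk(chunk):
    # Stand-in for an async host-to-device transfer.
    time.sleep(0.01)
    return chunk

def compute(chunk):
    # Stand-in for a kernel operating on the previously copied chunk.
    return sum(chunk)

def pipelined(chunks):
    """Overlap the copy of chunk i+1 with the compute on chunk i."""
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as copier:
        pending = copier.submit(copy_chunk, chunks[0])
        for nxt in chunks[1:]:
            ready = pending.result()                  # wait for the in-flight copy
            pending = copier.submit(copy_chunk, nxt)  # start the next copy...
            results.append(compute(ready))            # ...while computing on this one
        results.append(compute(pending.result()))
    return results

print(pipelined([[1, 2], [3, 4], [5, 6]]))  # [3, 7, 11]
```

The structure (not the thread pool) is what carries over to GPU code: as long as each copy targets a different buffer than the one being computed on, the two can proceed concurrently.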

[+] nine_k|2 years ago|reply
Since you likely use 64-bit (double) floats, not every GPU would help much, especially compared to a beefy CPU.

But if you use a GPU with a large number of FP64 units, it may speed things up a lot. These are generally not gaming GPUs, but if you have a 4060 sitting around anyway, it has about 300 GFLOPS FP64 performance, likely more than your CPU. Modern CPUs are mighty in this regard though, able to issue many FP64 operations per clock per core.
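For a rough sense of these numbers, here's a back-of-envelope peak-FLOPS comparison. All figures (core counts, clocks, FMA widths, the 1/64 FP64 rate) are illustrative assumptions, not measurements of any particular part:

```python
def peak_gflops(units, flops_per_unit_per_cycle, clock_ghz):
    # Peak = execution units x flops issued per unit per cycle x clock.
    return units * flops_per_unit_per_cycle * clock_ghz

# Hypothetical 16-core CPU with AVX-512:
# 8 FP64 lanes x 2 FMA ports x 2 flops/FMA = 32 flops/cycle/core.
cpu_fp64 = peak_gflops(16, 32, 4.0)        # 2048 GFLOPS peak

# Hypothetical consumer GPU: 3072 shader lanes x 2 flops (FMA) x 2.5 GHz,
# with FP64 throttled to 1/64 of the FP32 rate (common on gaming parts).
gpu_fp64 = peak_gflops(3072, 2, 2.5) / 64  # 240 GFLOPS peak

print(cpu_fp64, gpu_fp64)
```

Which is the point: a fat desktop CPU can out-issue a consumer GPU at FP64, while datacenter GPUs with full-rate FP64 units flip the comparison.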

[+] permo-w|2 years ago|reply
>Most programmers have an intimate understanding of CPUs

maybe this article is brilliant, but when the first line is something so blatantly untrue it really makes it hard to take the rest seriously

[+] PTOB|2 years ago|reply
Try this on: "A non-trivial number of Computer Scientists, Computer Engineers, Electrical Engineers, and hobbyists have ..."

Took some philosophy courses for fun in college. I developed a reading skill there that lets me forgive certain statements by improving them instead of dismissing them. My brain now automatically translates over-generalizations and even outright falsehoods into rationally-nearby true statements. As the argument unfolds, those ideas are reconfigured until the entire piece can be evaluated as logically coherent.

The upshot is that any time I read a crappy article, I'm left with a new batch of true and false premises or claims about topics I'm interested in. And thus my mental world expands.

[+] anonporridge|2 years ago|reply
Definitely not true about most programmers, but maybe the author meant CS educated engineers. Going through a formal CS program will give you an intimate understanding of CPUs, especially when compared to the very light coverage of GPUs.
[+] twixfel|2 years ago|reply
I don't understand why every other submission on the internet has to have at least one "stopped reading at X" comment relating to it. It adds absolutely nothing.
[+] moritzwarhier|2 years ago|reply
I think at least 50% of the answers to this depend on how one defines "intimate understanding"...

I learned basic facts about CPU architectures at university, know the landscape of things in a very basic way, and occasionally stumble upon updates to my limited knowledge... but by no means would I call that intimate; it's more like "a basic understanding of how CPUs work / are designed / are to be used" (?)

If I were proficient in assembler, maybe I'd claim to have an "intimate understanding" of how to use CPUs at a low level (still sounds a bit braggy)

It still is not the same though as being an expert in CPU/GPU design.

So yeah I agree.

Article is interesting though, esp. the diagram!

[+] rjh29|2 years ago|reply
I learned it both in my degree and in the Structure and Interpretation of Computer Programs course (which I recommend to anyone interested in low-level computing)
[+] njacobs5074|2 years ago|reply
Agree that saying "intimate understanding" is a bit off the mark. Had the author written "intuitive understanding", it would have made a bit more sense.

However, given the prevalence of the von Neumann computing architecture, I don't think it's completely off the mark - even if people don't know von Neumann's name :)

[+] ashu|2 years ago|reply
And this is the most insightful thing you had to say about this?! Pfft.
[+] Const-me|2 years ago|reply
> During execution the registers allocated to a thread are private to it, i.e., other threads cannot read/write those registers.

Wave intrinsics in HLSL, and similar CUDA intrinsics, can read registers from different threads within the current wavefront.

Also, in the paragraph about the memory architecture, I would mention that the caches provide no coherency guarantees across threads of the same dispatch/grid, but there’s a special functional block, global to the complete chip, which implements atomics on global memory.
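For readers who haven't seen wave/warp intrinsics, here's a plain-Python simulation of the butterfly reduction they enable (in CUDA this is the `__shfl_xor_sync` idiom). The 32-lane warp width matches Nvidia hardware; the function name is my own:

```python
def warp_sum(lanes):
    """Simulate a butterfly all-reduce across a 32-lane warp.

    Mirrors the CUDA pattern:
        for (int m = 16; m > 0; m >>= 1)
            v += __shfl_xor_sync(0xffffffff, v, m);
    where each lane reads a register value from its XOR-partner lane.
    """
    assert len(lanes) == 32
    vals = list(lanes)
    mask = 16
    while mask > 0:
        # Every lane adds the value held by lane (i XOR mask).
        vals = [vals[i] + vals[i ^ mask] for i in range(32)]
        mask >>= 1
    return vals  # every lane now holds the full sum

print(warp_sum(list(range(32)))[0])  # 496
```

After five steps every lane holds the total, with no shared-memory traffic at all, which is exactly why these intrinsics exist.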

[+] paulddraper|2 years ago|reply
SIMD programming is f---ing wild.

Want to run a calculation for every pixel on your screen? No problem.

Want to have a branching condition? Ouchie.
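The usual workaround is predication: evaluate both sides for every lane, then select per lane. NumPy's `where` mimics that execution model and makes the cost visible, since both branches are computed for the whole array:

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 4.0])

# Scalar, branchy version:  y = sqrt(x) if x >= 0 else 0.0
# SIMD-style version: compute BOTH sides for every lane, then select.
# np.maximum guards the sqrt so the "untaken" lanes stay well-defined.
y = np.where(x >= 0, np.sqrt(np.maximum(x, 0.0)), 0.0)
print(y)  # [0. 0. 0. 1. 2.]
```

That both-sides cost is roughly what a diverged warp pays on real hardware.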

[+] eimrine|2 years ago|reply
Want to have eval? Stop everything.
[+] winwang|2 years ago|reply
To be fair, this makes sense: making a smart decision is "harder" than scaling a simple calculation out to a bunch of workers.
[+] toppy|2 years ago|reply
Why are they still called GPU? PPU (Parallel Processing Unit) sounds like a better name.
[+] zackmorris|2 years ago|reply
This is a great writeup. And GPUs are more advanced/performant for what they do than anything I could ever come up with.

But I put SIMD in the category of something that isn't necessary once one has learned other (more flexible) paradigms. I prefer MIMD and clusters/transputers, which seem to have died out by the 2000s. Today's status quo puts the onus on developers to move data manually, write shaders under arbitrary limitations on how many memory locations can be accessed simultaneously, duplicate their work with separate languages for GPU/CPU, know if various hardware is available for stuff like ray tracing, get locked into opinionated frameworks like OpenGL/Metal/Vulkan, etc etc etc. GPUs are on a side tangent that can never get me to where I want to go, so my experience over the last 25 years has been of a person living on the wrong timeline. I've commented about it extensively but it just feels like yelling into the void now.

Loosely, a scalable general purpose CPU working within the limitations of the end of Moore's law is multicore with local memories, sharing data through a copy-on-write content-addressable memory or other caching scheme which presents a single unified address space to allow the user to freely explore all methods of computation in a desktop computing setting. It uses standard assembly language but is usually programmed with something higher level like Erlang/Go, Octave/MATLAB or ideally a functional programming language like Julia. 3D rendering and AI libraries are written as a layer above that, they aren't fundamental.

It's interesting that GPUs have arrived at roughly the multicore configuration that I spoke of, but with drivers that separate the user from the bare-metal access needed to do general purpose MIMD. I had thought that FPGAs were the only way to topple GPU dominance, but maybe there is an opportunity here to write a driver that presents GPU hardware as MIMD with a unified memory. I don't know how well GPU cores handle integer math, but that could be approximated with the 32 bit int portion of a 64 bit float. Those sorts of tradeoffs may result in a MIMD machine running 10-100 times slower than a GPU, but still 10-100 times faster than a CPU. But scalable without the over-reliance on large caches and fast busses which stagnated CPUs since around 2007 when affordability and power efficiency took priority over performance due to the mobile market taking over. And MIMD machines can be clustered and form distributed compute networks like SETI@home with no changes to the code. To get a sense of how empowering that could be to the average user: it's like comparing BitTorrent to FTP, but for compute instead of data.

[+] guidedlight|2 years ago|reply
One thing I don’t understand is how the architecture of Apple Silicon is different from NVidia’s.

Looking at this quote:

> the Nvidia H100 GPU has 132 SMs with 64 cores per SM, totalling a whopping 8448 cores.

8448 cores sure sounds impressive. But the Apple M2 Ultra only has 76 cores?!

How can the NVidia H100 GPU have over 110x more cores? Clearly it doesn’t have 110x more performance over the M2 Ultra, so what is going on here?
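Part of the answer is that the two vendors count at different granularities: an Nvidia "CUDA core" is roughly one SIMD lane, while an Apple "GPU core" is a whole multiprocessor containing on the order of 128 ALUs. Normalizing to lanes, using the article's figures plus a commonly cited approximate ALU count for Apple, makes the numbers comparable:

```python
# Counting granularity, not raw performance (approximate, illustrative figures).
h100_lanes = 132 * 64      # SMs x "CUDA cores" (SIMD lanes) per SM = 8448
m2_ultra_lanes = 76 * 128  # GPU cores x ~128 ALUs per core = 9728

print(h100_lanes, m2_ultra_lanes)  # same order of magnitude
```

The remaining performance gap comes from clocks, memory bandwidth, FP64 support, tensor units, and so on, not from a 110x difference in parallel hardware.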

[+] mannyv|2 years ago|reply
Now I understand why ML uses floats for precision. It wasn't a choice, it was because graphics code uses them.

Another piece in the "why is ML so inefficient" puzzle!

I wonder what that memory-copying overhead is IRL. If it's like normal stuff, it'll be brutal. I mean, they offload TCP processing into hardware to avoid exactly that. This is way more data, though it is done in bigger chunks.

[+] diimdeep|2 years ago|reply
Also check out this talk and slides from a few years ago about CPU and GPU nitpicks

Alexander Titov — Know your hardware: CPU memory hierarchy https://youtu.be/QOJ2hsop6hM

https://github.com/alexander-titov/public/blob/master/confer...

Know Your Hardware - CPU Memory Hierarchy -- Alexander Titov -- C++ Moscow Meetup March 2019.pdf

https://github.com/alexander-titov/public/blob/master/confer...

GPGPU - what it is and why you should care -- Alexander Titov -- CoreHard 2019.pdf

[+] Jeff_Brown|2 years ago|reply
Not every developer.

I'm not trying to be snarky. I think there's an unhelpful compulsion among STEM types like programmers (of which I am one) to want to know everything about everything. Specialization is fundamental to the success, not just of whole economies, but of the individuals in them. It can feel like a painful sacrifice to admit that you'll never (have time to) learn, say, the entire Python language specification, or how type inference works, or any number of other things someone might tell you is critical knowledge. But it's often liberating, and more often than that, mandatory.

(Maybe I was trying to be a little snarky initially, but I'm not any more.)

[+] yodsanklai|2 years ago|reply
> It can feel like a painful sacrifice to admit that you'll never (have time to) learn

And even if you do have time, you'll probably forget most of it if you don't use it in your daily job. Also, if you do want to learn a new topic, it'll take commitment. I remember going through an MIT OS project; it took me something like 50 hours to complete. Pretty much impossible when you have a full-time job (I didn't at the time). And despite that, I still consider myself a newbie in OS development.

That being said, a little extra knowledge can come handy and make a difference in an interview for instance, or reduce ramp-up time when changing teams.

This is also what school is for: it gives you dedicated time and a structured program to pick up the fundamentals. There's only so much you can learn once you have a demanding job. It's actually pretty sad we don't get to go back to school in the middle of our careers.

Edit: still an interesting article ;)

[+] loeg|2 years ago|reply
At this point, I think the running shtick / inside joke of "Every Developer Should Know ..." headlines is that of course every developer doesn't need to know the contents of the article that follows.
[+] cratermoon|2 years ago|reply
I dunno. The section "Latency Tolerance, High Throughput and Little’s Law" has many applications in programming. Ever need to scale your cache, or size a connection pool? Little's Law.
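Concretely, Little's law sizes a pool directly: average concurrency Qd = throughput T x latency L. A one-line sketch, with illustrative numbers:

```python
def littles_law_qd(throughput_per_s, latency_s):
    # Qd = T x L: average number of requests in flight at steady state.
    return throughput_per_s * latency_s

# To sustain 1000 req/s at 50 ms per request, you need ~50 connections
# (or outstanding requests) on average:
print(littles_law_qd(1000, 0.050))  # 50.0
```

The article applies the same identity to GPUs: with long memory latencies, you need many threads in flight to keep throughput up.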
[+] Agingcoder|2 years ago|reply
I’d say very few, actually (and I say this from the perspective of someone who used to work in HPC). If devs need to know about hardware at all, it's mostly their primary platform, i.e. the CPU. GPUs used for general-purpose computing (I’m deliberately excluding games here, and even then it’s not obvious) and programmed by people who don’t write ML/HPC libraries are far from ubiquitous.

Yes, you want to know as much as possible (it helps with debugging and zooming in on issues, since you don’t need to introduce an outsider to your problem, and it helps avoid errors); yes, you need to specialize somewhere; and no, you can’t know everything and often don’t need to.

[+] nabla9|2 years ago|reply
This is just the minimum for those who don't need to know the details.

Even general knowledge has depth. Even if you are a generalist, or a specialist in a different area, you should gradually deepen your knowledge in every area.

[+] pjmlp|2 years ago|reply
Indeed, one thing that most seniors learn is humility and being able to say I don't know, without caring about consequences.
[+] crabbone|2 years ago|reply
You are interpreting the title literally, while all the article is trying to do is give an introduction to anyone who wants to get into GPU programming.

Now, about every developer. Ideally, developers should know something, in general, about all fields related to programming. Similar to how in medicine you need to learn about different branches of medicine, even though you'll specialize in one, and so it is in mathematics, physics and so on.

Presently, the demand for quality professionals in programming is very low. There aren't any good testing or certification programs that can tell a good programmer from a bad one. The industry is generally happy with "specialists" who perhaps only know to do one thing, somewhat. So, presently you don't need to know anything about GPU or any other field that's not directly related to your job description.

----

Now, about the article itself. While it gives a lot of valuable factual information, it's missing the forest for the trees. It's very dedicated to how CUDA works or some other particular aspects of NVidia's GPUs. The part that's missing is the part that could make it, potentially, a candidate for the kind of introduction to GPU programming that would make it worth reading to expand your general understanding of how computers work.

If you ever paid attention to how encyclopedic articles are written: the definition given by an encyclopedia often has two components. The first puts the object of the definition into a more general category; the second explains how the object differs from the other elements of that category. What the GPU article is missing is the first component: putting GPU programming into the more general category. In practical terms, this leaves questions like "is DPU programming anything like GPU programming?" or "can `smart' SSDs (with FPGAs) be treated similarly to GPUs?" unanswered.

[+] cj|2 years ago|reply
One thing I find interesting about the software industry is our lack of descriptive job titles.

At big companies you have SWE L1, L2, L3, senior, staff, principal, etc. You also have SRE and maybe some devops or architect roles.

At smaller companies you have lots of people with generic “engineer” titles or “full stack engineer” titles, etc.

Why don’t we encode people’s specialties in their titles if most engineers are working on narrow sections of software?

E.g. “React & Node Developer” instead of “Full stack engineer” … etc

I suppose the easiest rationale is generic titles allow for easier mobility between disciplines.

[+] nativeit|2 years ago|reply
I can appreciate what you’re getting at in mourning the absence of greater opportunities for in-depth learning, but I personally value and appreciate the learning process such that I am overwhelmed by gratitude that I will likely never be without something to hold my interest. I have a deep understanding of the things I use in my daily work, but I think holding a breadth of knowledge is also useful in that you have a higher appreciation for what other specialists know and do, and in the event you need something novel, you may have a head start in getting to the knowledge you need at the time. I view it as unreservedly positive to audit many subjects even if you cannot engage with them further.
[+] Jiro|2 years ago|reply
Over-specialization is hard on your ability to find another job if you lose your current one.
[+] nologic01|2 years ago|reply
It is anybody's guess what the future will bring, but on past form GPU programming will remain a niche for specific highly-tuned (HPC) applications and mere mortals can focus on somewhat easier multi-core CPU programming instead.

The main reason GPU gets so much attention these days is that CPU manufacturers (Intel in particular) simply cannot get their act together. Intel had promised significant breakthroughs with Xeon Phi like a decade ago.

In the meantime, people have invented more and more applications that need significant computational power. But the CPU will eventually get there. E.g., AMD's latest Epyc features 96 cores. Importantly, that computational power is available, in principle, with simpler / more familiar programming models.

[+] qwertox|2 years ago|reply
Let's assume I have an array of 10,000 lat/lng pairs. I want to compute the length of the track. I duplicate the array, remove the first item in the duplicated array, and append the last entry of the original array to the duplicate so they are equal in length.

Then I use a vectorized haversine algorithm on these arrays to obtain a third one with the distances between each "row" of the two arrays.

With NumPy this is fast, but I guess that a GPU could also perform this and likely do it much faster. Would it be worth it if one has to consider that the data needs to be moved to the GPU and the result be retrieved?
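For reference, here's a NumPy sketch of that computation; note that slicing (`lat[:-1]`, `lat[1:]`) pairs each point with its successor without physically duplicating the array. The function name and the spherical-Earth radius are my own choices:

```python
import numpy as np

def track_length_m(lat, lng, radius_m=6_371_000.0):
    """Total track length, given lat/lng in degrees (vectorized haversine)."""
    lat, lng = np.radians(lat), np.radians(lng)
    dlat = lat[1:] - lat[:-1]   # consecutive-point deltas via slicing
    dlng = lng[1:] - lng[:-1]
    a = np.sin(dlat / 2) ** 2 + np.cos(lat[:-1]) * np.cos(lat[1:]) * np.sin(dlng / 2) ** 2
    return float(radius_m * np.sum(2 * np.arcsin(np.sqrt(a))))

# Sanity check: ~111 km per degree of latitude along a meridian.
print(round(track_length_m(np.array([0.0, 1.0]), np.array([0.0, 0.0]))))  # 111195
```

The same slicing trick would apply on the GPU side too; the real question, as the replies note, is whether the transfer cost is amortized at this problem size.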

[+] dataflow|2 years ago|reply
Not the answer to your question, but:

> I duplicate the array and remove the first item in the duplicated array, append the last entry of the original array to the duplicate in order for them to be equal in length.

I assume/hope this is only what you're doing logically, not physically?

Otherwise you might as well just compute the length using n - 1 points from the existing array, then do the remaining portion manually and add it to the existing sum. That would avoid the copying of the whole array.

[+] tlb|2 years ago|reply
Probably not. The computation is only a few trig instructions per array element, so most of the time is moving data on either CPU or GPU.
[+] dist-epoch|2 years ago|reply
> Would it be worth it if one has to consider that the data needs to be moved to the GPU and the result be retrieved

Depends on your batch size. If the computation on the CPU takes less than, let's say, 200 ms, it's probably not worth it.

Also consider that integrated GPUs don't have separate memory; I'm not sure, but they might not have a high cost for moving data to memory.
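For scale, a rough transfer-cost estimate for the 10,000-point track above. The ~16 GB/s figure is an assumed effective PCIe 4.0 x16 bandwidth, not a measurement:

```python
def transfer_time_s(n_bytes, bandwidth_bytes_per_s=16e9):  # ~PCIe 4.0 x16
    return n_bytes / bandwidth_bytes_per_s

n_bytes = 10_000 * 2 * 8      # 10k lat/lng pairs as float64
t = transfer_time_s(n_bytes)
print(t)  # 1e-05 -> ~10 microseconds each way
```

At that size the transfer itself is negligible; the fixed per-call overheads (kernel launch, driver work) are what dominate, which is why tiny batches rarely win on a discrete GPU.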

[+] jokoon|2 years ago|reply
I wish it was easier to program a GPU...

I've refrained from learning Vulkan because it scares me, but similarly, OpenGL and CUDA are a bit mysterious to me, and I don't really know how I could take advantage of them, since most computing tasks cannot be made parallel.

I've read there are data structures that are somehow able to take advantage of a GPU as an alternative to the CPU (for example a database running on a GPU), but it seems a very new domain, and I don't have the skill to explore it.

[+] kdwikzncba|2 years ago|reply
> We can understand this with the help of Little’s law from queuing theory. It states that the average number of requests in the system (Qd for queue depth) is equal to the average arrival rate of requests (throughput T) multiplied by the average amount of time to serve a request (latency L).

First off, this is obviously false. If you can serve 9req/s and you're getting 10req/s the size of the queue depth is growing at a rate of 1req/s. It's not stationary.

Second, what's the connection between this and gpus? What's the queue? What's the queue depth? What are the requests?

Seems to me that the article focuses more on being smart than actually learning.

[+] minitoar|2 years ago|reply
The scenario is that you’re calculating Qd given a static average latency. Absent that, this formula doesn’t give you a way to compute Qd. What is the average amount of time to service a request in a system where the queue depth is growing without bound?
[+] ssivark|2 years ago|reply
> First off, this is obviously false. If you can serve 9req/s and you're getting 10req/s the size of the queue depth is growing at a rate of 1req/s. It's not stationary.

I haven’t formally studied any queuing theory, but I think:

1. The rule assumes you have enough processing power to service the average load (otherwise it fails catastrophically like you mentioned)

2. The rule is trying to model the fluctuations in the pending load (which might determine wait time or whatever else).
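Both points can be made concrete in a few lines: Little's law describes the steady state, which only exists when the service rate covers the arrival rate; otherwise the depth grows without bound, as the grandparent says. A toy per-second simulation:

```python
def queue_depth_over_time(arrival_rate, service_rate, seconds):
    # Discrete per-second accounting of a single queue's depth.
    depth, history = 0.0, []
    for _ in range(seconds):
        depth = max(0.0, depth + arrival_rate - service_rate)
        history.append(depth)
    return history

print(queue_depth_over_time(9, 10, 5))   # stable: depth stays at 0
print(queue_depth_over_time(10, 9, 5))   # unstable: grows 1, 2, 3, ...
```

In the GPU context of the article, "requests" are in-flight memory accesses: the law tells you how many must be outstanding (via many threads) to sustain peak bandwidth at a given latency.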

[+] osigurdson|2 years ago|reply
>> Most programmers have an intimate understanding of CPUs

I'd say the mental model for most programmers is: lines of text, in their language of choice, zipping by really fast.

[+] RevEng|2 years ago|reply
This is one of the better explanations I've seen of how GPU programming works. I'll be using this for my mentees in the future. Well done!
[+] fatih-erikli|2 years ago|reply
I think GPU computing should not be done in the application layer. It's way too low-level.
[+] markhahn|2 years ago|reply
It's not just Nvidia-specific, but Nvidia-biased: the misuse of "core", pretending that CPUs are backwards, etc.

Every developer should know about more than Nvidia's spin.