
How to Think About GPUs

395 points | alphabetting | 6 months ago | jax-ml.github.io

122 comments


hackrmn|6 months ago

I find the piece, much like a lot of other documentation, "imprecise". Like most such efforts, it likely caters to a group of people expected to benefit from being explained what a GPU is, but it fumbles its terms, e.g. (the first image with burned-in text):

> The "Warp Scheduler" is a SIMD vector unit like the TPU VPU with 32 lanes, called "CUDA Cores"

It's not clear from the above what a "CUDA core" (singular) _is_ -- this is the archetypical "let me explain things to you" error most people make, in good faith usually -- if I don't know the material, and I am out to understand, then you have gotten me to read all of it but without making clear the very objects of your explanation.

And so, because of these kinds of "compounding errors", the people the piece was likely targeted at are none the wiser, really, while those who already have a good grasp of the concepts being explained, like what a CUDA core actually is, already know most of what the piece is trying to explain anyway.

My advice to everyone who starts out with a back-of-envelope cheatsheet and then decides to publish it "for the good of mankind", e.g. on GitHub: please be surgically precise with your terms -- the terms are your trading cards, then come the verbs etc. I mean, this is all writing 101, but it's a rare thing, evidently. Don't mix and match terms, don't conflate them (the reader will do it for you many times over for free if you're sloppy), and be diligent with analogies.

Evidently, the piece may have been written to help those already familiar with TPU terminology -- it mentions "MXU" but there's no telling what that is.

I understand I am asking for a tall order, but the piece is long, and all the effort that was put in could have been complemented with minimal extra hypertext, like annotations for abbreviations such as "MXU".

I can always ask $AI to do the equivalent for me, which is a tragedy according to some.

jacobaustin123|6 months ago

Shamelessly responding as the author. I (mostly) agree with you here.

> please be surgically precise with your terms

There's always a tension between precision in every explanation and the "moral" truth. I can say "a SIMD (Single Instruction Multiple Data) vector unit like the TPU VPU with 32 ALUs (SIMD lanes) which NVIDIA calls CUDA Cores", which starts to get unwieldy and even then leaves terms like vector units undefined. I try to use footnotes liberally, but you have to believe the reader will click on them. Sidenotes are great, but hard to make work in HTML.

For terms like MXU, I was intending this to be a continuation of the previous several chapters which do define the term, but I agree it's maybe not reasonable to assume people will read each chapter.

There are other imprecisions here, like the term "Warp Scheduler" is itself overloaded to mean the scheduler, dispatch unit, and SIMD ALUs, which is kind of wrong but also morally true, since NVIDIA doesn't have a name for the combined unit. :shrug:

I agree with your points and will try to improve this more. It's just a hard set of compromises.

hyghjiyhu|6 months ago

Interestingly, I find LLMs are really good for this problem: when looking up one term just leads to more unknown terms and you struggle to find a starting point from which to understand the rest, they can tell you where to start.

robbies|6 months ago

I’m being earnest: what is an appropriate level of computer architecture knowledge? SIMD is 50 years old.

From the resource intro:

> Expected background: We’re going to assume you have a basic understanding of LLMs and the Transformer architecture but not necessarily how they operate at scale.

I suppose this doesn’t require any knowledge about how computers work, but core CPU functionality seems…reasonable?

pseudosavant|6 months ago

My recursive brain got a chuckle out of wondering about "imprecise" being in quotes. I found the quotes made the meaning a touch...imprecise.

While I can understand the imprecise point, I found myself very impressed by the quality of the writing. I don't envy making digestible prose about the differences between GPUs and TPUs.

uberduper|6 months ago

This is a chapter in a book targeting people working in the machine learning domain.

tormeh|6 months ago

I find it very hard to justify investing time into learning something that's neither open source nor has multiple interchangeable vendors. Being good at using Nvidia chips sounds a lot like being an ABAP consultant or similar to me. I realize there's a lot of money to be made in the field right now, but IIUC historically this kind of thing has not been a great move.

raincole|6 months ago

Yeah that was what I told myself a decade ago when I skipped CUDA class during college time.

the__alchemist|6 months ago

The principles of parallel computing, and how they work at the hardware and driver levels, are broader. Some parts of it are provincial (a strong province, though...), and others are more general.

It's hard to find skills that don't have a degree of provincialism. It's not a great feeling, but you move on. IMO, don't over-idealize the concept of general knowledge to your detriment.

I think we can also untangle the open-source part from the general/provincial. There is more to the world worth exploring.

physicsguy|6 months ago

It really isn't that hard to pivot. It's worth saying that if you were already writing OpenMP and MPI code then learning CUDA wasn't particularly difficult to get started, and learning to write more performant CUDA code would also help you write faster CPU bound code. It's an evolution of existing models of compute, not a revolution.

saagarjha|6 months ago

Sure, but you can make money in the field and retire faster than it becomes irrelevant. FWIW none of the ideas here are novel or nontransferable–it's just the specific design that is proprietary. Understanding how to do an AllReduce has been of theoretical interest for decades and will probably remain worth doing far into the future.
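
For readers curious what an AllReduce actually does, here is a minimal pure-Python simulation of the classic ring algorithm (the function and variable names are mine, purely for illustration; plain lists stand in for per-device buffers). Each of n devices starts with its own vector and, after 2*(n-1) neighbour-to-neighbour steps, ends up with the elementwise sum of all of them -- the same communication pattern NCCL-style libraries use, minus the actual network:

```python
def ring_allreduce(buffers):
    """Simulate ring all-reduce over n 'devices'; every device ends up
    holding the elementwise sum of all input buffers."""
    n = len(buffers)
    m = len(buffers[0]) // n  # chunk size; assume length divisible by n
    # chunks[d][c] is device d's local copy of chunk c
    chunks = [[list(b[c * m:(c + 1) * m]) for c in range(n)] for b in buffers]

    # Phase 1, reduce-scatter: at step s, device d sends chunk (d - s) % n
    # to its ring neighbour, which accumulates it into its own copy.
    for s in range(n - 1):
        sends = [(d, (d - s) % n, list(chunks[d][(d - s) % n])) for d in range(n)]
        for d, c, payload in sends:
            dst = (d + 1) % n
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], payload)]

    # After phase 1, device d holds the fully reduced chunk (d + 1) % n.
    # Phase 2, all-gather: pass the reduced chunks once more around the ring.
    for s in range(n - 1):
        sends = [(d, (d + 1 - s) % n, list(chunks[d][(d + 1 - s) % n])) for d in range(n)]
        for d, c, payload in sends:
            chunks[(d + 1) % n][c] = payload

    return [[x for c in range(n) for x in dev[c]] for dev in chunks]
```

Each step moves only 1/n of the data per device, which is why the ring variant is bandwidth-optimal regardless of which vendor's interconnect carries it.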

hackrmn|6 months ago

I grew up learning programming on a genuine IBM PC running MS-DOS, neither of which was FOSS but taught me plenty that I routinely rely on today in one form or another.

Philpax|6 months ago

There's more in common with other GPU architectures than there are differences, so a CUDA consultant should be able to pivot if/when the other players are a going concern. It's more about the mindset than the specifics.

qwertox|6 months ago

It's a valid point of view, but I don't see the value in sharing it.

There are enough people for whom it's worth it, even if just for tinkering, and I'm sure you are aware of that.

It reads a bit like "You shouldn't use it because..."

Learning about Nvidia GPUs will teach you a lot about other GPUs as well, and there are a lot of tutorials about the former, so why not use it if it interests you?

pornel|6 months ago

There are two CUDAs – a hardware architecture, and a software stack for it.

The software is proprietary, and easy to ignore if you don't plan to write low-level optimizations for NVIDIA.

However, the hardware architecture is worth knowing. All GPUs work roughly the same way (especially on the compute side), and the CUDA architecture is still fundamentally the same as it was in 2007 (just with more of everything).

It dictates how shader languages and GPU abstractions work, regardless of whether you're using proprietary or open implementations. It's very helpful to understand peculiarities of thread scheduling, warps, different levels of private/shared memory, etc. There's a ridiculous amount of computing power available if you can make your algorithms fit the execution model.
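
As a concrete toy of "making your algorithm fit the execution model", here is a tiled matrix multiply in plain Python. On a real GPU each thread block would stage a TILE x TILE sub-block of the inputs in shared memory and reuse it TILE times before fetching the next one; the loops below only illustrate the access pattern (TILE and the function name are made up for illustration):

```python
TILE = 2  # on a GPU this would be sized to fit a block's shared memory

def matmul_tiled(a, b):
    """Square n x n matmul, n % TILE == 0, computed tile by tile."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, TILE):          # one "thread block" per output tile
        for j0 in range(0, n, TILE):
            for k0 in range(0, n, TILE):  # stage a tile of A and B, then reuse it
                for i in range(i0, i0 + TILE):
                    for j in range(j0, j0 + TILE):
                        for k in range(k0, k0 + TILE):
                            c[i][j] += a[i][k] * b[k][j]
    return c
```

The arithmetic is identical to the naive triple loop; only the traversal order changes, so that each staged tile is read many times from fast memory instead of once per use from HBM.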

deltaburnt|6 months ago

This is a JAX article, a parallel computation library that's meant to abstract away vendor specific details. Obviously if you want the most performance you need to know specifics of your hardware, but learning the high level of how a GPU vs TPU works seems like useful knowledge regardless.

bee_rider|6 months ago

I think I’d rather get familiar with cupy or Jax or something. Blas/lapack wrappers will never go out of style. It is a subset of the sort of stuff you can do on a GPU but it seems like a nice effort:functionality reward ratio.

moralestapia|6 months ago

It's money. You would do it for money.

augment_me|6 months ago

You can write software for the hardware in a cross-compiled language like Triton. The hardware reality stays the same, a company like Cerebras might have the superior architecture, but you have server rooms filled with H100, A100, and MI300s whether you believe in the hardware or not.

WithinReason|6 months ago

What's in this article would apply to most other hardware, just with slightly different constants

j45|6 months ago

Nvidia also trotted along with a low share price for a long time financing and supporting what they believed in.

When cuda rose to prominence were there any viable alternatives?

rvz|6 months ago

> I find it very hard to justify investing time into learning something that's neither open source nor has multiple interchangeable vendors.

Better not learn CUDA then.

amelius|6 months ago

I mean it is similar to investing time in learning assembly language.

For most IT folks it doesn't make much sense.

nickysielicki|6 months ago

The calculation under “Quiz 2: GPU nodes“ is incorrect, to the best of my knowledge. There aren’t enough ports for each GPU and/or for each switch (less the crossbar connections) to fully realize the 450GB/s that’s theoretically possible, which is why 3.2TB/s of internode bandwidth is what’s offered on all of the major cloud providers and the reference systems. If it was 3.6TB/s, this would produce internode bottlenecks in any distributed ring workload.

Shamelessly: I’m open to work if anyone is hiring.

aschleck|6 months ago

It's been a while since I thought about this, but isn't the reason providers advertise only 3.2 Tbps because that's the limit of a single node's connection to the IB network? DGX is spec'ed to pair each H100 with a ConnectX-7 NIC and those cap out at 400 Gbps. 8 GPUs * 400 Gbps/GPU = 3.2 Tbps.

Quiz 2 is confusingly worded but is, iiuc, referring to intranode GPU connections rather than internode networking.
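
The back-of-envelope number in this comment can be checked directly (the figures below are the nominal specs quoted in the thread, not measurements):

```python
gpus_per_node = 8          # H100s per DGX node, as discussed above
nic_gbit_s = 400           # one ConnectX-7 NIC per GPU, as quoted

node_gbit_s = gpus_per_node * nic_gbit_s
print(node_gbit_s / 1000, "Tbit/s")  # 3.2 Tbit/s -- the advertised node figure
print(node_gbit_s / 8, "GB/s")       # the same number in byte units: 400 GB/s
```

Note the easy units trap in this subthread: 3.2 Tbit/s of network bandwidth is 400 GB/s, an order of magnitude below the per-GPU 450 GB/s NVLink figure, which is intranode.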

einpoklum|6 months ago

We should remember that these structural diagrams are _not_ necessarily what NVIDIA actually has as hardware. They carefully avoid guaranteeing that any of the entities or blocks you see in the diagrams actually _exist_. It is still just a mental model NVIDIA offers for us to think about their GPUs, and more specifically the SMs, rather than a simplified circuit layout.

For example, we don't know how many actual functional units an SM has; we don't know if the "tensor core" even _exists_ as a piece of hardware, or whether there's just some kind of orchestration of other functional units; and IIRC we don't know what exactly happens at the sub-warp level w.r.t. issuing and such.

KeplerBoy|6 months ago

Interesting perspective. Aren't SMs basically blocked while running tensor core operations, which might hint that it's the same FPUs doing the work after all?

gregorygoc|6 months ago

It’s mind-boggling that this resource has not been provided by NVIDIA yet. It has reached the point where third parties reverse-engineer and summarize NV hardware until it becomes an actually useful mental model.

What are the actual incentives at NVIDIA? If it’s all about marketing they’re doing great, but I have some doubts about engineering culture.

robbies|6 months ago

As a real time rendering engineer, this is how it’s always been. NV obfuscates much of the info to prevent competitors from understanding changes between generations. Other vendors aren’t great at this either.

In games, you can get NDA disclosures about architectural details that are closer to those docs. But I’ve never really seen any vendor (besides Intel) disclose this stuff publicly

threeducks|6 months ago

With mediocre documentation, NVIDIA's closed-source libraries, such as cuBLAS and cuDNN, will remain the fastest way to perform certain tasks, thereby strengthening vendor lock-in. And of course it makes it more difficult for other companies to reverse engineer.

hackrmn|6 months ago

Plenty of circumstantial evidence points to the fact that NVIDIA prefers to hand out semi-tailored documentation resources to signatories and other "VIPs", not least to exert control over who uses their products and how. I wouldn't put it past them to routinely neglect their _public_ documentation, for one reason or another that makes commercial sense to them but not the public. As for incentives, go figure indeed -- you'd think that by walling off API documentation they're shooting themselves in the foot every day, but in these days of betting it all on AI, which means selling GPUs, software and those same NDA-signed VIP documentation articles to "partners", maybe they're all set anyway and care even less for the odd developer who wants to know how their flagship GPU works.

KeplerBoy|6 months ago

Nvidia has ridiculously good documentation for all of this compared to its competitors.

dahart|6 months ago

What makes you think that? It appears most of this material came straight out of NVIDIA documentation. What do you think is missing? I just checked and found the H100 diagram for example is copied (without being correctly attributed) from the H100 whitepaper: https://resources.nvidia.com/en-us-hopper-architecture/nvidi...

Much of the info on compute and bandwidth is from that and other architecture whitepapers, as well as the CUDA C++ programming guide, which covers a lot of what this article shares, in particular chapters 5, 6, and 7. https://docs.nvidia.com/cuda/cuda-c-programming-guide/

There’s plenty of value in third parties distilling and writing short-form versions, and in writing their own takes on this, but this article wouldn’t have been possible without NVIDIA’s docs, so the speculation, FUD and shade are perhaps unjustified.

gchadwick|6 months ago

This whole series is fantastic! It does an excellent job of explaining the theoretical limits on running modern AI workloads, and explains the architecture and techniques (in particular, methods of parallelism) you can use.

Yes, it's all TPU-focused (other than this most recent part), but a lot of what it discusses are general principles you can apply elsewhere (or it's easy enough to see how you could generalise them).

tucnak|6 months ago

This post is a great illustration of why TPUs lend themselves more nicely to homogeneous computing: yes, there are systolic-array limitations (not good for sparsity) but, all things considered, bandwidth doesn't change as your cluster grows ever larger. It's a shame Google is not interested in selling this hardware: if TPUs were available, it would open the door to compute-in-network capabilities far beyond what's currently available, by combining non-homogeneous topologies involving various FPGA solutions, e.g. with the Alveo V80 exposing 4x800G NICs.

Also: it's a shame Google doesn't talk about how they use TPUs outside of LLMs.

namibj|6 months ago

Do TPUs allow having a variable array dimension at somewhat inner nesting level of the loop structure yet? Like, where you load expensive (bandwidth-heavy) data in from HBM, process a variable-length array with this, then stow away/accumulate into a fixed-size vector?

Last I looked they would require the host to synthesize a suitable instruction stream for this on-the-fly with no existing tooling to do so efficiently.

An example where this would be relevant is the LLM inference prefill stage with an (activated) MoE expert count on the order of — or a small integer factor smaller than — the prompt length, where you'd want to load only the needed experts and load each one at most once per layer.

pbrumm|6 months ago

If you have optimized your math-heavy code, it is already in a typed language, and you need it to be faster, then you think about the GPU options.

In my experience you can roughly get an 8x speed improvement.

Turning a 4-second web response into half a second can be game-changing. But it is a lot easier to use a web socket and put up a spinner, or cache the result in the background.

Running a GPU in the cloud is expensive.

aktuel|6 months ago

What is the "Use completions" toggle supposed to do? If I enable it I just get empty responses.

physicsguy|6 months ago

It’s interesting that nvshmem has taken off in ML because the MPI equivalents were never that satisfactory in the simulation world.

Mind you, I did all long range force stuff which is difficult to work with over multiple nodes at the best of times.

ngcc_hk|6 months ago

This is part 12 … the title seems to hint at how one should think about GPUs today … e.g. why LLMs came about. Instead it is a comparison with TPUs? And then I note the "part 12" … not sure what one should expect, jumping into the middle of a whole series … well, may stop and move on.

aanet|6 months ago

Fantastic resource! Thanks for posting it here.

radarsat1|6 months ago

Why haven't Nvidia developed a TPU yet?

dist-epoch|6 months ago

This article suggests they sort of did: 90% of the flops is in matrix multiplication units.

They leave some performance on the table, but they gain flexible compilers.

Philpax|6 months ago

They don't need to. Their hardware and programming model are already dominant, and TPUs are harder to program for.

HarHarVeryFunny|6 months ago

Meaning what? Something less flexible? Fewer CUDA Cores and more Tensor Cores?

The majority of NVidia's profits (almost 90%) do come from data center, most of which is going to be neural net acceleration, and I'd have to assume that they have optimized their data center products to maximize performance for typical customer workloads.

I'm sure that Microsoft would provide feedback to Nvidia if they felt changes were needed to better compete with Google in the cloud compute market.

porridgeraisin|6 months ago

A short addition: pre-Volta NVIDIA GPUs were SIMD, like TPUs are, and not SIMT, which post-Volta NVIDIA GPUs are.

camel-cdr|6 months ago

SIMT is just a programming model for SIMD.

Modern GPUs still are just SIMD with good predication support at ISA level.
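
To make the "SIMT is a programming model for SIMD" point concrete, here is a tiny pure-Python sketch (names are illustrative) of how a divergent branch is executed with predication: every lane computes both sides of the branch, and a per-lane mask selects which result survives, which is what a predicated select instruction does at the ISA level:

```python
def simt_branch(lanes):
    """Source-level SIMT view of:  if x < 0: y = -x  else: y = 2 * x
    executed SIMD-style across all lanes of a 'warp' at once."""
    mask = [x < 0 for x in lanes]       # per-lane predicate
    then_val = [-x for x in lanes]      # every lane runs the "then" side...
    else_val = [2 * x for x in lanes]   # ...and the "else" side
    # predicated select: the mask decides which result each lane keeps
    return [t if m else e for m, t, e in zip(mask, then_val, else_val)]
```

This also shows why warp divergence costs you: both sides execute for all lanes, so a fully divergent branch does double the work.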

akshaydatazip|6 months ago

Thanks for the really thorough research on that. Just what I wanted for my morning coffee.

boxerab|6 months ago

"How to Think About NVIDIA GPUS" is a better title

tomhow|6 months ago

Discussion of original series:

How to scale your model: A systems view of LLMs on TPUs - https://news.ycombinator.com/item?id=42936910 - Feb 2025 (30 comments)

radarsat1|6 months ago

A comment from there:

> There are plans to release a PDF version; need to fix some formatting issues + convert the animated diagrams into static images.

I don't see anything on the page about it, has there been an update on this? I'd love to put this on my e-reader.

business_liveit|6 months ago

So, why hasn't Nvidia developed a TPU yet?

cwmoore|6 months ago

Probably proprietary. Go GOOG. I like how bad your comment is.