top | item 40337936

GPUs Go Brrr

1104 points| nmstoker | 1 year ago |hazyresearch.stanford.edu | reply

263 comments

[+] Animats|1 year ago|reply
"And we ask: if your matrix multiply is smaller than 16x16, are you sure what you’re doing is AI?

From a philosophical point of view, we think a frame shift is in order. A “register” certainly shouldn’t be a 32-bit word like on the CPUs of old. And a 1024-bit wide vector register, as CUDA uses, is certainly a step in the right direction. But to us a “register” is a 16x16 tile of data. We think AI wants this."

The hardware needs of AI are starting to focus. GPUs, after all, were designed for an entirely different job. They're used for AI because they have good matrix multiply hardware. "AI GPUs" get to leave out some of the stuff in a real GPU (does an H100 even have texture fill units?). Then there's a trend towards much shorter numbers. 16 bit floating point? 8 bit? 2 bit? 1 bit? That will settle out at some point. This paper indicates that hardware that likes 16x16 tiles makes a lot of sense. It's certainly possible to build such hardware. Someone reading this is probably writing it in VHDL right now, or will be soon.

Then we'll see somewhat simpler, less general, and cheaper devices that do "AI" operations with as little excess hardware baggage as possible. Nice.
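To make the "register is a 16x16 tile" frame shift concrete, here's a toy NumPy sketch (mine, not from the article) of a matmul that only ever touches data one 16x16 tile at a time, the way tensor-core-style hardware consumes it:

```python
import numpy as np

TILE = 16  # the tile size the article argues should be the "register"

def tiled_matmul(A, B):
    """Compute A @ B by accumulating products of 16x16 tiles.

    Each inner step is one 16x16 x 16x16 multiply-accumulate,
    i.e. the primitive a tile-oriented ALU would implement.
    Shapes are assumed to be multiples of TILE for simplicity.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % TILE == 0 and K % TILE == 0 and N % TILE == 0
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, TILE):
        for j in range(0, N, TILE):
            acc = np.zeros((TILE, TILE), dtype=A.dtype)  # the "tile register"
            for k in range(0, K, TILE):
                acc += A[i:i+TILE, k:k+TILE] @ B[k:k+TILE, j:j+TILE]
            C[i:i+TILE, j:j+TILE] = acc
    return C
```

The point being: if your problem can't be phrased as a loop over 16x16 tile products like this, the hardware in question isn't going to help you much.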

[+] bcatanzaro|1 year ago|reply
GPUs have evolved to be AI machines with as little baggage as possible. People have been arguing GPUs were old technology and therefore unsuited for AI since at least 2014 (when Nervana was founded), but what they perhaps didn’t expect is that the GPU would evolve so quickly to be an AI machine.
[+] dvt|1 year ago|reply
> Then we'll see somewhat simpler, less general, and cheaper devices that do "AI" operations with as little excess hardware baggage as possible. Nice.

Apple has already been doing this for a few years now. The NPU is totally different from the GPU or CPU on the die itself[1]. Nvidia is likely working on this as well, but I think a device that's a gaming/entertainment/crypto/AI bundle (i.e. sticking with the video card) is probably a better business move.

[1] https://github.com/hollance/neural-engine/blob/master/docs/a...

[+] choppaface|1 year ago|reply
“NVIDIA’s lies…

On kernels such as flash attention, TMA and the L2 cache are both fast enough to hide these problems reasonably well. But to make full use of the hardware, memory requests must be coalesced and bank conflicts avoided”

The depth of the competition is also starting to become apparent. There’s no way the documentation error was totally an accident. Diagrams are the easiest to steal / copy and there must have been some utility for nvidia to have left this in place. Remember when Naveen Rao’s Nervana was writing NVidia Maxwell drivers that out-performed NVidia’s own? Not every documentation mishap in a high-growth product is a competition counter-measure, but given that the researchers spent so long reverse-engineering wgmma and given the China-US political situation of the H100 in particular, it seems NVidia is up to its old tricks to protect its moat.

So don’t over-study the H100 peculiarities, as “what hardware does AI want?” really encompasses the commercial situation as well.

[+] jiveturkey|1 year ago|reply
hasn't google been building such devices for a decade now?
[+] mvkel|1 year ago|reply
Would you say this is ultimately "ASICs for AI"?
[+] WanderPanda|1 year ago|reply
Wait, but Nvidia tensor cores are exactly the hardware that likes 16x16 tiles, no? I thought that was the whole point? The hardware is already here, and I'm sceptical that there is another order of magnitude in performance to be gained from even more specialized designs.
[+] muyuu|1 year ago|reply
it's going to be awkward in consumer hardware either way

if you segregate AI units from the GPU, the thing is that both the AI units and the GPU will continue to need massive amounts of matrix multiplication and as little memory latency as possible

the move to have more of it wrapped in the GPU makes sense but at least in the short and medium term, most devices won't be able to justify the gargantuan silicon wafer space/die growth that this would entail - also currently Nvidia's tech is ahead and they don't make state of the art x86 or ARM CPUs

for the time being I think the current paradigm makes the most sense, with small compute devices making inroads in the consumer markets as non-generalist computers - note that more AI-oriented pseudo-GPUs already exist and are successful since the earlier Nvidia Tesla lineup and then the so-called "Nvidia Data Center GPUs"

[+] UncleOxidant|1 year ago|reply
> Then there's a trend towards much shorter numbers. 16 bit floating point? 8 bit? 2 bit? 1 bit?

There was that recent paper titled "The Era of 1-bit LLMs" [0], which actually suggests a 1.58-bit LLM (2 bits in practice).

> Someone reading this is probably writing it in VHDL right now, or will be soon.

Yeah, I think I'm in the "will be soon" camp - FPGA board has been ordered. Especially with the 2-bit data types outlined in that paper [0] and detailed further in [1], there's really a need for custom hardware to do that 2-bit math efficiently. Customizing one of the simpler open-source RISC-V integer implementations seems like something to try here: add tiled matrix registers and custom instructions for operating on them with the 2-bit data types.

[0] https://arxiv.org/abs/2402.17764 [1] https://github.com/microsoft/unilm/blob/master/bitnet/The-Er...
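For anyone who hasn't read the paper: the reason 1.58-bit weights are such a good fit for custom hardware is that with weights in {-1, 0, +1} there's no multiplier at all, just add/subtract/skip. A rough NumPy sketch of the idea (my simplification of BitNet-style quantization, not the paper's exact recipe):

```python
import numpy as np

def quantize_ternary(W):
    """Round weights to {-1, 0, +1} with a single per-matrix scale,
    roughly following the absmean scheme described in the paper."""
    scale = np.mean(np.abs(W)) + 1e-8
    Wq = np.clip(np.round(W / scale), -1, 1).astype(np.int8)
    return Wq, scale

def ternary_matvec(Wq, scale, x):
    """Matrix-vector product with ternary weights.

    Note there is no multiplication in the inner loop: +1 weights
    add the activation, -1 weights subtract it, 0 weights skip it.
    That's the whole operation a custom ALU would need.
    """
    y = np.empty(Wq.shape[0], dtype=np.float64)
    for i, row in enumerate(Wq):
        y[i] = x[row == 1].sum() - x[row == -1].sum()
    return y * scale
```

In silicon that inner loop collapses to popcount-style adder trees, which is why the area/power story looks so different from FP16 multiply-accumulate.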

[+] renonce|1 year ago|reply
> NVIDIA’s lies. This is an extraordinarily misleading representation of the actual 128b swizzled wgmma layout. This diagram cost us three weeks of life that we will not get back, hence the public shaming.

Wondering if anyone would be surprised that a huge amount of progress in AI is on the engineering side (optimizing matmuls), and that a huge portion of that engineering is about reverse engineering NVIDIA chips.

[+] DeathArrow|1 year ago|reply
Architecture doesn't make a difference. Big enough models trained with big enough data tend to give the same results regardless of architecture. So yes, most advances in AI are mostly due to the fact we can now multiply matrices very fast.
[+] panki27|1 year ago|reply
Warp scheduler, 4 quadrants, tensor memory accelerator, unswizzled wgmma layouts...

The line between GPU lingo and Star Trek technobabble fades away further and further.

[+] araes|1 year ago|reply
I had some awareness of that while reading the article, yet "we're warping through the quadrant in our tensor accelerator" is pretty Trek.

Have had that thought occasionally with some of the other articles. What it must read like to somebody who gets a ref link for an article over here. Wandered into some Trek nerd convention discussing warp cores.

[+] Agentlien|1 year ago|reply
Your comment prompted me to take a step back and look at these terms with new eyes. That made me smile, because you're so right.
[+] winternewt|1 year ago|reply
I believe that reducing the power consumption and increasing the speed of AI inference will be best served by switching to analog, approximate circuits. We don't need perfect floating-point multiplication and addition; we just need something that takes two input voltages and produces an output voltage that is close enough to what multiplying the input voltages would yield.
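You can get a feel for how much per-multiply sloppiness a dot product tolerates with a quick simulation (a toy model I made up, where every analog multiply picks up zero-mean relative error):

```python
import numpy as np

def noisy_dot(x, w, rel_noise, rng):
    """Dot product with 'analog' multiplies: each product is perturbed
    by zero-mean Gaussian error proportional to its own magnitude."""
    products = x * w
    products = products * (1.0 + rng.normal(scale=rel_noise, size=products.shape))
    return products.sum()
```

Because the per-term errors are independent, they partially cancel in the sum, so the relative error of the whole dot product grows much more slowly than the per-multiply error; that's one reason approximate hardware is plausible for inference.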
[+] perfmode|1 year ago|reply
This article rekindles the joy I experienced during CS 149 Parallel Programming.
[+] Aaryan44|1 year ago|reply
Kayvon and Kunle are amazing - I took CS149 Parallel Programming two quarters ago and loved it :)
[+] figbert|1 year ago|reply
Appreciate the recommendation, will check out the course!
[+] latchkey|1 year ago|reply
Really impressed by the writing style of this post and very much looking forward to this on AMD MI300x. Let me know if you want some time on mine.
[+] diginova|1 year ago|reply
What should I do if I want to understand such articles in full? Where should I start on the roadmap?
[+] kolinko|1 year ago|reply
This is a good course on GPU programming. Around lesson 4.0 you'll get the required basics: https://youtube.com/playlist?list=PLzn6LN6WhlN06hIOA_ge6Srgd...

Also, write your own cuda kernel to do vector-matrix multiplication (if you use pycuda, you can focus on the kernel, and write everything else with python). Just tell chatgpt that you want to write your own implementation that multiplies a 4000-element vector by 4000x12000 matrix, and to guide you through the whole process.

For renting gpus, runpods is great - right now they have everything from lower tier gpus to h100s. You can start with a lesser gpu at the beginning.
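One tip for that exercise: before touching pycuda, write a plain CPU reference so you have something trustworthy to check the kernel's output against (shapes here are small stand-ins for the 4000 x 12000 case; the kernel itself is left to you):

```python
import numpy as np

def matvec_reference(x, W):
    """CPU reference for the exercise: y[j] = sum_i x[i] * W[i, j].

    Written as an explicit loop over rows so each step mirrors one
    k-iteration of the kernel you'll write, rather than hiding
    everything inside a single BLAS call.
    """
    n, m = W.shape
    assert x.shape == (n,)
    y = np.zeros(m, dtype=np.float64)
    for i in range(n):
        y += x[i] * W[i]  # one input element's contribution to every output
    return y
```

Then validating the GPU result is just `np.allclose(gpu_result, matvec_reference(x, W))`, which catches indexing bugs immediately.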

[+] joaquincabezas|1 year ago|reply
wow their graphs at the GitHub README (https://github.com/HazyResearch/ThunderKittens/blob/main/att...) make me extremely dizzy. Are these wavy bars even legal? :P
[+] bogtog|1 year ago|reply
I second this. It's like they're trying to incorporate some optical illusion. I'd even prefer just seeing numbers without any bars
[+] lucidrains|1 year ago|reply
would be interested to see thunderkittens (great name!) tackle the flash attention backwards pass, which is an order of magnitude harder than the forward
[+] imiric|1 year ago|reply
Hasn't this research been done by teams building NPUs today? E.g. chips built by Groq use an architecture built specifically for AI, which is why they're able to deliver the performance they do. On the consumer side, Apple silicon is also quite capable.

I'm not in this field at all, but it seems to me that using general purpose processors that communicate over (relatively) slow lanes can only get us so far. Rethinking the design at the hardware level, and eventually bringing the price down for the consumer market seems like a better long-term strategy.

[+] phinnaeus|1 year ago|reply
FYI the caption of the "spirit animals" image says "canadian goose" instead of "Canada Goose".
[+] adzm|1 year ago|reply
Likely a regional thing; they are consistently called Canadian Geese where I grew up and where I currently live.
[+] silisili|1 year ago|reply
I've only heard people in my entire lifetime call them Canadian Geese.

The only time I've ever even seen or heard of Canada Goose/Geese are people on the internet telling others they are wrong.

I think it's time to just accept it as correct.

[+] xarope|1 year ago|reply
I am missing the reference to the canadian goose and the retriever puppy as spirit animals. Is that to say the H100 is an ornery thing, but the RTX4090 is friendly?
[+] downrightmike|1 year ago|reply
Don't worry, the Geese are en route to location, resolution incoming. Stand by.
[+] bombcar|1 year ago|reply
It’s a Canada Goose from Canada. A Canadian Canada Goose, or Canadian Goose.
[+] wglb|1 year ago|reply
An error too often made.
[+] fastball|1 year ago|reply
Canadian goose seems better in [current year], to avoid confusion with the clothing brand.
[+] adrian_b|1 year ago|reply
I consider the English habit of using nouns as adjectives a bad one, because it causes many ambiguities, some of which can be very annoying, even if they are a rich source of jokes and wordplay.

In most languages the use of a noun as an adjective is marked, by a particle or by an affix or at least by a different stress pattern (like moving the stress to the last syllable), which removes the ambiguities.

So for most non-native speakers "Canadian goose" makes much more sense than "Canada goose" (which may feel like "Canada and a goose" or "a goose that is also Canada" and not like "a goose from Canada").

[+] weinzierl|1 year ago|reply
"For this post, we’re going to focus on the NVIDIA H100 [... because] we think the trends it implies are going to continue in future generations, and probably from other manufacturers, too."

Is it though? Wouldn't we expect to see more advanced packaging technology eventually?

If that happens, the increased memory bandwidth could be an enabler for a unified memory architecture like in the Nvidia Jetson line. In turn, that would make moot a lot of what the article says makes the GPU go brrr today.

[+] chefandy|1 year ago|reply
One of my biggest struggles in doing AI stuff on consumer hardware is heat. I noticed zero discussion of this, so I assume it's an implementation detail on small systems that doesn't really factor into more robust setups. Is that really the case, or is this just diving into the comp-sci layer of hardware utilization and ignoring things like heat because it's not salient to this subtopic?
[+] nostrebored|1 year ago|reply
It factors into robust setups but is part and parcel of doing any HPC where you're pushing through a ton of TFLOPS. It's a problem that is assumed to have been solved when you're doing this kind of work.
[+] hi-v-rocknroll|1 year ago|reply
NVIDIA's stock will plummet in 3-4 years after Microsoft and Meta stop spending tens of billions without having a specific use for H100s and end up with a ridiculous amount of excess capacity. Hopefully, that means some H100-based systems will end up on eBay in ~5-8 years for home lab use.
[+] _spl|1 year ago|reply
It reminds me of when I first read about superscalar CPU architecture and was amazed. GPUs are really next level.
[+] wmab|1 year ago|reply
The amount of comma splicing, (parentheses for extra points) -- and em dashes for good measure! that this post has makes it entirely unreadable.