
throwawaybbq1 | 2 years ago

Very insightful! A 175B-parameter model with 2 bytes per weight and, say, 2 bytes per gradient (not sure if half-precision gradients make sense?) comes in at 700GB, which is beyond even a single beefy 8x80GB machine (640GB). I recall reading that with tech such as RDMA you can communicate really fast between machines .. I assume if you add a switch in there, you are toast (from a latency perspective). Perhaps using two such beefy machines in a pair would do the trick .. after all .. model weights aren't the only thing that needs to be on the GPU.

I saw a reference that said GPT-3, with 96 decoder layers, was trained on a 400-GPU cluster, so that seems like the ballpark for a 175B-parameter model. That's 50 of the hypothetical machines we talked about (well .. really 100 for GPT-3, since back in those days the max was 40 or 48GB per GPU).
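The arithmetic in the two paragraphs above can be sketched in a few lines (a back-of-envelope only: helper names are mine, optimizer state and activations are deliberately excluded, and 1 GB is taken as 1e9 bytes):

```python
import math

PARAMS = 175e9  # GPT-3 scale

def train_memory_gb(params, bytes_per_weight=2, bytes_per_grad=2):
    """Memory for weights + gradients only (no optimizer state, no activations)."""
    return params * (bytes_per_weight + bytes_per_grad) / 1e9

def machines_needed(total_gb, gpus_per_machine=8, gb_per_gpu=80):
    """Ceiling of total memory over one machine's aggregate GPU memory."""
    return math.ceil(total_gb / (gpus_per_machine * gb_per_gpu))

print(train_memory_gb(PARAMS))   # -> 700.0 (GB), as in the comment
print(machines_needed(700))      # -> 2, i.e. a pair of 8x80GB machines
```

The same helper also reproduces the cluster figure: 400 GPUs at 8 per machine is 50 machines, matching the estimate above.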

I also wonder why NVIDIA (or Cerebras) isn't beefing up GPU memory. If someone sold a 1TB GPU, they could charge $100k easy. As I understood it, NVIDIA's GPU memory is just stacked HBM (HBM2e on the A100) .. so they'd make a profit?


spi | 2 years ago

Looking here: https://huggingface.co/docs/transformers/perf_train_gpu_one#... it looks like the most standard optimizer (AdamW) uses a whopping 18 bytes per parameter during training. Using bf16 should reduce that somewhat, but it wasn't really considered in that section; I'm not sure if that part of the guide is a bit outdated (before the A10 / A100 this wasn't an option) or if it still has some instability issues ("normal" float16 can't easily be used for training because, propagating gradients through hundreds of layers, values underflow to 0 or overflow to infinity, which kills your learning). You can switch to different optimizers (Adafactor) and modify a few other things, but that typically comes at the cost of lower accuracy, slower training, or both.
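For reference, the ~18 bytes/parameter figure decomposes roughly as follows (breakdown follows the HuggingFace guide linked above; the function name is my own, and the exact gradient precision can vary by setup):

```python
def adamw_mixed_precision_bytes_per_param():
    """Approximate per-parameter memory for mixed-precision AdamW training."""
    fp16_weights = 2         # half-precision copy used in forward/backward
    fp32_master_weights = 4  # full-precision master copy for the update step
    fp32_gradients = 4       # gradients typically kept/accumulated in fp32
    adamw_states = 8         # two fp32 moments (momentum + variance)
    return fp16_weights + fp32_master_weights + fp32_gradients + adamw_states

print(adamw_mixed_precision_bytes_per_param())  # -> 18
print(175e9 * 18 / 1e9)  # -> 3150.0 GB for a 175B model, before activations
```

So for training, the 700GB estimate upthread (weights + gradients only) is itself a large underestimate once optimizer state is included.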

For multiple GPUs there are quite a few ways to improve memory footprint and speed: https://huggingface.co/docs/transformers/perf_train_gpu_many Although I'm not sure the implementations in HuggingFace are really on par with the SOTA methods (they shouldn't be far off in any case). I'd guess they are at least on par with, if not better than, whatever OpenAI used for GPT-3 back then, given how quickly things evolve in this realm...
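The intuition behind the sharding techniques in that guide (ZeRO-style partitioning, as implemented in e.g. DeepSpeed) is that state which plain data parallelism would replicate on every GPU is instead split across them. A rough sketch, with my own illustrative numbers and function name (activations and communication buffers excluded):

```python
def per_gpu_state_gb(params, bytes_per_param=18, n_gpus=1):
    """Weights + gradients + optimizer state, evenly sharded across n_gpus."""
    return params * bytes_per_param / n_gpus / 1e9

print(per_gpu_state_gb(175e9, n_gpus=1))    # -> 3150.0 GB: impossible on one GPU
print(per_gpu_state_gb(175e9, n_gpus=400))  # -> 7.875 GB/GPU across a 400-GPU cluster
```

That is why a 400-GPU cluster of 40-80GB cards is plausible for a 175B model even though no single card comes close to holding the training state.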

On the last point, I can only assume there are some hard thresholds which are difficult to overcome in order to add more memory, otherwise they would. An 80GB GPU was something unthinkable a dozen years ago; before the deep learning explosion, around 2GB was the norm. A couple of years ago, when 16GB or 32GB was the best you'd get from Nvidia, AMD did come out with consumer-grade GPUs having significantly larger memory (maybe 48GB back then? I can't remember), which could have stirred the market a bit I guess, but it didn't pick up for deep learning (I suspect mostly due to the lack of an equivalent to cuDNN / CUDA, which make it possible to "easily" build deep learning frameworks on top of the GPUs).

My take on this is: if a competitor fighting hard to regain market share bets big on offering more memory, and the best it comes up with is still just a couple of times more than what the others have, then it must not be as easy as "let's stick another bank of memory on here and sell it", or they would have...?

startupsfail | 2 years ago

GPU memory is also useful for loading large, detailed scenes for rendering (.usd). It is a bit surprising that 80GB is the limit: it has been obvious for years that GPU compute is ahead of GPU memory size by 10x-100x, and loading larger models and scenes into memory was always a struggle. This must be a hardware or yield issue.