ml_hardware|4 years ago
1) NVIDIA will likely release a variant of H100 with 2x memory, so we may not even have to wait a generation. They did this for V100-16GB/32GB and A100-40GB/80GB.
2) In a generation or two, the SOTA model architecture will change, so it will be hard to predict the memory requirements... even today, for a fixed train+inference budget, it is much better to train Mixture-of-Experts (MoE) models, and even NVIDIA advertises MoE models on their H100 page.
MoEs are more efficient in compute, but occupy a lot more memory at runtime. To run an MoE with GPT3-like quality, you probably need to occupy a full 8xH100 box, or even several boxes. So your min-inference-hardware has gone up, but your efficiency will be much better (much higher queries/sec than GPT3 on the same system).
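To make the compute-vs-memory trade-off concrete, here is a minimal top-1-routed MoE layer sketch in numpy (toy sizes, random weights, all names are illustrative). The point is that the parameter count, and therefore the resident memory, grows linearly with the number of experts, while each token still pays for only one expert's matmul:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, n_tokens = 64, 8, 16

# One weight matrix per expert: total parameters grow linearly with
# n_experts, but each token is routed to (and multiplied by) only one.
experts = [rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
           for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) / np.sqrt(d_model)

x = rng.standard_normal((n_tokens, d_model))

# Top-1 routing: each token goes to its highest-scoring expert.
choice = (x @ router).argmax(axis=1)

y = np.empty_like(x)
for e in range(n_experts):
    mask = choice == e
    y[mask] = x[mask] @ experts[e]          # one matmul per token, not n_experts

params_moe = n_experts * d_model * d_model  # memory cost: all experts resident
flops_per_token = 2 * d_model * d_model     # compute cost: same as one dense layer
print(params_moe, flops_per_token)
```

So an 8-expert layer here holds 8x the parameters of a dense layer of the same width, at roughly the per-token compute of one dense layer, which is exactly why the min-inference-hardware footprint goes up even as queries/sec improve.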
So it's complicated!
TOMDM|4 years ago
I really do wonder how much more you could squeeze out of a full pod of gen2 H100s. Obviously the model size would be ludicrous, but how far are we into the realm of diminishing returns?
Your point about MoE architectures certainly sounds like the more _useful_ deployment, but the research seems to be pushing towards ludicrously large models.
You seem to know a fair amount about the field, is there anything you'd suggest if I wanted to read more into the subject?
ml_hardware|4 years ago
A pod of gen2-H100s might have 256 GPUs with 40 TB of total memory, and could easily run a 10T param model. So I think we are far from diminishing returns on the hardware side :) The model quality also continues to get better at scale.
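A quick back-of-the-envelope check of those pod numbers (the 160 GB/GPU figure is an assumption for a hypothetical doubled-memory "gen2" part, and fp16 weights at 2 bytes/param):

```python
# Hypothetical pod: 256 GPUs at 160 GB each (2x the 80 GB H100).
gpus = 256
mem_per_gpu_gb = 160
pod_mem_tb = gpus * mem_per_gpu_gb / 1000   # ~41 TB total

# Weight storage for a 10T-parameter model in fp16.
params = 10e12
bytes_per_param = 2
weights_tb = params * bytes_per_param / 1e12  # 20 TB

print(pod_mem_tb, weights_tb)
```

So the weights alone would fill only about half the pod's memory, leaving headroom for activations and serving state, consistent with "could easily run a 10T param model."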
Re. reading material, I would take a look at DeepSpeed’s blog posts (not affiliated btw). That team is super super good at hardware+software optimization for ML. See their post on MoE models here: https://www.microsoft.com/en-us/research/blog/deepspeed-adva...
algo_trader|4 years ago
Are these techniques for specific architectures or can they be made generic?
ml_hardware|4 years ago
See the paper here, Figure A28: https://kstatic.googleusercontent.com/files/b068c6c0e64d6f93...
But if your downstream task is simple, like sequence classification, then it may be possible to compress the model without losing much quality.
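A common baseline for that kind of compression is magnitude pruning: zero out the smallest-magnitude weights and keep only the largest ones. A minimal numpy sketch (toy classification head, illustrative names, not any particular library's API):

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of weights."""
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    # k-th smallest absolute value becomes the pruning threshold.
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) <= threshold, 0.0, w)

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 2))    # toy sequence-classification head
w_pruned = magnitude_prune(w, 0.9)   # keep only the largest ~10% of weights

print((w_pruned == 0).mean())
```

Real toolchains (like the TensorFlow Model Optimization pruning guide) do this gradually during fine-tuning rather than in one shot, which is what preserves quality on simple downstream tasks.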
algo_trader|4 years ago
https://www.tensorflow.org/model_optimization/guide/pruning
https://www.tensorflow.org/model_optimization/guide/pruning/...