ml_hardware|4 years ago
1) NVIDIA will likely release a variant of H100 with 2x memory, so we may not even have to wait a generation. They did this for V100-16GB/32GB and A100-40GB/80GB.
2) In a generation or two, the SOTA model architecture will change, so it will be hard to predict the memory requirements... even today, for a fixed train+inference budget, it is much better to train Mixture-of-Experts (MoE) models, and even NVIDIA advertises MoE models on their H100 page.
MoEs are more efficient in compute, but occupy a lot more memory at runtime. To run an MoE with GPT3-like quality, you probably need to occupy a full 8xH100 box, or even several boxes. So your min-inference-hardware has gone up, but your efficiency will be much better (much higher queries/sec than GPT3 on the same system).
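To make the compute-vs-memory trade-off concrete, here is a minimal top-1-routed MoE layer sketch in numpy (toy sizes, random weights, all names are illustrative). The point is that the parameter count, and therefore the resident memory, grows linearly with the number of experts, while each token still pays for only one expert's matmul:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, n_tokens = 64, 8, 16

# One weight matrix per expert: total parameters grow linearly with
# n_experts, but each token is routed to (and multiplied by) only one.
experts = [rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
           for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) / np.sqrt(d_model)

x = rng.standard_normal((n_tokens, d_model))

# Top-1 routing: each token goes to its highest-scoring expert.
choice = (x @ router).argmax(axis=1)

y = np.empty_like(x)
for e in range(n_experts):
    mask = choice == e
    y[mask] = x[mask] @ experts[e]          # one matmul per token, not n_experts

params_moe = n_experts * d_model * d_model  # memory cost: all experts resident
flops_per_token = 2 * d_model * d_model     # compute cost: same as one dense layer
print(params_moe, flops_per_token)
```

So an 8-expert layer here holds 8x the parameters of a dense layer of the same width, at roughly the per-token compute of one dense layer, which is exactly why the min-inference-hardware footprint goes up even as queries/sec improve.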
So it's complicated!
TOMDM|4 years ago
I really do wonder how much more you could squeeze out of a full pod of gen2 H100s. Obviously the model size would be ludicrous, but how far are we into the realm of diminishing returns?
Your point about MoE architectures certainly sounds like the more _useful_ deployment, but the research seems to be pushing towards ludicrously large models.
You seem to know a fair amount about the field, is there anything you'd suggest if I wanted to read more into the subject?
ml_hardware|4 years ago
A pod of gen2-H100s might have 256 GPUs with 40 TB of total memory, and could easily run a 10T param model. So I think we are far from diminishing returns on the hardware side :) The model quality also continues to get better at scale.
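A quick back-of-the-envelope check of those pod numbers (the 160 GB/GPU figure is an assumption for a hypothetical doubled-memory "gen2" part, and fp16 weights at 2 bytes/param):

```python
# Hypothetical pod: 256 GPUs at 160 GB each (2x the 80 GB H100).
gpus = 256
mem_per_gpu_gb = 160
pod_mem_tb = gpus * mem_per_gpu_gb / 1000   # ~41 TB total

# Weight storage for a 10T-parameter model in fp16.
params = 10e12
bytes_per_param = 2
weights_tb = params * bytes_per_param / 1e12  # 20 TB

print(pod_mem_tb, weights_tb)
```

So the weights alone would fill only about half the pod's memory, leaving headroom for activations and serving state, consistent with "could easily run a 10T param model."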
Re. reading material, I would take a look at DeepSpeed’s blog posts (not affiliated btw). That team is super super good at hardware+software optimization for ML. See their post on MoE models here: https://www.microsoft.com/en-us/research/blog/deepspeed-adva...
algo_trader|4 years ago
Are these techniques for specific architectures or can they be made generic?
ml_hardware|4 years ago
See the paper here, Figure A28: https://kstatic.googleusercontent.com/files/b068c6c0e64d6f93...
But if your downstream task is simple, like sequence classification, then it may be possible to compress the model without losing much quality.
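A common baseline for that kind of compression is magnitude pruning: zero out the smallest-magnitude weights and keep only the largest ones. A minimal numpy sketch (toy classification head, illustrative names, not any particular library's API):

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of weights."""
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    # k-th smallest absolute value becomes the pruning threshold.
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) <= threshold, 0.0, w)

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 2))    # toy sequence-classification head
w_pruned = magnitude_prune(w, 0.9)   # keep only the largest ~10% of weights

print((w_pruned == 0).mean())
```

Real toolchains (like the TensorFlow Model Optimization pruning guide) do this gradually during fine-tuning rather than in one shot, which is what preserves quality on simple downstream tasks.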
algo_trader|4 years ago
https://www.tensorflow.org/model_optimization/guide/pruning
https://www.tensorflow.org/model_optimization/guide/pruning/...