ZeRO-3 Offload: Scale DL models to trillion parameters without code changes

97 points | ghosthamlet | 5 years ago | deepspeed.ai

48 comments

ansk|5 years ago

Question for someone knowledgeable about this: if I have a model which is large -- but small enough that I can fit a single training example on GPU -- does this approach offer speedups compared to simple gradient accumulation? Or is this only useful for models which are so large that the model parameters themselves are overwhelming GPU memory?
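For readers unfamiliar with the baseline being asked about: gradient accumulation splits a batch into micro-batches and sums their gradients before one optimizer step, which saves activation memory but does nothing about parameter/optimizer-state memory (the part ZeRO targets). A toy sketch with made-up numbers, showing the equivalence for plain SGD:

```python
# Toy illustration of gradient accumulation: averaging per-micro-batch
# gradients before a single SGD step is equivalent to one step on the
# full batch. All data and hyperparameters here are illustrative.

def grad(w, x, y):
    # gradient of the loss 0.5*(w*x - y)**2 with respect to w
    return (w * x - y) * x

data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 7.0)]
lr, w = 0.01, 0.0

# One SGD step on the full batch
full_grad = sum(grad(w, x, y) for x, y in data) / len(data)
w_full = w - lr * full_grad

# The same step computed via two micro-batches of size 2
acc = 0.0
for micro in (data[:2], data[2:]):
    acc += sum(grad(w, x, y) for x, y in micro)
w_accum = w - lr * acc / len(data)

print(abs(w_full - w_accum) < 1e-12)  # the two updates match
```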

dataangel|5 years ago

ELI5? All this techno babble just sounds like "it's faster because we optimized it". What are the nontrivial, new fundamental tricks?

jiofih|5 years ago

Third paragraph or so in the overview:

> ZeRO removes the memory redundancies across data-parallel processes by partitioning the three model states (optimizer states, gradients, and parameters) across data-parallel processes instead of replicating them. By doing this, it boosts memory efficiency compared to classic data-parallelism while retaining its computational granularity and communication efficiency
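A rough back-of-the-envelope sketch of what that partitioning buys, using the ~16 bytes/parameter accounting from the ZeRO paper (fp16 params + fp16 grads + fp32 Adam states); the function names and numbers below are illustrative, not DeepSpeed internals:

```python
# Per-rank memory for model states: replicated (classic data parallel)
# versus sharded 1/N per rank (the ZeRO-3 idea). Illustrative only.

def replicated_bytes(n_params, ranks, bytes_per_param=16):
    # every data-parallel rank holds a full copy of all model states
    return n_params * bytes_per_param  # same on every rank

def zero3_bytes(n_params, ranks, bytes_per_param=16):
    # params, grads, and optimizer states each sharded across ranks
    return n_params * bytes_per_param // ranks  # per rank

n_params, ranks = 1_000_000_000, 64
print(replicated_bytes(n_params, ranks))  # 16_000_000_000 bytes (~16 GB) per rank
print(zero3_bytes(n_params, ranks))       # 250_000_000 bytes (~0.25 GB) per rank
```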

vladf|5 years ago

Alternatively, one could get rid of the memory used by optimizers entirely by switching to vanilla SGD.

I haven’t tried this on transformers, and maybe that’s what breaks down here, but in “classic” supervised settings I’ve found SGD with schedule tuning just as fast as Adam.
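To make the memory claim concrete: Adam keeps two persistent fp32 buffers per parameter (moving averages of the gradient and its square), momentum-SGD keeps one, and vanilla SGD keeps none. A quick count, with all figures illustrative:

```python
# Extra fp32 optimizer state per parameter, beyond the parameters and
# gradients themselves. Why swapping Adam for plain SGD frees memory.

extra_state_per_param = {
    "sgd": 0,            # no persistent state
    "sgd_momentum": 1,   # one velocity buffer
    "adam": 2,           # moving averages of grad and grad**2
}

n_params = 1_000_000_000  # illustrative model size
for name, k in extra_state_per_param.items():
    gb = k * n_params * 4 / 1e9  # 4 bytes per fp32 value
    print(f"{name}: {gb:.0f} GB of optimizer state")
```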

gwern|5 years ago

SGD doesn't work on large Transformers, no. You need something like AdamW.

andrewprock|5 years ago

How much data do you need to mitigate the risk of overfitting a trillion parameter model?

gwern|5 years ago

You ideally need ~500GB of text, or so. EleutherAI's The Pile was designed to be just big enough to fit a 1t GPT efficiently, and you can get the various scaling curves out of the OA-related scaling papers. (You want the amount of data that fits into a single epoch, because if you reuse data, you get less bang for the FLOPs buck, and FLOPs constraints are right now much more binding than data or model size.)
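A quick sanity check on the order of magnitude: at a rough rate of ~4 bytes of English text per BPE token (a common heuristic, not a figure from this thread), ~500 GB is on the order of a hundred billion tokens for a single epoch:

```python
# Back-of-the-envelope token count for "~500 GB of text".
# The 4 bytes/token rate is an assumed heuristic for English BPE text.

bytes_of_text = 500e9
bytes_per_token = 4
tokens = bytes_of_text / bytes_per_token
print(f"~{tokens / 1e9:.0f}B tokens per epoch")  # → ~125B tokens per epoch
```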

singhrac|5 years ago

For those searching, DeepSpeed is implemented as a set of C++/CUDA extensions on top of PyTorch (compiled using their JIT).
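Also worth noting for searchers: the "without code changes" part comes from ZeRO-3 Offload being driven by a JSON config rather than model-code edits. A minimal fragment, roughly per the DeepSpeed config schema (values illustrative):

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu" },
    "offload_param": { "device": "cpu" }
  }
}
```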

bionhoward|5 years ago

please hook this up to Jax!

mchusma|5 years ago

This is super impressive. I could not figure out for a while who exactly was running this project, but it looks like it's Microsoft. Great work!