Hugging Face has been working on implementing this in their library, and it has some pretty amazing effects on the size of models you can train on a simple Colab.
Question for someone knowledgeable about this: if I have a model which is large -- but small enough that I can fit a single training example on GPU -- does this approach offer speedups compared to simple gradient accumulation? Or is this only useful for models which are so large that the model parameters themselves are overwhelming GPU memory?
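For reference, plain gradient accumulation (the baseline in the question) can be sketched in a few lines. This is my own toy scalar example, not DeepSpeed or Hugging Face code: gradients from several micro-batches are summed before a single optimizer step, which caps activation memory at one micro-batch but does nothing about the memory held by parameters, gradients, and optimizer states.

```python
# Toy gradient accumulation for a 1-parameter model y = w*x with squared loss.
# (Hypothetical minimal example; real training loops do this with autograd.)

def grad(w, x, y):
    # gradient of 0.5*(w*x - y)^2 with respect to w
    return (w * x - y) * x

def accumulated_step(w, batch, lr=0.1, accum_steps=4):
    """Split `batch` into micro-batches, sum their gradient contributions,
    then apply ONE update -- equivalent to a single large-batch step."""
    micro = len(batch) // accum_steps
    g = 0.0
    for i in range(accum_steps):
        chunk = batch[i * micro:(i + 1) * micro]
        # each micro-batch contributes its share of the full-batch mean gradient
        g += sum(grad(w, x, y) for x, y in chunk) / len(batch)
    return w - lr * g

def full_batch_step(w, batch, lr=0.1):
    g = sum(grad(w, x, y) for x, y in batch) / len(batch)
    return w - lr * g
```

The two functions produce the same update, which is the point: accumulation trades wall-clock time for activation memory, while ZeRO-style sharding attacks the model-state memory instead.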
GPT-NeoX is an example project that is using DeepSpeed and ZeRO-3 offloading. The wider project intends to train a GPT-3-sized model and release it freely to the world.
> ZeRO removes the memory redundancies across data-parallel processes by partitioning the three model states (optimizer states, gradients, and parameters) across data-parallel processes instead of replicating them. By doing this, it boosts memory efficiency compared to classic data-parallelism while retaining its computational granularity and communication efficiency
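The quoted partitioning can be illustrated with back-of-the-envelope arithmetic. This is my own sketch (the function and constants are illustrative, not Microsoft's code): with Adam in mixed precision, each parameter carries roughly 2 bytes of fp16 weight, 2 bytes of fp16 gradient, and 12 bytes of fp32 optimizer state (master weight, momentum, variance).

```python
# Approximate per-GPU bytes for model states under the three ZeRO stages.
# (Illustrative arithmetic only; ignores activations and communication buffers.)

def per_gpu_bytes(n_params, n_gpus, stage):
    params = 2 * n_params   # fp16 parameters
    grads = 2 * n_params    # fp16 gradients
    optim = 12 * n_params   # fp32 master weights + Adam momentum + variance
    if stage >= 1:          # ZeRO-1: shard optimizer states across ranks
        optim //= n_gpus
    if stage >= 2:          # ZeRO-2: additionally shard gradients
        grads //= n_gpus
    if stage >= 3:          # ZeRO-3: additionally shard the parameters themselves
        params //= n_gpus
    return params + grads + optim
```

On 8 GPUs this drops per-GPU model-state memory from ~16 bytes/parameter (classic data parallelism) to ~2 bytes/parameter at stage 3, which is why the memory savings scale with the number of data-parallel workers.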
See also zeroth-order backpropagation, which allows up to 300x faster training while not reducing accuracy that much:
https://arxiv.org/abs/2011.08895
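For intuition on what "zeroth-order" means here: gradients are estimated from function evaluations alone, with no backward pass. The sketch below is a generic SPSA-style estimator of my own, not the linked paper's algorithm (ZORB uses a different, pseudoinverse-based construction).

```python
# Simultaneous-perturbation (SPSA-style) zeroth-order gradient estimate:
# two loss evaluations give a noisy estimate of the full gradient, which is
# unbiased in expectation over the random perturbation direction.
import random

def spsa_gradient(f, w, eps=1e-3, seed=0):
    """Estimate the gradient of f at point w from two evaluations of f."""
    rng = random.Random(seed)
    delta = [rng.choice((-1.0, 1.0)) for _ in w]       # random +/-1 directions
    w_plus = [wi + eps * di for wi, di in zip(w, delta)]
    w_minus = [wi - eps * di for wi, di in zip(w, delta)]
    scale = (f(w_plus) - f(w_minus)) / (2 * eps)
    return [scale / di for di in delta]                # per-coordinate estimate
```

A single estimate is noisy, but averaging over many perturbations converges to the true gradient; that trade-off (cheap, backward-free estimates vs. noise) is the core idea of zeroth-order methods.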
How much does ZeRO-3 affect accuracy?
Alternatively, one could get rid of the memory used by optimizers entirely by switching to vanilla SGD.
I haven’t tried this on transformers and maybe that’s what breaks down here but in “classic” supervised settings I’ve found SGD with schedule tuning just as fast as Adam.
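The memory argument for SGD is simple arithmetic. This is my own rough sketch, not a benchmark: Adam keeps two extra fp32 buffers (momentum and variance) per parameter, SGD with momentum keeps one, and plain SGD keeps none.

```python
# Optimizer-state bytes for a model with n_params parameters.
# (Illustrative helper of my own; counts only the optimizer's extra buffers,
# not the weights or gradients themselves.)

def optimizer_state_bytes(n_params, optimizer="adam", dtype_bytes=4):
    buffers = {
        "adam": 2,          # exp. moving avg of grads + of squared grads
        "sgd_momentum": 1,  # one velocity buffer
        "sgd": 0,           # vanilla SGD holds no extra state
    }[optimizer]
    return buffers * dtype_bytes * n_params
```

For a 1B-parameter model in fp32 that is ~8GB of state Adam holds that vanilla SGD simply doesn't, which is why dropping Adam is an alternative (or complement) to sharding that state with ZeRO.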
You ideally need ~500GB of text, or so. EleutherAI's The Pile was designed to be just big enough to fit a 1T-parameter GPT efficiently, and you can get the various scaling curves out of the OA-related scaling papers. (You want the amount of data that fits into a single epoch, because if you reuse data, you get less bang for the FLOPs buck, and FLOPs constraints are right now much more binding than data or model size.)
FL33TW00D|5 years ago
https://huggingface.co/blog/zero-deepspeed-fairscale
stephenroller|5 years ago
diptanu|5 years ago
ansk|5 years ago
joshlk|5 years ago
https://github.com/EleutherAI/gpt-neox
ma2rten|5 years ago
https://github.com/EleutherAI/gpt-neox/issues/171
dataangel|5 years ago
jonbaer|5 years ago
jiofih|5 years ago
bevenky|5 years ago
https://github.com/pytorch/pytorch/pull/46750
minimaxir|5 years ago
alphagrep12345|5 years ago
The_rationalist|5 years ago
See also https://github.com/microsoft/fastformers
vladf|5 years ago
gwern|5 years ago
andrewprock|5 years ago
gwern|5 years ago
singhrac|5 years ago
bionhoward|5 years ago
mchusma|5 years ago