item 31682887

Techniques for Training Large Neural Networks

133 points | todsacerdoti | 3 years ago | openai.com | reply

23 comments

[+] liuliu|3 years ago|reply
1. I missed this paper https://www.microsoft.com/en-us/research/uploads/prod/2018/0... when doing my own feature map compression research: https://liuliu.me/eyes/reduce-another-70-memory-usage-for-de.... Thanks for pointing it out!

2. Otherwise, most of these are obvious optimizations. One widely popular optimization (that I consider non-obvious) is ZeRO-Offload, in particular the gradient sharding scheme (although once learned, the sharding itself is pretty straightforward, just a bit chatty). One thing I think is undervalued these years, though, is Alex's "One weird trick": https://arxiv.org/abs/1404.5997. This scheme is much more convoluted but very effective when training large MLPs. It is probably not popular because a) the implementation is non-obvious; b) large MLPs fell out of fashion quickly, and the computation shape for transformers probably looks very different from the MLPs it was originally trying to solve for (with 4096 activations per layer).
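For readers who haven't seen gradient sharding before, here is a minimal single-process simulation of the ZeRO-style scheme mentioned above. All sizes are made up, and the reduce-scatter/all-gather helpers are toy stand-ins for the real collectives (e.g. in torch.distributed); this is a sketch of the idea, not an implementation:

```python
# Toy simulation of ZeRO-style gradient sharding: each of N workers owns
# 1/N of the parameters, receives only the averaged gradient for its own
# shard (reduce-scatter), updates it, then rebuilds the full parameter
# vector (all-gather).
import numpy as np

def reduce_scatter(grads_per_worker):
    """Worker i ends up with the gradient for shard i, averaged over all workers."""
    n = len(grads_per_worker)
    shards = [np.array_split(g, n) for g in grads_per_worker]
    return [np.mean([shards[w][i] for w in range(n)], axis=0) for i in range(n)]

def all_gather(param_shards):
    """Every worker reconstructs the full parameter vector."""
    return np.concatenate(param_shards)

n_workers, dim, lr = 4, 12, 0.1
params = np.ones(dim)
# pretend each worker computed gradients on its own data batch
local_grads = [np.full(dim, float(w + 1)) for w in range(n_workers)]

grad_shards = reduce_scatter(local_grads)       # gradients communicated, sharded
param_shards = np.array_split(params, n_workers)
new_shards = [p - lr * g for p, g in zip(param_shards, grad_shards)]
params = all_gather(new_shards)                 # updated params communicated

print(params)  # average gradient is 2.5 everywhere -> 1 - 0.1*2.5 = 0.75
```

The "chatty" part the parent mentions is visible even here: one reduce-scatter plus one all-gather per step, instead of a single all-reduce.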

[+] MathYouF|3 years ago|reply
This is really interesting.

> In particular, model parallelism is efficient when the amount of computation per neuron activity is high (because the neuron activity is the unit being communicated), while data parallelism is efficient when the amount of computation per weight is high (because the weight is the unit being communicated).

In this case, for fully connected layers, the amount of computation per neuron is high because it is (fully) connected to and from every other neuron in the previous and following layer, and therefore we want more GPUs to work on those in parallel, so it isn't a time bottleneck?

What does it mean, then, to have a high "computation per weight"?

> • Convolutional layers cumulatively contain about 90-95% of the computation, about 5% of the parameters, and have large representations.
> • Fully-connected layers contain about 5-10% of the computation, about 95% of the parameters, and have small representations.

Wouldn't FC layers require more computation since each parameter has more incoming and outgoing values?

I don't doubt this is all accurate and I'm just missing some obvious intuition, but for the sake of learning feel free to explain it if you wish!
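One way to build that intuition: a conv weight is reused at every spatial position of the output, while an FC weight is used exactly once per example. A back-of-the-envelope sketch, with illustrative AlexNet-like shapes (the specific sizes are my assumption, not from the article):

```python
# FLOPs-per-parameter for a conv layer vs. a fully-connected layer.
def conv_stats(c_in, c_out, k, h_out, w_out):
    params = c_in * c_out * k * k
    macs = params * (h_out * w_out)   # each weight reused at every output position
    return params, macs

def fc_stats(n_in, n_out):
    params = n_in * n_out
    macs = params                     # each weight used exactly once per example
    return params, macs

p_conv, m_conv = conv_stats(c_in=96, c_out=256, k=5, h_out=27, w_out=27)
p_fc, m_fc = fc_stats(n_in=9216, n_out=4096)

print(m_conv / p_conv)  # 729 MACs per conv weight (one per 27x27 output position)
print(m_fc / p_fc)      # 1 MAC per FC weight
```

So FC layers do *more total work per neuron* (huge fan-in/fan-out, favoring model parallelism) but *less work per weight* (each weight touched once, so shipping weights around in data parallelism is expensive relative to the compute they buy).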

[+] rllin|3 years ago|reply
"One weird trick" is still pretty much the first way to go for most recommendation systems that are large embeddings focused. see torchrec, nvidia's hugectr
[+] Der_Einzige|3 years ago|reply
What if you abandoned gradients/back-prop entirely? They knew this worked well for reinforcement learning as far back as 2017!

https://openai.com/blog/evolution-strategies/

Put engineering effort into creating fast, GPU-powered (TensorFlow/PyTorch) algorithms for neuroevolution of the weights of a neural network. I'm still not convinced that properly leveraged gradient-free/feed-forward-only optimizers have been tried yet by researchers, mainly because they never actually wrote fast gradient-free optimizers!
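The estimator from that blog post is simple enough to sketch in a few lines. Here is a toy numpy version (the population size, noise scale, learning rate, and quadratic objective are all arbitrary choices for illustration):

```python
# Minimal evolution-strategies optimizer: no gradients or backprop, only
# forward evaluations of a black-box reward under Gaussian perturbations.
import numpy as np

def es_optimize(f, theta, npop=50, sigma=0.1, lr=0.02, iters=300, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(iters):
        eps = rng.standard_normal((npop, theta.size))   # perturbation directions
        rewards = np.array([f(theta + sigma * e) for e in eps])
        rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        # weighted sum of perturbations estimates the gradient of expected reward
        theta = theta + lr / (npop * sigma) * (eps.T @ rewards)
    return theta

target = np.array([3.0, -2.0])
reward = lambda x: -np.sum((x - target) ** 2)   # maximize => move toward target
theta = es_optimize(reward, np.zeros(2))
print(theta)  # ends up close to [3, -2]
```

Note the parallelism angle from the post: each worker only needs its random seed and the scalar rewards to reconstruct the update, so the communication per step is tiny compared to shipping gradients.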

[+] hungrigekatze|3 years ago|reply
I was curious about the 'community AI' research orgs' stances on distributed training of deep neural nets, so some weeks ago I stumbled upon Eleuther AI's FAQ page, which explains that it is not a task they are looking at due to various technological challenges:

Source: https://www.eleuther.ai/faq/

What about volunteer-driven distributed computing, like BOINC, Folding@Home, or hivemind?
- Backpropagation is dense and sensitive to precision, therefore requiring high-bandwidth communication. Consumer-grade internet connections are wholly insufficient.
- Mixture-of-experts-based models tend to significantly underperform monolithic (regular) models for the same number of parameters.
- Having enough contributors to outweigh the high overhead is infeasible.
- Verifiability and resistance to outside attack are not currently possible without significant additional overhead.
In short, doing volunteer-driven distributed compute well for this use case is an unsolved problem.

---

Am really excited to see inroads being made in this field of active research and hope that all AI orgs - OpenAI, Eleuther, etc. - can take part in this domain of much-needed (IMO) research.
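The FAQ's bandwidth point can be made concrete with rough arithmetic. All numbers below are illustrative assumptions (a 1B-parameter model, fp16 gradients, one full gradient exchange per optimizer step), not measurements:

```python
# Back-of-the-envelope: gradient exchange time per training step.
params = 1_000_000_000              # assumed 1B-parameter model
bytes_per_grad = 2                  # fp16 gradients
grad_bytes = params * bytes_per_grad            # 2 GB moved per step

datacenter_bw = 100e9 / 8           # 100 Gb/s cluster interconnect, bytes/s
home_bw = 50e6 / 8                  # 50 Mb/s consumer uplink, bytes/s

print(grad_bytes / datacenter_bw)   # 0.16 s per exchange on a cluster
print(grad_bytes / home_bw)         # 320 s per exchange over home internet
```

A three-orders-of-magnitude gap per step is hard to hide behind any amount of overlap or compression, which is roughly the FAQ's first bullet.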

[+] superb-owl|3 years ago|reply
I'm really excited to see folks starting to talk about parallelizing machine learning. The conversation has been dominated by GPU-friendly techniques - a classic example of "everything looks like a nail when you have a hammer".

I hope we start seeing more massively parallel training strategies (most likely with GPUs under the hood still)

[+] kristjansson|3 years ago|reply
> starting

This is a strange comment to see. The techniques in TFA have been at the heart of the last few years of progress in large-scale language models, image generation, etc.

[+] rg111|3 years ago|reply
Dude, parallelization is exactly how non-trivial neural networks are trained since the beginning.

Parallel processing as in SIMD architecture is what makes Deep Learning possible now.

Parallel computing as in stacking graphics cards is how literally everything is done everywhere - from big tech companies to not-so-well-funded data science departments in unknown unis.

You are so so off.

[+] ShamelessC|3 years ago|reply
Strange comment. Maybe you need to edit something? I can't parse it - parallel (e.g. CUDA) techniques have been popular since AlexNet.