t-vi | 4 months ago
For example, Alex Graves's (great! with attention) 2013 paper "Generating Sequences with Recurrent Neural Networks" has this line:
One difficulty when training LSTM with the full gradient is that the derivatives sometimes become excessively large, leading to numerical problems. To prevent this, all the experiments in this paper clipped the derivative of the loss with respect to the network inputs to the LSTM layers (before the sigmoid and tanh functions are applied) to lie within a predefined range.
with this footnote:
In fact this technique was used in all my previous papers on LSTM, and in my publicly available LSTM code, but I forgot to mention it anywhere—mea culpa.
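The technique Graves describes — clipping the derivative of the loss with respect to a layer's pre-activation inputs, rather than clipping weight gradients — can be sketched with a tiny manual backward pass. This is an illustrative NumPy sketch, not Graves's code: the toy tanh "layer", the loss scale, and the clipping range of 10 are all made-up values chosen so the clipping actually engages.

```python
import numpy as np

# Toy forward pass: pre-activation, tanh nonlinearity, scaled sum as loss.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w = rng.standard_normal((8, 8))
scale = 1000.0                       # exaggerated to force large derivatives

pre = x @ w                          # inputs to the nonlinearity
h = np.tanh(pre)
loss = h.sum() * scale

# Backward pass with Graves-style clipping: the derivative w.r.t. the
# pre-activation (i.e. before tanh is applied) is clamped to a range.
# The limit 10.0 is illustrative, not taken from the paper.
d_h = np.full_like(h, scale)         # d(loss)/d(h)
d_pre = d_h * (1.0 - h**2)           # d(loss)/d(pre), via tanh' = 1 - tanh^2
d_pre = np.clip(d_pre, -10.0, 10.0)  # <-- the clipping step
d_x = d_pre @ w.T                    # continue backprop as usual
```

The point of clipping here, mid-backprop, rather than on the final weight gradients, is that the clamp bounds the signal before it is propagated further back through the network.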
That said, backpropagation seems important enough to me that I once did a specialized video course just about PyTorch (1.x) autograd.
HarHarVeryFunny | 4 months ago
Perhaps, but maybe that's because there was more experimentation with different neural-net architectures and nodes/layers back then?
Nowadays the training problems are better understood, clipping is supported by the frameworks, and it's easy to find training examples online with clipping enabled.
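Framework support for clipping is easy to demonstrate; PyTorch, for instance, ships `torch.nn.utils.clip_grad_norm_`, which rescales all parameter gradients so their global norm stays under a bound. A minimal sketch (the LSTM sizes, loss scale, and `max_norm=1.0` are arbitrary illustration values):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A small LSTM and optimizer; sizes are arbitrary for illustration.
model = nn.LSTM(input_size=8, hidden_size=16)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(5, 3, 8)             # (seq_len, batch, features)
out, _ = model(x)
loss = (out * 1000.0).sum()          # scaled up to produce large gradients
loss.backward()

# Clip the global gradient norm after backward(), before the optimizer step.
# Returns the norm *before* clipping, handy for logging/monitoring.
total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
```

Clipping between `backward()` and `step()` is the standard placement, and it is exactly this one-liner that most published training loops use.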
The problem itself didn't actually go away. ReLU (or GELU) is still the default activation for most networks, and training an LLM is apparently something of a black art. Hugging Face just released their "Smol Training Playbook: a distillation of hard earned knowledge to share exactly what it takes to train SOTA LLMs", so evidently even in 2025 training isn't exactly a turn-key affair.
joe_the_user | 4 months ago