This is not that relevant for ML. Each gradient pass recomputes your cost function and the gradients, so errors are not likely to accumulate. The main thing is to avoid errors big enough that you land in a completely different part of the parameter space and derail progress, which is what the above commenter points out.
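Roughly, the loop looks like this (a minimal NumPy sketch on a made-up toy least-squares problem, just to show that the loss and gradient are recomputed from the current parameters at every step, not carried over):

    import numpy as np

    # Toy problem, invented for illustration: loss = 0.5 * ||A w - b||^2.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((50, 10))
    b = rng.standard_normal(50)
    w = np.zeros(10)

    lr = 5e-3
    for step in range(1000):
        residual = A @ w - b      # fresh forward pass from the current w
        grad = A.T @ residual     # fresh gradient, nothing accumulated from past steps
        w -= lr * grad            # a rounding error here only nudges w slightly;
                                  # the next iteration recomputes everything from that w

The only state that survives a step is w itself, so per-step floating-point error perturbs the trajectory a little but doesn't compound the way it would in one long chain of dependent arithmetic.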
thesz|3 years ago
I am familiarizing myself with recurrent neural networks, and getting them trained online is a pain - I get NaNs all the time except for very small learning rates, which in turn prevent my networks from learning anything.
The deeper the network is, the more pronounced the accumulation of errors in online training becomes. Add 20-30 fully connected (not highway or residual) layers before the softmax and you'll see wonders there; you won't be able to keep anything stable.
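To make the mechanism concrete, here is a minimal sketch (NumPy, with made-up layer widths; it only illustrates scale drift in a plain dense stack, not an actual training setup):

    import numpy as np

    def activation_norm_through_stack(depth, width, weight_std, seed=0):
        """Push a random input through `depth` plain dense+ReLU layers
        (no residual or highway connections) and return the final norm."""
        rng = np.random.default_rng(seed)
        x = rng.standard_normal(width)
        for _ in range(depth):
            W = rng.standard_normal((width, width)) * weight_std
            x = np.maximum(W @ x, 0.0)   # ReLU, no skip connection, no normalization
        return float(np.linalg.norm(x))

    # Slightly too small vs. slightly too large weight scale: over 30 layers the
    # activation norm collapses toward zero or blows up by many orders of
    # magnitude, instead of staying O(1) as it would with residual connections.
    print(activation_norm_through_stack(depth=30, width=256, weight_std=0.05))
    print(activation_norm_through_stack(depth=30, width=256, weight_std=0.15))

Once the pre-softmax values are that far off scale, a large learning rate only has to overshoot once to produce infs and then NaNs.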
emn13|3 years ago
I have some ~15-year-old experience with the math behind some of this, but actually none with day-to-day deep learning applications using any of the now-conventional algorithms, so my perspective here is perhaps not that of the most pragmatic user. The status quo may have improved, at least de facto.
kelseyfrog|3 years ago
1. https://spectrum.ieee.org/floating-point-numbers-posits-proc...