This is not that relevant for ML. Each gradient pass recomputes your cost function and the gradients, so errors are not likely to accumulate. The main thing is to avoid errors big enough that you land in a completely different part of the parameter space and derail progress, which is what the above commenter points out.
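Roughly, the loop looks like this (a minimal NumPy sketch on a made-up toy least-squares problem, just to show that the loss and gradient are recomputed from the current parameters at every step, not carried over):

    import numpy as np

    # Toy problem, invented for illustration: loss = 0.5 * ||A w - b||^2.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((50, 10))
    b = rng.standard_normal(50)
    w = np.zeros(10)

    lr = 5e-3
    for step in range(1000):
        residual = A @ w - b      # fresh forward pass from the current w
        grad = A.T @ residual     # fresh gradient, nothing accumulated from past steps
        w -= lr * grad            # a rounding error here only nudges w slightly;
                                  # the next iteration recomputes everything from that w

The only state that survives a step is w itself, so per-step floating-point error perturbs the trajectory a little but doesn't compound the way it would in one long chain of dependent arithmetic.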
thesz|3 years ago
I am familiarizing myself with recurrent neural networks, and getting them trained online is a pain - I get NaNs all the time except for very small learning rates, which in turn prevent my networks from learning anything.
The deeper the network is, the more pronounced the accumulation of errors in online training becomes. Add 20-30 fully connected (not highway or residual) layers before the softmax and you'll see wonders there; you won't be able to keep anything stable.
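To make the mechanism concrete, here is a minimal sketch (NumPy, with made-up layer widths; it only illustrates scale drift in a plain dense stack, not an actual training setup):

    import numpy as np

    def activation_norm_through_stack(depth, width, weight_std, seed=0):
        """Push a random input through `depth` plain dense+ReLU layers
        (no residual or highway connections) and return the final norm."""
        rng = np.random.default_rng(seed)
        x = rng.standard_normal(width)
        for _ in range(depth):
            W = rng.standard_normal((width, width)) * weight_std
            x = np.maximum(W @ x, 0.0)   # ReLU, no skip connection, no normalization
        return float(np.linalg.norm(x))

    # Slightly too small vs. slightly too large weight scale: over 30 layers the
    # activation norm collapses toward zero or blows up by many orders of
    # magnitude, instead of staying O(1) as it would with residual connections.
    print(activation_norm_through_stack(depth=30, width=256, weight_std=0.05))
    print(activation_norm_through_stack(depth=30, width=256, weight_std=0.15))

Once the pre-softmax values are that far off scale, a large learning rate only has to overshoot once to produce infs and then NaNs.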
emn13|3 years ago
I have some ~15-year-old experience with the math behind some of this, but actually none with day-to-day deep learning applications using any of the now-conventional algorithms, so my perspective here is perhaps not that of the most pragmatic user. The status quo may have improved, at least de facto.
kelseyfrog|3 years ago
1. https://spectrum.ieee.org/floating-point-numbers-posits-proc...