
_hl_ | 1 year ago

Some of the "prior art" here is ladder networks and, to some handwavy extent, residual nets, both of which can be interpreted as training the model to reduce the error of its previous predictions rather than predicting the final result directly. I think some of the intuition for why it works is that it changes the gradient descent landscape to be a bit friendlier towards learning in small baby steps: you are now explicitly designing the network around the idea that it will start off making lots of errors in its predictions and then get better over time.
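
As a rough illustration of that reading (not taken from the linked work), here is a minimal PyTorch-style sketch where each block only predicts a correction that is added to the running prediction, so every step just has to shrink the leftover error; the class name, layer sizes, and block count are made up for the example.

    import torch
    import torch.nn as nn

    class ResidualRefiner(nn.Module):
        """Each block outputs a small correction added to the running
        prediction, so it only needs to reduce the error left over
        from the previous step rather than solve the task in one shot."""
        def __init__(self, dim: int, n_blocks: int = 4):
            super().__init__()
            self.blocks = nn.ModuleList([
                nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
                for _ in range(n_blocks)
            ])

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            pred = x  # initial, crude prediction
            for block in self.blocks:
                pred = pred + block(pred)  # refine by predicting the residual
            return pred

    # The loss is taken on the final refined prediction, but because each
    # block contributes only an additive correction, gradients push every
    # step towards reducing the remaining error a little at a time.
    model = ResidualRefiner(dim=16)
    x = torch.randn(8, 16)
    y = model(x)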

