item 25488912

pretty_dumm_guy | 5 years ago

Hi Professor,

Good day,

I was wondering whether it would be possible for you to provide an overview of the different methods that you think have the best shot at replacing the backpropagation algorithm?


duvenaud | 5 years ago

Sure. First of all, I want to say that backprop, by which I mean reverse-mode differentiation for computing gradients, combined with gradient descent for updating parameters, is pretty great. In a sense it's the last thing we should be trying to replace, since pretty much the whole deep learning revolution was about replacing hand-designed functions with ones that can be optimized in this way.

Reverse-mode differentiation has about the same time cost as whatever function you're optimizing, no matter how many parameters you need gradients for. This is about as good as one could hope for, and is what lets it scale to billions of parameters.

The main downside of reverse-mode differentiation (and one of the reasons it's biologically implausible) is that it requires storing all the intermediate numbers that were computed when evaluating the function on the forward pass. So its memory cost grows with the complexity of the function being optimized.
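That memory cost is easy to see in a toy implementation. Here is a minimal reverse-mode autodiff sketch (all names are made up for illustration): every intermediate `Var` produced on the forward pass must be kept alive until the backward pass consumes it, so memory grows with the number of operations in the function.

```python
# Minimal reverse-mode autodiff sketch (illustrative, not a real library).
# Each operation records its inputs and local derivatives; these recorded
# intermediates are exactly the memory cost discussed above.

class Var:
    def __init__(self, value, parents=(), grad_fns=()):
        self.value = value
        self.parents = parents    # upstream Vars kept alive until backward
        self.grad_fns = grad_fns  # local derivative w.r.t. each parent
        self.grad = 0.0

    def __mul__(self, other):
        return Var(self.value * other.value,
                   parents=(self, other),
                   grad_fns=(lambda g: g * other.value,
                             lambda g: g * self.value))

    def __add__(self, other):
        return Var(self.value + other.value,
                   parents=(self, other),
                   grad_fns=(lambda g: g, lambda g: g))

def backward(out):
    # Topologically order the recorded graph, then accumulate gradients
    # in reverse -- one cheap pass, regardless of how many leaves there are.
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for p in node.parents:
                visit(p)
            order.append(node)
    visit(out)
    out.grad = 1.0
    for node in reversed(order):
        for parent, gf in zip(node.parents, node.grad_fns):
            parent.grad += gf(node.grad)

x = Var(3.0)
y = x * x + x   # f(x) = x^2 + x, so f(3) = 12 and f'(3) = 7
backward(y)
```

Note that `backward` can only run because the whole forward graph (the "tape") is still in memory; a deeper network means a longer tape.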

So the main practical problem with reverse-mode differentiation + gradient descent is the memory requirement, and much of the research presented in the workshop is about ways to get around this. A few of the major approaches are:

1) Only storing a subset of the forward activations, to get noisier gradients at less memory cost. This is what the "Randomized Automatic Differentiation" paper does. You can also save memory and get exact gradients if you reconstruct the activations as you need them (called checkpointing), but this is slower.
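The checkpointing trade-off can be sketched in a few lines (the layer functions and names here are hypothetical): store only every `stride`-th activation on the forward pass, then recompute the ones in between when the backward pass needs them. Gradients stay exact; the cost is roughly one extra forward pass.

```python
# Toy checkpointing sketch over a chain of layers (illustrative only).

def forward_with_checkpoints(x, layers, stride=2):
    checkpoints = {0: x}
    h = x
    for i, f in enumerate(layers):
        h = f(h)
        if (i + 1) % stride == 0:
            checkpoints[i + 1] = h   # keep only a sparse subset of activations
    return h, checkpoints

def recompute_segment(checkpoints, layers, start, end):
    # Rebuild the activations between two checkpoints for the backward pass.
    h = checkpoints[start]
    acts = [h]
    for f in layers[start:end]:
        h = f(h)
        acts.append(h)
    return acts

layers = [lambda h: 2 * h, lambda h: h + 1, lambda h: h * h, lambda h: h - 3]
out, cps = forward_with_checkpoints(1.0, layers, stride=2)   # stores 1.0, 3.0, 6.0
segment = recompute_segment(cps, layers, 2, 4)               # rebuilds 3.0 -> 9.0 -> 6.0
```

With n layers and a stride of sqrt(n), this gives the classic O(sqrt(n)) memory / one-extra-forward-pass trade-off.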

2) Only training one layer at a time. This is what the "Layer-wise Learning" papers are doing. I suppose you could also say that this is what the "feedback alignment" papers are doing.
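A toy version of the layer-wise idea (the scalar "layers" and local targets below are made up for illustration): each layer is trained against its own local objective, so no gradient ever crosses a layer boundary and no cross-layer activations need to be stored.

```python
# Greedy layer-wise training sketch: each "layer" is one scalar weight
# fit to a purely local objective by plain gradient descent.

def train_layer(w, inputs, targets, lr=0.02, steps=200):
    n = len(inputs)
    for _ in range(steps):
        # Gradient of the layer's local mean squared error only.
        grad = sum(2 * (w * x - t) * x for x, t in zip(inputs, targets)) / n
        w -= lr * grad
    return w

xs = [1.0, 2.0, 3.0]
w1 = train_layer(0.0, xs, [2 * x for x in xs])    # layer 1 learns "times 2"
h1 = [w1 * x for x in xs]                         # its outputs feed layer 2
w2 = train_layer(0.0, h1, [3 * h for h in h1])    # layer 2 learns "times 3"
```

The open question, of course, is where good local targets come from in a real network; that's what the layer-wise and feedback-alignment papers differ on.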

3) If the function being optimized is a fixed-point computation (such as an optimization), you can compute its gradient without needing to store any activations by using the implicit function theorem. This is what my talk was about.
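For a scalar fixed point the implicit-function-theorem trick looks like this (a sketch, not the method from the talk): if z* = f(z*, theta), then dz*/dtheta = (df/dtheta) / (1 - df/dz) evaluated at z*, so the iterates themselves never need to be stored.

```python
import math

# IFT gradient sketch for the scalar fixed point z* = tanh(z* + theta).

def fixed_point(theta, iters=200):
    z = 0.0
    for _ in range(iters):
        z = math.tanh(z + theta)   # no history kept: O(1) memory
    return z

def ift_grad(theta):
    z = fixed_point(theta)
    # For f(z, theta) = tanh(z + theta), df/dz and df/dtheta coincide:
    dfdz = 1.0 - math.tanh(z + theta) ** 2
    return dfdz / (1.0 - dfdz)     # dz*/dtheta via the implicit function theorem

theta = 0.5
g_ift = ift_grad(theta)

# Sanity check against a finite difference through the whole iteration.
eps = 1e-6
g_fd = (fixed_point(theta + eps) - fixed_point(theta - eps)) / (2 * eps)
```

Backprop through the 200 iterations would store 200 intermediates; the IFT route stores none, at the price of a linear solve (here a scalar division) at the solution.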

4) Some other forms of sensitivity analysis (not exactly the same as computing gradients) can be done by just letting a dynamical system run for a little while. Barak Pearlmutter has some work on how he thinks this is what happens in slow-wave sleep to make our brains less prone to seizures when we're awake.

I'm missing a lot of relevant work, and again I don't even know all the work that was presented at this one workshop. But I hope this helps.

bmc7505 | 5 years ago

> Barak Pearlmutter has some work on how he thinks this is what happens in slow-wave sleep to make our brains less prone to seizures when we're awake.

Interesting! I am more familiar with Pearlmutter's work on automatic differentiation, but was unaware of this work with Houghton.

A new hypothesis for sleep: tuning for criticality: https://zero.sci-hub.se/2153/6c1cfbc1b78d23ef2e1cb7102dd8339...

There is also a related paper on wake-sleep learning from UofT, of which I am sure you are aware:

The wake-sleep algorithm for unsupervised neural networks: https://www.cs.toronto.edu/~hinton/absps/ws.pdf

Are you aware of any recent work investigating the role of sleep in biological and statistical learning?

pretty_dumm_guy | 5 years ago

Thank you for your answer. It appears to me that we are trying to achieve an algorithm that has better time complexity than the one we have right now (reverse-mode differentiation with gradient descent).

Is it possible to combine these methods in a straightforward manner with methods that try to reduce the space complexity? For example, the lottery ticket hypothesis (https://arxiv.org/abs/1803.03635) seems to reduce space complexity (please do correct me if I am wrong).

Also, based on my rather poor and limited knowledge, it appears to me that the set of proposed methods that reduce space complexity and the set of proposed methods that reduce time complexity are disjoint. Is that the case?