top | item 21442003

(no title)

scarecrow112 | 6 years ago

The post lists Auto Differentiation as one of the techniques that can overcome non-differentiable loss functions. Can someone explain how this is even possible? After all automatic differentiation[1] is a way to compute gradients/derivatives in a way which could become costly if otherwise done(symbolic differentiation[2]). The function or the operations defined in the function(in case of source-to-source differentiation[3]) needs to be differentiable.

[1] - https://www.youtube.com/watch?v=z8GyNneq5D4

[2] - https://stackoverflow.com/a/45134000

[3] - https://github.com/google/tangent

EDIT: 1. Added reference. 2. Formatting

discuss

order

ddragon|6 years ago

I think it's easier to visualize considering how pytorch (or tensorflow eager) does it. Pytorch supports loops, conditionals and other native python non differentiable operators, but what happens is that for each forward/backward pass pair you run the graph building function again. Only the functions that are overloaded by the library with the execution graph ("tape") creation steps affect what's in each final tape (and they are all differentiable), while the non differentiable native parts (like loops) only define the combination of the former differentiable pieces. You can even have a function that asks for user input every time in order to build the tape (a non pure function), as long as said graph is always differentiable at each run then auto-differentiation still works. Also, unlike symbolic differentiation, you can have a surrogate gradient that approximates the behavior of the function well enough when it's not differentiable (for example max pooling for convolutional network).

The original tensorflow work the same, but instead of running the graph every time, it embeds the non differentiable control mechanism in the code of the graph, which can more efficiently (without needing the host language to build a new one every time) create the correct differentiable tape for each run based on it's input. And source-to-source differentiation work exactly the same way, except instead of having to use a DSL (like the tensorflow graph API) and compile it, it simply uses the host language and compiler directly (so you don't need effectively two languages). Which is the case of Julia's Zygote and Swift for Tensorflow.

The only alternative to this piecewise differentiation that I know of would be creating a soft version of discrete operators, such as replacing step functions with sigmoids and case/switch/elsif operators as softmax selectors for example, which is not what any of those libraries do (it would not be easy to make it converge as the graph would be much more complex at each backward pass). In this case you could have one single graph that includes every branch though.

dimatura|6 years ago

In my opinion, the point about automatic differentiation isn't quite correct. Automatic differentiation will help you compute gradients for complicated functions involving multiple operations, as long as each of these operations are themselves differentiable (recursively; at some point, there's basic 'building blocks' with hard-coded gradients). But not all operations are inherently differentiable (in fact, most probably aren't). Now, there are other ideas such as neural Turing machines and differentiable programming, but these are not really what you'd refer to when talking about autodifferentiation.

AbrahamParangi|6 years ago

Non-differentiable here doesn't mean actually non-differentiable in the mathematical sense, it just means that the function does not expose a derivative that is accessible to you.

maffydub|6 years ago

I'm not sure. According to Wikipedia (https://en.wikipedia.org/wiki/Differentiable_function),

> a differentiable function of one real variable is a function whose derivative exists at each point in its domain

Most programs aren't differentiable functions because conditionals aren't in general differentiable. For example "if x > 0 { 1 } else { -1 }" doesn't have a derivative at 0 (and so, by the definition above) isn't differentiable at 0.

...or have I missed something?

jing|6 years ago

In practice the functions just need to be piecewise differentiable. The RELU is the canonical example for deep learning. At kinks a subderivative is used.

mlthoughts2018|6 years ago

It’s a little trickier than that, but you are generally right. The relaxation can involve many different things though, not just piecewise differentiability.

For example, when processing images that themselves may be intensity functions over a spatial domain, and the intensity function can have cusps, edges, occlusions, etc., then you need different weak-sense differentiability conditions, such as from Sobolev spaces, to guarantee that numerical gradient operators will succeed / converge where needed.

https://en.m.wikipedia.org/wiki/Sobolev_space

bigred100|6 years ago

This is beyond my reach but I’m giving it a stab:

Whatever your computer actually computes is more or less a bunch of pluses, minuses, multiplications, divisions. The function you want to compute may not be differentiable, but the sequence of arithmetic operations you actually compute to arrive at your output should basically be differentiable by the chain rule. Then an AD tool should let you differentiate that.

It seems likely to me that there’s nothing extraordinarily good about this technique without further discussion of the specific problem application given that the field of derivative free optimization exists and is actively researched. I don’t really know why those guys would bother if AD (not a new thing) supplanted their field.

In numerical computing in general I’ve seen quite a bit the idea that you can either differentiate the thing you’re trying to optimize, then use a numerical method to evaluate the derivative, or use a numerical method to evaluate the thing you want to optimize and then differentiate the numerical method. I’m not expert enough to provide any real analysis of when to use each.

jsinai|6 years ago

In theory if you had a source-to-source automatic differentiator (a programme which takes the source code for another programme f and outputs the source code for its derivative f'), then you could run backpropagation through that programme and compute the error at all computation steps.

This begs the question, what is the derivative of a programme? At its simplest (and I guess this is what the OP was trying to say), any computer programme can be reduced to arithmetic and also logical control flow operations, and thus the derivative of a programme involves calculating the derivatives each computation along all computation paths.

I'm no expert but this talk by Simon Peyton-Jones should make the concept much clearer:

https://2019.ecoop.org/details/ecoop-2019-papers/11/Automati...

kxyvr|6 years ago

This is a bit of an irritation of mine. To be clear, if a function is nondifferentiable, automatic differentiation will not magically fix this. Ever.

There are plenty of nondifferentiable, elementary functions. For example, `sqrt` is nondifferentiable at 0. Depending on if you want to include it as an elementary function, `abs` is another. Now, what people get away with is that really pathological functions such as the Dirichlet function aren't readily representable on a computer. Most of the time, points of discontinuity or nondifferentiability occur at a small finite number of locations. As such, we pretend like we can just ignore these points.

The unfortunate part is that these points of nondifferentiability to come up quite often because a lot of the interesting action in a function occurs near these points. Take `abs`, if we were doing something like `min abs(x)` we'd be cruising along moving toward the origin until we hit the kink, which turns out to be the interesting part of this function. At this point, most algorithms will have trouble.

Now, we could also say that, "Well, a subderivative/subgradient exists and the routine can just return that and it will be fine." That's all well and true if you're using an optimization algorithm that correctly deals with subderivatives and is notified of them. Most optimization algorithms do not handle subderivatives. The ones that do generally need to know that there are many subderivatives in a particular location and get to choose the one they want. A typical AD routine will not return this information.

If you want an example from the deep learning field, look at the old sigmoid activation functions. If we set aside the pretense that this is a simulation of the brain, sigmoids are a differentiable representation of the step (Heaviside) function. Recall how these algorithms had trouble with vanishing gradients? Well, the derivative of the step function is zero everywhere except at the origin where it doesn't exist. How are we supposed to optimize with that? We can't. That's why we used sigmoid functions, at the time.

Ok, does that mean we can't use nondifferentiable functions when optimizing? No. However, I would encourage everyone to clearly document the points of nondifferentiability because it makes debugging these issues vastly easier. Outside of this, the way we fix them depends on a point by point case. For example, if we wanted to solve the problem `min abs(f(x))` we could instead solve the problem `min y st y>=f(x) y>=-f(x)`. Assuming that `f` is differentiable, the new optimization problem is as well, but we now have to handle inequality constraints. For a soft variety, I like `sqrt(1+f(x)^2)-1`. In the deep learning field, step functions are replaced by sigmoids and linear rectifiers are replaced by soft rectifiers. And, yes, there are plenty of examples where soft rectifiers weren't used and things worked great. That's great when it works. I'll contend that such techniques don't have to break every time for it to be a problem.

Anyway, the bottom line is that nondifferentiability is a problem because virtually all gradient based optimization solvers require this property. Subgradients are only good enough if the underlying algorithm explicitly works with this information and most solvers don't. None of this means that everything has to break every time. Certainly, if we find an optimal solution away from points of nondifferentiability, we're good. However, it does create some really difficult to debug situations. Finally, in case it wasn't clear, yes, the article is talking about deep learning, but the kind of deep learning techniques under discussion are using optimization solvers to fit their particular model to the data, so the general discussion holds.