(no title)
scarecrow112 | 6 years ago
[1] - https://www.youtube.com/watch?v=z8GyNneq5D4
[2] - https://stackoverflow.com/a/45134000
[3] - https://github.com/google/tangent
EDIT: 1. Added reference. 2. Formatting
ddragon|6 years ago
The original TensorFlow works the same way, but instead of having the host language rebuild the graph on every run, it embeds the non-differentiable control mechanisms in the graph itself, which lets it more efficiently produce the correct differentiable tape for each run based on its input. Source-to-source differentiation works exactly the same way, except that instead of having to use a DSL (like the TensorFlow graph API) and compile it, it uses the host language and compiler directly (so you effectively don't need two languages). That's the case with Julia's Zygote and Swift for TensorFlow.
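A rough sketch of the tape idea in plain Python (the names and structure here are mine, not the API of TensorFlow, Zygote, or Swift for TensorFlow): control flow runs in the host language, and only the operations that actually execute get recorded, so each run produces the tape that matches the branches it took.

    class Var:
        def __init__(self, value, tape=None):
            self.value = value
            self.grad = 0.0
            self.tape = tape if tape is not None else []

        def _record(self, value, parents, local_grads):
            out = Var(value, self.tape)
            # Remember how to push gradients back through this one operation.
            self.tape.append((out, parents, local_grads))
            return out

        def __add__(self, other):
            return self._record(self.value + other.value, (self, other), (1.0, 1.0))

        def __mul__(self, other):
            return self._record(self.value * other.value, (self, other),
                                (other.value, self.value))

    def backward(output):
        output.grad = 1.0
        # Replay the recorded operations in reverse, applying the chain rule.
        for out, parents, local_grads in reversed(output.tape):
            for p, g in zip(parents, local_grads):
                p.grad += g * out.grad

    def f(x):
        # Ordinary host-language control flow: only the branch taken is taped.
        return x * x if x.value > 0 else x + x

    x = Var(3.0)
    y = f(x)
    backward(y)
    print(y.value, x.grad)  # 9.0 6.0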
The only alternative to this piecewise differentiation that I know of would be to create soft versions of the discrete operators, for example replacing step functions with sigmoids and case/switch/elsif operators with softmax selectors, which is not what any of those libraries do (it would not be easy to make it converge, since the graph would be much more complex at each backward pass). In that case you could have a single graph that includes every branch, though.
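For a concrete picture of what that "soft" version would look like, here is a toy sketch in plain Python (again, not something any of those libraries actually do):

    import math

    def hard_branch(x):
        # Discrete control flow: derivative is 0 wherever it exists, undefined at 0.
        return 1.0 if x > 0 else -1.0

    def soft_branch(x, temperature=0.1):
        # A sigmoid acts as a differentiable selector between the two branches,
        # so both appear in a single expression and the gradient is nonzero.
        s = 1.0 / (1.0 + math.exp(-x / temperature))
        return s * 1.0 + (1.0 - s) * (-1.0)

    for x in (-1.0, -0.1, 0.0, 0.1, 1.0):
        print(x, hard_branch(x), round(soft_branch(x), 3))

A softmax over branch scores plays the same selector role for a multi-way case/switch.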
dimatura|6 years ago
AbrahamParangi|6 years ago
maffydub|6 years ago
> a differentiable function of one real variable is a function whose derivative exists at each point in its domain
Most programs aren't differentiable functions because conditionals aren't, in general, differentiable. For example, "if x > 0 { 1 } else { -1 }" doesn't have a derivative at 0, and so, by the definition above, isn't differentiable.
...or have I missed something?
jing|6 years ago
mlthoughts2018|6 years ago
For example, when processing images, which are themselves intensity functions over a spatial domain, the intensity function can have cusps, edges, occlusions, etc., so you need different, weak-sense differentiability conditions, such as those from Sobolev spaces, to guarantee that numerical gradient operators will succeed/converge where needed.
https://en.m.wikipedia.org/wiki/Sobolev_space
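For reference, the weak derivative that Sobolev spaces are built on (the standard one-dimensional definitions, sketched in LaTeX):

    % v is the weak derivative of u on \Omega if, for every test function
    % \varphi \in C_c^\infty(\Omega),
    \int_\Omega u(x)\,\varphi'(x)\,dx = -\int_\Omega v(x)\,\varphi(x)\,dx
    % and W^{1,p}(\Omega) collects the u \in L^p(\Omega) whose weak
    % derivatives Du also lie in L^p(\Omega):
    W^{1,p}(\Omega) = \{\, u \in L^p(\Omega) : Du \in L^p(\Omega) \,\}

So, for instance, |x| has no classical derivative at 0, but it does have the weak derivative sign(x).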
bigred100|6 years ago
Whatever your computer actually computes is more or less a bunch of pluses, minuses, multiplications, and divisions. The function you want to compute may not be differentiable, but the sequence of arithmetic operations you actually execute to arrive at your output should basically be differentiable by the chain rule. An AD tool then lets you differentiate that.
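As a toy illustration of that point, here's a minimal forward-mode AD sketch in plain Python (not any particular tool's API): each primitive operation carries a value and its derivative, and the chain rule is applied one operation at a time.

    import math

    class Dual:
        """Forward-mode AD: propagate a value and its derivative through each op."""
        def __init__(self, value, deriv=0.0):
            self.value, self.deriv = value, deriv

        def __add__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.value + other.value, self.deriv + other.deriv)

        def __mul__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            # Product rule applied to this single primitive operation.
            return Dual(self.value * other.value,
                        self.deriv * other.value + self.value * other.deriv)

    def sqrt(x):
        # Chain rule for one more primitive; note this fails at exactly 0.
        return Dual(math.sqrt(x.value), x.deriv * 0.5 / math.sqrt(x.value))

    x = Dual(3.0, 1.0)        # seed dx/dx = 1
    y = sqrt(x * x + 1.0)     # the executed ops: multiply, add, sqrt
    print(y.value, y.deriv)   # sqrt(x^2 + 1) and its derivative at x = 3

The executed sequence is differentiable along this particular path, which is the sense in which the arithmetic you actually ran can be differentiated even if the mathematical function has kinks elsewhere.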
It seems likely to me that there’s nothing extraordinarily good about this technique without further discussion of the specific problem application given that the field of derivative free optimization exists and is actively researched. I don’t really know why those guys would bother if AD (not a new thing) supplanted their field.
In numerical computing in general, I've often seen the idea that you can either differentiate the thing you're trying to optimize and then use a numerical method to evaluate the derivative, or use a numerical method to evaluate the thing you want to optimize and then differentiate the numerical method. I'm not expert enough to provide any real analysis of when to use each.
jsinai|6 years ago
This raises the question: what is the derivative of a programme? At its simplest (and I guess this is what the OP was trying to say), any computer programme can be reduced to arithmetic and logical control-flow operations, and thus the derivative of a programme involves calculating the derivative of each computation along all computation paths.
I'm no expert, but this talk by Simon Peyton Jones should make the concept much clearer:
https://2019.ecoop.org/details/ecoop-2019-papers/11/Automati...
kxyvr|6 years ago
There are plenty of nondifferentiable elementary functions. For example, `sqrt` is nondifferentiable at 0. Depending on whether you want to include it as an elementary function, `abs` is another. Now, what people get away with is that really pathological functions, such as the Dirichlet function, aren't readily representable on a computer. Most of the time, points of discontinuity or nondifferentiability occur at a small, finite number of locations. As such, we pretend we can just ignore these points.
The unfortunate part is that these points of nondifferentiability come up quite often, because a lot of the interesting action in a function occurs near them. Take `abs`: if we were doing something like `min abs(x)`, we'd be cruising along, moving toward the origin, until we hit the kink, which turns out to be the interesting part of this function. At that point, most algorithms will have trouble.
Now, we could also say, "Well, a subderivative/subgradient exists, and the routine can just return that and it will be fine." That's all well and good if you're using an optimization algorithm that correctly deals with subderivatives and is notified of them. Most optimization algorithms do not handle subderivatives. The ones that do generally need to know that there are many subderivatives at a particular location and get to choose the one they want. A typical AD routine will not return this information.
If you want an example from the deep learning field, look at the old sigmoid activation functions. If we set aside the pretense that this is a simulation of the brain, sigmoids are a differentiable representation of the step (Heaviside) function. Recall how these algorithms had trouble with vanishing gradients? Well, the derivative of the step function is zero everywhere except at the origin where it doesn't exist. How are we supposed to optimize with that? We can't. That's why we used sigmoid functions, at the time.
Ok, does that mean we can't use nondifferentiable functions when optimizing? No. However, I would encourage everyone to clearly document the points of nondifferentiability, because it makes debugging these issues vastly easier. Beyond that, how we fix them is handled case by case. For example, if we wanted to solve the problem `min abs(f(x))`, we could instead solve the problem `min y s.t. y >= f(x), y >= -f(x)`. Assuming that `f` is differentiable, the new optimization problem is as well, but we now have to handle inequality constraints. For a soft variety, I like `sqrt(1+f(x)^2)-1`. In the deep learning field, step functions are replaced by sigmoids and linear rectifiers are replaced by soft rectifiers. And, yes, there are plenty of examples where soft rectifiers weren't used and things worked great. That's great when it works. I'll contend that such techniques don't have to break every time for it to be a problem.
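As a toy sketch of that soft variety in plain Python (with a hypothetical f(x) = x - 2 purely for illustration), plain gradient descent on sqrt(1 + f(x)^2) - 1 walks right up to the kink that would trip up `min abs(f(x))`:

    import math

    def abs_obj(x):
        return abs(x - 2.0)          # kink sits exactly at the minimizer x = 2

    def soft_obj(x):
        # Smooth surrogate sqrt(1 + f(x)^2) - 1; differentiable everywhere.
        return math.sqrt(1.0 + (x - 2.0) ** 2) - 1.0

    def soft_grad(x):
        return (x - 2.0) / math.sqrt(1.0 + (x - 2.0) ** 2)

    x = 10.0
    for _ in range(200):
        x -= 0.5 * soft_grad(x)      # plain gradient descent on the surrogate
    print(round(x, 4), round(abs_obj(x), 4))  # converges toward x = 2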
Anyway, the bottom line is that nondifferentiability is a problem because virtually all gradient based optimization solvers require this property. Subgradients are only good enough if the underlying algorithm explicitly works with this information and most solvers don't. None of this means that everything has to break every time. Certainly, if we find an optimal solution away from points of nondifferentiability, we're good. However, it does create some really difficult to debug situations. Finally, in case it wasn't clear, yes, the article is talking about deep learning, but the kind of deep learning techniques under discussion are using optimization solvers to fit their particular model to the data, so the general discussion holds.
samcodes|6 years ago
http://www.juliadiff.org/