
Differentiable Programming – A Simple Introduction

159 points | dylanbfox | 3 years ago | assemblyai.com

49 comments


infogulch|3 years ago

The most interesting thing I've seen on AD is "The simple essence of automatic differentiation" (2018) [1]. See past discussion [2], and talk [3]. I think the main idea is that by compiling to categories and pairing up a function with its derivative, the pair becomes trivially composable in forward mode, and the whole structure is easily converted to reverse mode afterwards.

[1]: https://dl.acm.org/doi/10.1145/3236765

[2]: https://news.ycombinator.com/item?id=18306860

[3]: Talk at Microsoft Research: https://www.youtube.com/watch?v=ne99laPUxN4 Other presentations listed here: https://github.com/conal/essence-of-ad
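The "pair a function with its derivative" idea composes mechanically in forward mode. A minimal, illustrative sketch using dual numbers in plain Python (this is the classic dual-number trick, not Elliott's categorical construction; all names here are made up for the example):

```python
# Forward-mode AD via dual numbers: each value carries its derivative,
# and the (value, derivative) pair composes through arithmetic automatically.

class Dual:
    def __init__(self, val, deriv=0.0):
        self.val = val      # primal value
        self.deriv = deriv  # derivative w.r.t. the seeded input

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # sum rule: (f + g)' = f' + g'
        return Dual(self.val + other.val, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule: (f * g)' = f' * g + f * g'
        return Dual(self.val * other.val,
                    self.deriv * other.val + self.val * other.deriv)

    __rmul__ = __mul__

def derivative(f, x):
    # Seed the input with tangent 1 and read the tangent off the output.
    return f(Dual(x, 1.0)).deriv

# d/dx (x*x + 3*x) at x = 2 is 2*2 + 3 = 7
print(derivative(lambda x: x * x + 3 * x, 2.0))  # 7.0
```

Reverse mode is the part that takes real work, which is what the reply below is getting at: the forward pairing is easy, but turning it inside out efficiently is not.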

tome|3 years ago

> the whole structure is easily converted to reverse mode afterwards.

Unfortunately it's not. Elliott never actually demonstrates in the paper how to implement such an algorithm, and it's very hard to write compiler transformations in "categorical form".

(Disclosure: I'm the author of another paper on AD.)

yauneyz|3 years ago

My professor has talked about this. He thinks that the real gem of the deep learning revolution is the ability to take the derivative of arbitrary code and use that to optimize. Deep learning is just one application of that, but there are tons more.

SleekEagle|3 years ago

That's part of why Julia is so exciting! Building it specifically to be a differentiable programming language opens so many doors ...

melony|3 years ago

I am just happy that the previously siloed fields of operations research and the various control theory sub-disciplines are now incentivized to pool their research together thanks to the funding in ML. Also, a lot of expensive, proprietary optimization software in industry is finally getting some competition.

potbelly83|3 years ago

How do you differentiate a string? Enum?

choeger|3 years ago

Nice article, but the intro is a little lengthy.

I have one remark, though: If your language allows for automatic differentiation already, why do you bother with a neural network in the first place?

I think you should have a good reason why you choose a neural network for your approximation of the inverse function, and why it has exactly that number of layers. For instance, why shouldn't a simple polynomial suffice? Could it be that your neural network ends up as an approximation of the Taylor expansion of your inverse function?

ChrisRackauckas|3 years ago

It's a good question, which I answer in much more depth here: https://www.reddit.com/r/MachineLearning/comments/ryw53x/d_f... . Specifically that answer is about NNs vs Fourier series, but it's the same point: polynomials, Fourier series, and NNs are all universal function approximators, so why use a NN? If you write out what happens, though, you can easily see the curse of dimensionality in action and see why a 100-dimensional function shouldn't be approximated by tensor products of polynomials. See the rest there.

So it's not really about NNs vs polynomials/Fourier/Chebyshev, etc., but rather what is the right thing to use at a given time. In the universal differential equations paper (https://arxiv.org/abs/2001.04385), we demonstrate that some examples of automatic equation discovery are faster using Fourier series than NNs (specifically there, the discovery of a semilinear partial differential equation). That doesn't mean NNs are a bad way to do all of this; it just means they are one trainable object that is good for high-dimensional approximations, and libraries should let you easily move between classical basis functions and NNs to achieve the best performance.
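The curse of dimensionality mentioned above is easy to make concrete with a little counting. A sketch (the dimension and degree values are illustrative choices):

```python
from math import comb

def tensor_product_terms(dims, degree):
    # Full tensor product of 1-D polynomial bases of degree `degree`:
    # (degree + 1) coefficient choices per dimension, multiplied together.
    return (degree + 1) ** dims

def total_degree_terms(dims, degree):
    # Leaner alternative: all monomials of total degree <= `degree`
    # in `dims` variables, counted by a binomial coefficient.
    return comb(dims + degree, degree)

# In 2 dimensions a cubic tensor-product basis is tiny...
print(tensor_product_terms(2, 3))    # 16
# ...but in 100 dimensions it explodes to 4**100 (~1.6e60 coefficients),
print(tensor_product_terms(100, 3))
# and even the much leaner total-degree basis still needs ~177k terms.
print(total_degree_terms(100, 3))    # 176851
```

The coefficient count for tensor products grows exponentially in the dimension, which is exactly why classical bases stop being practical for high-dimensional functions.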

SleekEagle|3 years ago

I think for more complicated examples like RL control systems a neural network is the natural choice. If you can incorporate physics into your world model then you'd need differentiable programming + NNs, right? Or am I misunderstanding the question.

If you're talking about the specific cannon problem, you don't need to do any learning at all; you can just solve the kinematics. So in some sense you could ask why you're using any approximation function.

simulate-me|3 years ago

Requiring an exact justification for a specific NN architecture is not productive. Simply put, such justification almost never exists, yet NNs clearly outperform simple polynomials on a wide variety of tasks. Even if you know that a NN performs better than a simple polynomial on a task, you're very unlikely to get an exact explanation of why a specific NN was chosen vs. the infinite number of other possible NNs. If you require such an explanation, then you're going to miss out on a lot of performance.

potatoman22|3 years ago

Most of the time we hear about neural networks because the linear models didn't work.

adgjlsfhk1|3 years ago

In low-dimensional spaces there are a lot of good techniques, but once you go above 10 or so dimensions, NNs are generally the only option. Pretty much everything else degrades exponentially with the number of dimensions.

PartiallyTyped|3 years ago

The nice thing about differentiable programming is that we can use all sorts of optimizers beyond gradient descent, some of which offer quadratic convergence instead of linear!
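The linear-vs-quadratic difference shows up even on a toy 1-D problem. A hedged sketch (the test function, step size, and iteration count are all illustrative choices): minimize f(x) = x**4/4 - x, whose unique minimizer is x* = 1.

```python
def grad(x):   # f'(x) for f(x) = x**4 / 4 - x
    return x ** 3 - 1.0

def hess(x):   # f''(x)
    return 3.0 * x ** 2

def gradient_descent(x, steps, lr=0.1):
    # first-order update: linear convergence near the minimum
    for _ in range(steps):
        x -= lr * grad(x)
    return x

def newton(x, steps):
    # second-order update: quadratic convergence near the minimum
    for _ in range(steps):
        x -= grad(x) / hess(x)
    return x

# After the same 7 iterations from x = 2, Newton's error is near
# machine precision, while gradient descent's is still around 1e-2.
print(abs(newton(2.0, 7) - 1.0))
print(abs(gradient_descent(2.0, 7) - 1.0))
```

Quadratic convergence roughly doubles the number of correct digits per iteration, which is why the gap opens up so fast.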

SleekEagle|3 years ago

Yes, exactly! This is huge. Hessian optimization is really easy with JAX; I haven't tried it in Julia, though.

applgo443|3 years ago

Why can't we use this quadratic convergence in deep learning?

fennecs|3 years ago

Does someone have an example where the ability to “differentiate” a program gets you something interesting?

I understand perfectly what it means for a neural network, but how about more abstract things.

I'm not even sure the implementation, as currently presented, actually means something. What is the derivative of a function like List, or Sort, or GroupBy? These articles all assume that it somehow just looks like the derivative from calculus.

Approximating everything as some non-smooth real function doesn't seem entirely morally correct. A program is more discrete or synthetic. I think it should be a bit more algebraically flavoured, like differentials over a ring.

fghorow|3 years ago

At first glance, this approach appears to re-invent an applied mathematics approach to optimal control. There, one writes a generalized Hamiltonian, from which forward and backward-in-time paths can be iterated.

The Pontryagin maximum (or minimum, if you define your objective function with a minus sign) principle is the essence of that approach to optimal control.
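For reference, the principle in its standard minimum-principle form (a textbook sketch, with the usual symbols: state x, control u, costate λ):

```latex
% Minimize \int_0^T L(x, u)\,dt subject to dynamics \dot{x} = f(x, u).
% Define the Hamiltonian:
H(x, u, \lambda) = L(x, u) + \lambda^\top f(x, u)
% State and costate (adjoint) equations, integrated forward and
% backward in time respectively:
\dot{x} = \frac{\partial H}{\partial \lambda}, \qquad
\dot{\lambda} = -\frac{\partial H}{\partial x}
% Pontryagin: the optimal control minimizes H pointwise in time:
u^*(t) = \arg\min_{u} \, H\bigl(x^*(t), u, \lambda(t)\bigr)
```

The backward-in-time costate equation is the same structure as reverse-mode AD's adjoint pass, which is why the two fields keep rediscovering each other.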

SleekEagle|3 years ago

I've never heard of Pontryagin's maximum principle, but thanks for bringing it to my attention. I think my knowledge of dynamical systems and control theory isn't quite up to snuff at the moment to fully understand it, but it's bookmarked for another day! Thanks again.

noobermin|3 years ago

The article is okay but it would have helped to have labelled the axes of the graphs.