Great post, as an AD tutorial and as a (an?) Haskell exercise. Having known nothing about AD before, I feel like I have a good understanding of what it is -- as he says, it's so simple -- but I don't understand why the algorithm is so much faster. Just looking at the differentiator function and the AD function, it actually appears that the AD should take longer because it does more computation per step (both the function and the derivative). But it seems like every article or paper is talking about how to implement AD, not why the algorithm is so efficient. Does anyone happen to know of a good article or paper about that? Ideally, one just as nice and comprehensible as this!
The first alternative builds a large tree structure, and then evaluates the whole tree structure afterwards.
So first it blows up the size of the expression to process, and then it calculates the result. A lot of those calculations will be redundant.
The second one not only avoids evaluating the tree separately, but "prunes" a lot of the evaluation automatically by effectively short-circuiting the whole process. Consider (with the caveat that my Haskell understanding is rudimentary at best and it was about 20 years since I last did anything involving symbolic differentiation):
Product e e' -> Product e (diff e') `Sum` Product (diff e) e'
Followed by a separate run with:
Product e e' -> ev e * ev e'
vs
Product e e' -> let (ex, ed)   = ev e
                    (ex', ed') = ev e'
                in (ex * ex', ex * ed' + ed * ex')
(I pick the "Product" rule as an example because it is one of the ones that blows up the size of the tree)
Let's say you do something simple like Product X X. You get Sum (Product X One) (Product One X) out, and then you have to evaluate each node.
In the second case, you match Product e e'. You process X and assign (x,1) to (ex,ed), and process the second X and assign (x,1) to (ex', ed'), and then return (ex * ex', ex * ed' + ed * ex').
In the first case, you've first differentiated 3 nodes (Product + 2x "X"), then evaluated the 7 nodes that were produced as output, for a total of ten nodes processed.
In the second you've evaluated/differentiated 3 nodes in one go, without the intermediate step of having to evaluate a much larger tree.
In a large example, the number of nodes in the differentiated output quickly explodes and subsequent evaluation would increase rapidly in cost.
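The contrast can be sketched end-to-end. This is my own minimal reconstruction, assuming a small expression type (`X`, `Zero`, `One`, `Sum`, `Product`); the post's actual definitions may differ:

```haskell
data Expr = X | Zero | One | Sum Expr Expr | Product Expr Expr

-- Strategy 1: symbolic differentiation builds a new, larger tree...
diff :: Expr -> Expr
diff X              = One
diff Zero           = Zero
diff One            = Zero
diff (Sum e e')     = Sum (diff e) (diff e')
diff (Product e e') = Sum (Product e (diff e')) (Product (diff e) e')

-- ...which a separate pass then has to walk and evaluate:
eval :: Double -> Expr -> Double
eval x e = case e of
  X           -> x
  Zero        -> 0
  One         -> 1
  Sum a b     -> eval x a + eval x b
  Product a b -> eval x a * eval x b

-- Strategy 2: one traversal returning (value, derivative) pairs,
-- with no intermediate tree at all.
ev :: Double -> Expr -> (Double, Double)
ev x e = case e of
  X           -> (x, 1)
  Zero        -> (0, 0)
  One         -> (1, 0)
  Sum a b     -> let (va, da) = ev x a
                     (vb, db) = ev x b
                 in (va + vb, da + db)
  Product a b -> let (va, da) = ev x a
                     (vb, db) = ev x b
                 in (va * vb, va * db + da * vb)
```

Here `ev 3 (Product X X)` returns the pair directly, while the first strategy builds `Sum (Product X One) (Product One X)` and then walks that larger tree to reach the same derivative.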
In American English, this would be "a Haskell exercise," pronouncing "a" like "hey" or "ah." In British English, it would be "an 'askell exercise."
I'm American, though. British people, please weigh in.
> For fans of Lisp, there is no question that one motivation is to show how easy it is to implement in Lisp. Lisp provides a natural representation for programs as data and a natural form for writing programs that write programs, which is what we do in ADIL. The code is short, and is in ANSI standard Common Lisp. Since it is not "naive" code written in just the obvious idioms of introductory Lisp, it illustrates, for those with only a cursory familiarity, that Lisp is more than CAR and CDR. In fact we did not use those routines at all.
Ah, but you did use A and D. cAr-tomatic cDr-ivation!
Algebraically, automatic differentiation is the same as adding a nilpotent element e, with e^2 = 0, to your algebra. You can continue this pattern to get higher-order derivatives. For example, if you instead add an element f where f^3 = 0, the coefficient of f^2 is proportional to the second derivative.
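That pattern can be sketched concretely (my own illustration, with made-up names like `Trunc`): track the coefficients of 1, e, and e^2, with e^3 = 0 baked into multiplication, and the e^2 coefficient of f(x + e) comes out as f''(x)/2.

```haskell
-- Truncated polynomials a + b*e + c*e^2, with e^3 = 0.
data Trunc = Trunc Double Double Double
  deriving (Eq, Show)

instance Num Trunc where
  Trunc a b c + Trunc a' b' c' = Trunc (a + a') (b + b') (c + c')
  Trunc a b c * Trunc a' b' c' =
    -- Coefficients of e^3 and higher are simply dropped.
    Trunc (a * a') (a * b' + b * a') (a * c' + b * b' + c * a')
  negate (Trunc a b c) = Trunc (negate a) (negate b) (negate c)
  fromInteger n = Trunc (fromInteger n) 0 0
  abs    = undefined  -- not needed for this sketch
  signum = undefined  -- not needed for this sketch

-- f(x) = x^3 evaluated at x + e, with x = 2:
-- value 8, first derivative 12, and f''(2)/2 = 6 in the e^2 slot.
example :: Trunc
example = let x = Trunc 2 1 0 in x * x * x
```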
This example makes it really clear what's going on. Could someone translate it to do reverse automatic differentiation? That's the one I never quite understand.
You mean integration? That's much harder, but has been done automatically. Symbolic differentiation is easy, because you can just keep blindly applying a set of rewrite rules until none of them apply. That process converges on a unique result. Symbolic integration doesn't converge in that way. More strategy is required, and you're not guaranteed a closed form solution. Mathematica has a good symbolic integrator.
The short answer is no, for a couple of reasons. First, you have to distinguish whether you want the antiderivative or the integral.

In the first case, the problem with finding the antiderivative is that it's not unique. For example, the antiderivative of x can be 0.5 * x^2, or 0.5 * x^2 + 1, or 0.5 * x^2 + 2, etc. Basically, taking derivatives can lose information, and we don't know how to put that information back.

In the second case, we have the integral, but that creates problems because the integral really needs a domain to make sense: over what values are we integrating? In 1-D we can specify this with a bunch of intervals, but in 2 and higher dimensions it becomes difficult. In fact, this is partially why we have meshing tools like Cubit or Gmsh. Yes, we use them with finite element methods, but really, at its core, we need them to break up a domain because we want to integrate over the pieces.
Anyway, with enough information or on a simple enough setup, both can actually be done, but the general case isn't super well defined, which is why the simple answer is no.
The thing that's great about the typeclass approach is that you can do anything you want behind the implementation. You can numerically evaluate the expression, but even cooler, you can recover the parse tree. I never could sort out how to deal with 'if', because it's not typeclassed. If it were, boy could you do some amazing stuff: partial differentiation, tree rewriting, and with the LLVM stuff you could runtime-compile arbitrary functions. Super neat trick.
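A minimal sketch of that parse-tree trick (my own illustration; `Tree` and `poly` are made-up names, not from the post): any function polymorphic over `Num` can be run on plain numbers or, with a different instance, reflected into a syntax tree.

```haskell
-- A syntax-tree type that happens to be an instance of Num.
data Tree = Lit Integer | Var String | Add Tree Tree | Mul Tree Tree
  deriving (Eq, Show)

instance Num Tree where
  (+) = Add
  (*) = Mul
  fromInteger = Lit
  negate t = Mul (Lit (-1)) t
  abs    = undefined  -- not needed for this sketch
  signum = undefined  -- not needed for this sketch

-- One polymorphic definition...
poly :: Num a => a -> a
poly x = x * x + 3 * x + 1

-- ...usable numerically:     poly (2 :: Double)   evaluates to 11.0
-- ...or as a recovered tree: poly (Var "x")
```

The same `poly` source text yields either a number or the whole expression tree, which is exactly what makes later tree rewriting or runtime compilation possible.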
That's also what's great about pure dynamic systems like Smalltalk. 'ifTrue:' is just a message-send, the class "False" doesn't evaluate the block parameter, the class "True" does. And yes, you can then recover the parse tree, for example the Gemstone OODB lets you use arbitrary blocks (closures) as queries, recovers the parse tree and then creates an optimized query from it. Quite neat.
Could automatic differentiation learn, by applying learnable weights to the compiled S-expression atoms and backpropagating errors with stochastic gradient descent?
A program would constrain a task with functional statements, which is then compiled to weighted s-expressions which learn the specific task from training data.
Dual numbers are just a means of formalizing the properties of an epsilon (a tiny change in calculus), and are the means of preserving enough information to think about the function and its derivative at the same time. E.g. (x + ε)² = x² + 2xε + ε², but ε² = 0, so we get x² + 2xε (a tiny change squared becomes an insignificantly tiny change).
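That ε can be sketched as a Haskell type (my own illustration, not code from the post): pairs representing a + b·ε, with the ε² term dropped in multiplication.

```haskell
-- A dual number a + b*e, where the 'b' slot carries the derivative.
data Dual = Dual Double Double
  deriving (Eq, Show)

instance Num Dual where
  Dual a b + Dual a' b' = Dual (a + a') (b + b')
  Dual a b * Dual a' b' = Dual (a * a') (a * b' + b * a')  -- e^2 term dropped
  negate (Dual a b) = Dual (negate a) (negate b)
  fromInteger n = Dual (fromInteger n) 0
  abs    = undefined  -- not needed for this sketch
  signum = undefined  -- not needed for this sketch

-- (x + e)^2 at x = 5: value 25, derivative 10, matching x^2 + 2xe.
sq :: Dual
sq = let x = Dual 5 1 in x * x
```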
johnbender:
http://matt.might.net/articles/parsing-with-derivatives/
pjscott:
https://colah.github.io/posts/2015-08-Backprop/
tome:
I hope this helps!
http://h2.jaguarpaw.co.uk/posts/why-is-naive-symbolic-differ...
kazinator:
Aha, here it is: http://www.cs.berkeley.edu/~fateman/papers/ADIL.pdf
eru:
The problems occur when you want to typeclass your lambdas. The difficulties are mostly surmountable, though. Check out http://okmij.org/ftp/tagless-final/course/
finin:
https://archive.org/stream/bitsavers_mitaiaimAI_878286/AIM-0...
ftp://publications.ai.mit.edu/ai-publications/pdf/AIM-010.pdf
deepnet: A sort of neural-net / functional-program hybrid.