Great post, as an AD tutorial and as a (an?) Haskell exercise. Having known nothing about AD before, I feel like I have a good understanding of what it is -- as he says, it's so simple -- but I don't understand why the algorithm is so much faster. Just looking at the differentiator function and the AD function, it actually appears that the AD should take longer because it does more computation per step (both the function and the derivative). But it seems like every article or paper is talking about how to implement AD, not why the algorithm is so efficient. Does anyone happen to know of a good article or paper about that? Ideally, one just as nice and comprehensible as this!
The first alternative builds a large tree structure, and then evaluates the whole tree structure afterwards.
So first it blows up the size of the expression to process, and then it calculates the result. A lot of those calculations will be redundant.
The second one not only avoids evaluating the tree separately, but "prunes" a lot of the evaluation automatically by effectively short-circuiting the whole process. Consider (with the caveat that my Haskell understanding is rudimentary at best and it was about 20 years since I last did anything involving symbolic differentiation):
Product e e' -> Product e (diff e') `Sum` Product (diff e) e'
Followed by a separate run with:
Product e e' -> ev e * ev e'
vs
Product e e' -> let (ex, ed)   = ev e
                    (ex', ed') = ev e'
                in (ex * ex', ex * ed' + ed * ex')
(I pick the "Product" rule as an example because it is one of the ones that blows up the size of the tree)
Let's say you do something simple like Product X X. You get Sum (Product X One) (Product One X) out, and then you have to evaluate each node.
In the second case, you match Product e e'. You process X and assign (x,1) to (ex,ed), and process the second X and assign (x,1) to (ex', ed'), and then return (ex * ex', ex * ed' + ed * ex').
In the first case, you've first differentiated 3 nodes (Product + 2x "X"), then evaluated the 7 nodes that were produced as output, for a total of ten nodes processed.
In the second you've evaluated/differentiated 3 nodes in one go, without the intermediate step of having to evaluate a much larger tree.
In a large example, the number of nodes in the differentiated output quickly explodes and subsequent evaluation would increase rapidly in cost.
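The contrast can be sketched end-to-end. This is my own minimal reconstruction, assuming a small expression type (`X`, `Zero`, `One`, `Sum`, `Product`); the post's actual definitions may differ:

```haskell
data Expr = X | Zero | One | Sum Expr Expr | Product Expr Expr

-- Strategy 1: symbolic differentiation builds a new, larger tree...
diff :: Expr -> Expr
diff X              = One
diff Zero           = Zero
diff One            = Zero
diff (Sum e e')     = Sum (diff e) (diff e')
diff (Product e e') = Sum (Product e (diff e')) (Product (diff e) e')

-- ...which a separate pass then has to walk and evaluate:
eval :: Double -> Expr -> Double
eval x e = case e of
  X           -> x
  Zero        -> 0
  One         -> 1
  Sum a b     -> eval x a + eval x b
  Product a b -> eval x a * eval x b

-- Strategy 2: one traversal returning (value, derivative) pairs,
-- with no intermediate tree at all.
ev :: Double -> Expr -> (Double, Double)
ev x e = case e of
  X           -> (x, 1)
  Zero        -> (0, 0)
  One         -> (1, 0)
  Sum a b     -> let (va, da) = ev x a
                     (vb, db) = ev x b
                 in (va + vb, da + db)
  Product a b -> let (va, da) = ev x a
                     (vb, db) = ev x b
                 in (va * vb, va * db + da * vb)
```

Here `ev 3 (Product X X)` returns the pair directly, while the first strategy builds `Sum (Product X One) (Product One X)` and then walks that larger tree to reach the same derivative.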
In American English, this would be "a Haskell exercise," pronouncing "a" like "hey" or "ah." In British English, it would be "an 'askell exercise."
I'm American, though. British people, please weigh in.
> For fans of Lisp, there is no question that one motivation is to show how easy it is to implement in Lisp. Lisp provides a natural representation for programs as data and a natural form for writing programs that write programs, which is what we do in ADIL. The code is short, and is in ANSI standard Common Lisp. Since it is not "naive" code written in just the obvious idioms of introductory Lisp, it illustrates, for those with only a cursory familiarity, that Lisp is more than CAR and CDR. In fact we did not use those routines at all.
Ah, but you did use A and D. cAr-tomatic cDr-ivation!
Algebraically, automatic differentiation is the same as adding a nilpotent element e, with e^2 = 0, to your algebra. You can continue this pattern to get higher-order derivatives. For example, if you instead add an element f where f^3 = 0, the coefficient of f^2 is proportional to the second derivative.
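That pattern can be sketched concretely (my own illustration, with made-up names like `Trunc`): track the coefficients of 1, e, and e^2, with e^3 = 0 baked into multiplication, and the e^2 coefficient of f(x + e) comes out as f''(x)/2.

```haskell
-- Truncated polynomials a + b*e + c*e^2, with e^3 = 0.
data Trunc = Trunc Double Double Double
  deriving (Eq, Show)

instance Num Trunc where
  Trunc a b c + Trunc a' b' c' = Trunc (a + a') (b + b') (c + c')
  Trunc a b c * Trunc a' b' c' =
    -- Coefficients of e^3 and higher are simply dropped.
    Trunc (a * a') (a * b' + b * a') (a * c' + b * b' + c * a')
  negate (Trunc a b c) = Trunc (negate a) (negate b) (negate c)
  fromInteger n = Trunc (fromInteger n) 0 0
  abs    = undefined  -- not needed for this sketch
  signum = undefined  -- not needed for this sketch

-- f(x) = x^3 evaluated at x + e, with x = 2:
-- value 8, first derivative 12, and f''(2)/2 = 6 in the e^2 slot.
example :: Trunc
example = let x = Trunc 2 1 0 in x * x * x
```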
This example makes it really clear what's going on. Could someone translate it to do reverse automatic differentiation? That's the one I never quite understand.
You mean integration? That's much harder, but has been done automatically. Symbolic differentiation is easy, because you can just keep blindly applying a set of rewrite rules until none of them apply. That process converges on a unique result. Symbolic integration doesn't converge in that way. More strategy is required, and you're not guaranteed a closed form solution. Mathematica has a good symbolic integrator.
The short answer is no, for a couple of reasons. First, you have to distinguish whether you want the antiderivative or the integral.

In the first case, the problem with finding the antiderivative is that it's not unique. For example, the antiderivative of x can be 0.5 * x^2, or 0.5 * x^2 + 1, or 0.5 * x^2 + 2, etc. Basically, taking derivatives can lose information, and we don't know how to put that information back.

In the second case, we have the integral, but that creates problems because the integral really needs a domain to make sense: over what values are we integrating? In 1-D we can specify this with a bunch of intervals, but in 2 and higher dimensions it becomes difficult. In fact, this is partially why we have meshing tools like Cubit or Gmsh. Yes, we use them with finite element methods, but really, at its core, we need them to break up a domain because we want to integrate over the pieces.
Anyway, with enough information or on a simple enough setup, both can actually be done, but the general case isn't super well defined, which is why the simple answer is no.
The thing that's great about the typeclass approach is that you can do anything you want behind the implementation. You can numerically evaluate the expression, but even cooler, you can recover the parse tree. I never could sort out how to deal with 'if', because it's not typeclassed. If it were, boy could you do some amazing stuff: partial differentiation, tree rewriting, and with the LLVM stuff you could runtime-compile arbitrary functions. Super neat trick.
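A minimal sketch of that parse-tree trick (my own illustration; `Tree` and `poly` are made-up names, not from the post): any function polymorphic over `Num` can be run on plain numbers or, with a different instance, reflected into a syntax tree.

```haskell
-- A syntax-tree type that happens to be an instance of Num.
data Tree = Lit Integer | Var String | Add Tree Tree | Mul Tree Tree
  deriving (Eq, Show)

instance Num Tree where
  (+) = Add
  (*) = Mul
  fromInteger = Lit
  negate t = Mul (Lit (-1)) t
  abs    = undefined  -- not needed for this sketch
  signum = undefined  -- not needed for this sketch

-- One polymorphic definition...
poly :: Num a => a -> a
poly x = x * x + 3 * x + 1

-- ...usable numerically:     poly (2 :: Double)   evaluates to 11.0
-- ...or as a recovered tree: poly (Var "x")
```

The same `poly` source text yields either a number or the whole expression tree, which is exactly what makes later tree rewriting or runtime compilation possible.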
That's also what's great about pure dynamic systems like Smalltalk. 'ifTrue:' is just a message-send, the class "False" doesn't evaluate the block parameter, the class "True" does. And yes, you can then recover the parse tree, for example the Gemstone OODB lets you use arbitrary blocks (closures) as queries, recovers the parse tree and then creates an optimized query from it. Quite neat.
Could automatic differentiation learn, by applying learnable weights to the compiled S-expression atoms and backpropagating errors with stochastic gradient descent?
A program would constrain a task with functional statements, which is then compiled to weighted s-expressions which learn the specific task from training data.
Dual numbers are just a means of formalizing the properties of an epsilon (a tiny change in calculus), and are the means of preserving enough information to think about the function and its derivative at the same time. E.g. (x + ε)² = x² + 2xε + ε², but ε² = 0, so we get x² + 2xε (a tiny change squared becomes an insignificantly tiny change).
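That ε can be sketched as a Haskell type (my own illustration, not code from the post): pairs representing a + b·ε, with the ε² term dropped in multiplication.

```haskell
-- A dual number a + b*e, where the 'b' slot carries the derivative.
data Dual = Dual Double Double
  deriving (Eq, Show)

instance Num Dual where
  Dual a b + Dual a' b' = Dual (a + a') (b + b')
  Dual a b * Dual a' b' = Dual (a * a') (a * b' + b * a')  -- e^2 term dropped
  negate (Dual a b) = Dual (negate a) (negate b)
  fromInteger n = Dual (fromInteger n) 0
  abs    = undefined  -- not needed for this sketch
  signum = undefined  -- not needed for this sketch

-- (x + e)^2 at x = 5: value 25, derivative 10, matching x^2 + 2xe.
sq :: Dual
sq = let x = Dual 5 1 in x * x
```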
johnbender:
http://matt.might.net/articles/parsing-with-derivatives/
pjscott:
https://colah.github.io/posts/2015-08-Backprop/
tome:
I hope this helps!
http://h2.jaguarpaw.co.uk/posts/why-is-naive-symbolic-differ...
kazinator:
Aha, here it is: http://www.cs.berkeley.edu/~fateman/papers/ADIL.pdf
eru:
The problems occur when you want to typeclass your lambdas. The difficulties are mostly surmountable, though. Check out http://okmij.org/ftp/tagless-final/course/
finin:
https://archive.org/stream/bitsavers_mitaiaimAI_878286/AIM-0...
ftp://publications.ai.mit.edu/ai-publications/pdf/AIM-010.pdf
deepnet: A sort of neural-net / functional-program hybrid.