top | item 39795486

MikeBattaglia | 1 year ago

One very interesting thing about automatic differentiation is that you can think of it as involving a new algebra, similar to the complex numbers, where we adjoin an extra element to the reals to form a plane. This new algebra is called the ring of "dual numbers." The difference is that instead of adding a new element "i" with i² = -1, we add one called "h" with h² = 0!

Every element in the dual numbers is of the form a + bh, and in fact the entire ring can be turned into a totally ordered ring in a very natural way: simply declare h < r for any real r > 0. In essence, we are saying h is an infinitesimal - so small that its square is 0. So we have a non-Archimedean ring with infinitesimals - the smallest such ring extending the real numbers.

Why is this so important? Well, if you have some function f which can be extended to the dual number plane - which many can, similar to the complex plane - we have

f(x+h) = f(x) + f'(x)h

Which is little more than restating the usual definition of the derivative: f'(x) = (f(x+h) - f(x))/h

For instance, suppose we have f(x) = 2x² - 3x + 1, then

f(x+h) = 2(x+h)² - 3(x+h) + 1 = 2(x² + 2xh + h²) - 3(x+h) + 1 = (2x² - 3x + 1) + (4x - 3)h

Where the last step just involves rearranging terms and canceling out the h² = 0 term. Note that the expression for the derivative we get, (4x-3), is correct, and magically computed itself straight from the properties of the algebra.
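The same calculation can be mechanized in a few lines. Here's a minimal, hand-rolled Dual class (the class name and operator overloads are illustrative, not any particular library's API) that reproduces the example above:

```python
# Minimal sketch of dual numbers a + b*h with h**2 == 0.
# The "b" slot carries the derivative along automatically.
class Dual:
    def __init__(self, a, b=0.0):
        self.a = a  # real part (the value)
        self.b = b  # dual part (the derivative)

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.a + other.a, self.b + other.b)

    __radd__ = __add__

    def __sub__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.a - other.a, self.b - other.b)

    def __rsub__(self, other):
        return Dual(other) - self

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # (a + bh)(c + dh) = ac + (ad + bc)h, since the h**2 term vanishes
        return Dual(self.a * other.a, self.a * other.b + self.b * other.a)

    __rmul__ = __mul__


def f(x):
    return 2 * x * x - 3 * x + 1


x = Dual(5.0, 1.0)  # seed the dual part with dx/dx = 1
y = f(x)
print(y.a, y.b)  # f(5) = 36, f'(5) = 4*5 - 3 = 17
```

Note that f is written as ordinary arithmetic with no knowledge of derivatives; the (4x - 3) factor emerges purely from the multiplication rule.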

In short, just like creating i² = -1 revolutionized algebra, setting h² = 0 revolutionizes calculus. Most autodiff packages (such as Pytorch) use something not much more advanced than this, although there are optimizations to speed it up (e.g. reverse mode diff).

kristjansson|1 year ago

Dual numbers implement forward mode automatic differentiation, but is there additional value to viewing AD in terms of duals e.g. when we’re implementing reverse mode (backprop)?

naasking|1 year ago

Depends what you mean by "additional value". Dual numbers are very simple, and they're enough for reverse-mode AD too, even via a purely functional implementation:

Provably Correct, Asymptotically Efficient, Higher-Order Reverse-Mode Automatic Differentiation, https://dl.acm.org/doi/pdf/10.1145/3498710

MikeBattaglia|1 year ago

This is a good question that doesn't have a short answer. There are some different philosophical opinions about this.

One way to look at this is to note that even with forward-mode autodiff, there have historically been two different viewpoints on the whole thing, which I'll call the "computer science" view and the "algebraic" view.

The computer science view involves computational graphs with nodes that look like {"value": 123, "deriv": 456}. We store real values along with propagated partial derivatives, and define a custom * operator so that a * b = {"value": a.value * b.value, "deriv": a.value * b.deriv + b.value * a.deriv}. Other functions, like exp, sin, cos, log, etc., are also extended to handle these kinds of input. There's a ton of literature that views things using this kind of framework, going back to the 1960s.

The algebraic view uses dual numbers. Instead of writing {"value": 123, "deriv": 456}, we write 123 + 456h. We get the same results as the above: (a + bh) * (c + dh) = ac + (ad + bc)h. We can extend many functions in a natural way to the dual plane, such as exp, sin, cos, log, and get values there. There's also plenty of literature on these, going back to the late 1800s.
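As a sketch of that extension (the d_exp/d_sin/d_log names are made up for illustration), each elementary function g lifts to the dual plane via g(a + bh) = g(a) + g'(a)·bh, and the chain rule falls out of ordinary composition:

```python
import math

# Lifting elementary functions to the dual plane:
# g(a + b*h) = g(a) + g'(a)*b*h.  A dual here is just a (value, deriv) tuple.
def d_exp(x):
    a, b = x
    return (math.exp(a), math.exp(a) * b)

def d_sin(x):
    a, b = x
    return (math.sin(a), math.cos(a) * b)

def d_log(x):
    a, b = x
    return (math.log(a), b / a)

# The chain rule emerges by composition: d/dx log(sin(x)) = cos(x)/sin(x)
x = (1.0, 1.0)  # evaluate at x = 1, seeded with dx/dx = 1
val, deriv = d_log(d_sin(x))
print(val, deriv)  # log(sin 1) and cos(1)/sin(1)
```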

A modern view is to note that these two things are *identical*. It isn't that forward mode autodiff "uses" dual numbers; it is dual numbers. The set of elements of the form {"value": x, "deriv": y}, with addition and multiplication as stated, satisfies the axioms of a real algebra and is isomorphic to the dual numbers. We could have written {"real_part": x, "dual_part": y} if we wanted. You can see this viewpoint in some of the links I've posted elsewhere here.^[1]

So given all of that background, there are two answers to your original question. The first is to simply say that "reverse-mode autodiff doesn't use dual numbers." Many people hold this view, and I would say it reflects what I've called the "computer science" view above.

The second view is to note that the relationship between reverse-mode autodiff and dual numbers is the same as the relationship between reverse-mode autodiff and forward-mode autodiff. It would be silly to say that they are totally different, unrelated things: at the end of the day all we are really doing is changing the order in which we perform a bunch of multiplies of Jacobian matrices. I tend to view it as similar to the relationship between the DFT and the FFT: there is this super elegant linear algebra view involving DFT matrices. Do we say that the FFT "doesn't use matrices?" Well, I guess, but are we going to go so far as to say that it also doesn't involve linear algebra, etc? That is my view.

There are a few other differences between reverse-mode and forward-mode autodiff. Each individual operation in the computation graph, for instance, can be thought of as an individual instance of forward-mode autodiff. In reverse-mode, on the other hand, we typically store the (back-)propagated adjoints as additional data on the input node objects, not the output nodes. This is useful if we are thinking of backpropagation on a graph. It's up to you if you view these as involving materially different theories or just differences of implementation for the sake of optimization.
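The adjoint-on-input-nodes idea can be sketched as a toy reverse-mode pass (the Node class, its backward method, and the naive recursive traversal are illustrative assumptions, not PyTorch internals; a real implementation would topologically sort the graph instead of recursing):

```python
# Toy reverse-mode autodiff: each node records its value plus, for each
# parent, the local partial derivative; backward() accumulates adjoints
# into the *input* nodes' grad slots.
class Node:
    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0          # adjoint d(output)/d(this node)
        self.parents = parents   # list of (parent_node, local_derivative)

    def backward(self, seed=1.0):
        self.grad += seed
        for parent, local in self.parents:
            parent.backward(seed * local)  # chain rule, propagated backward


def mul(a, b):
    return Node(a.value * b.value, [(a, b.value), (b, a.value)])

def add(a, b):
    return Node(a.value + b.value, [(a, 1.0), (b, 1.0)])


x = Node(3.0)
y = Node(4.0)
z = add(mul(x, y), x)   # z = x*y + x
z.backward()
print(x.grad, y.grad)   # dz/dx = y + 1 = 5, dz/dy = x = 3
```

Note the contrast with the forward-mode picture: nothing here looks like a + bh syntactically, yet every (value, local_derivative) pair on an edge is still dual-number data; only the order of the Jacobian products has changed.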

In short, the main thing is that there's less literature on a purely algebraic version of reverse-mode autodiff in general.

[1] There is one important difference: we often think of the dual numbers as an ordered ring with "h" infinitesimal. The first viewpoint doesn't use this part of the mathematical structure - and it's very interesting to note that it isn't even necessary! Although it's cute to think of h as infinitesimal, the autodiff properties we get flow purely from the algebraic properties of having h^2 = 0, regardless of order. Of course, though, we can always just think of the dual numbers as an unordered ring, if we want.

compacct27|1 year ago

Where do I go to learn what you just said?

smokel|1 year ago

The book referred to in this post has some information about this. The method with "dual numbers" is called "forward-mode automatic differentiation". PyTorch seems to use "reverse-mode automatic differentiation", which does not use dual numbers, but keeps track of the computation graph.

clooper|1 year ago

You are confusing non-standard and dual numbers. The dual numbers are not ordered and contain non-invertible nilpotent elements such as h, which squares to 0.

MikeBattaglia|1 year ago

These things are different and I did mean dual numbers. The dual numbers do form an ordered ring. When you complain about certain elements being non-invertible, I think you are probably complaining that they aren't a field.

fpgamlirfanboy|1 year ago

> Most autodiff packages (such as Pytorch) use something not much more advanced than this

pytorch absolutely does not use the dual number formulation - there are absolutely no magic epsilons anywhere in pytorch's (or tensorflow's) code base. what you're calling duals are the adjoints, which are indeed stored/cached on every node in pytorch graphs.

there's a reason no one uses dual numbers (non-standard analysis) for anything (neither autodiff nor calculus itself): because manipulating infinitesimals like this is fraught formal manipulation (it's algebra...) whereas limits are much more rigorous (bounds, inequalities, convergence, etc.). my favorite question to ask the non-standard analysis n00bs is: please tell me under what conditions this is true

(dx/dy)(dy/dz)(dz/dx) = 1

edit:

anyone that thinks i'm wrong and this other guy is right should go and do some reading, e.g. where this guy tried to make this same point and got shot down:

https://math.stackexchange.com/a/341550

spoiler alert: there's a reason you had to learn epsilon-delta proofs and limits and it's not because your math professors are mean.

this is why i hate this kind of "TIL, gee whiz" math tidbits - they're full of exclamation marks and fancy sounding words ("non-archimedean rings" oooo fancy) but almost always come from a wikipedia level understanding, not actual research.

MikeBattaglia|1 year ago

At the end of the day, if you are storing inputs and outputs to a function as a pair of numbers - one for the actual value, and one for the derivative - and if addition and multiplication work the way you expect and propagate derivatives correctly - then you are using dual numbers, regardless of whether you notate it a + b*h or {"value": a, "derivative": b}.

Pytorch does things slightly differently in that it is mostly focused on reverse-mode autodiff, and so it stores adjoints relative to the overall output rather than partial derivatives relative to the input, but this isn't really an entirely different thing, in the same way that the FFT isn't entirely different from the DFT.

There seems to be some confusion about the relationship between dual numbers and smooth infinitesimal analysis. Both have nilpotent elements, but with dual numbers the background logic is classical, whereas it isn't with smooth infinitesimal analysis.

EDIT: I see you've edited your post to try to get in some extra criticism after I've already responded. That's terrible form, so I'll just respond here.

Dual numbers are a nice way to get started with forward-mode autodiff; the two are so closely related that they are essentially the same thing with different labels. Pytorch instead uses reverse-mode autodiff. Reverse-mode and forward-mode autodiff are different, but not so different that they are entirely separate things. Reverse-mode is, as I put it in my OP, "not much more advanced" than forward-mode, even if not identical.

What is entirely different, much more advanced, and what Pytorch really doesn't do, is anything like the "epsilon-delta proofs" you keep hanging your hat on. If Pytorch did that, it would be useless. The entire point of autodiff is to avoid such things.

Beyond that, I would suggest slowing down a bit as you are mixing quite a few things up. Nonstandard analysis has nothing to do with dual numbers at all, for instance. And you're very much misinterpreting that MSE post of mine you linked to (thanks!).

SideQuark|1 year ago

You've filled this page with comments on non-standard analysis, but the dual numbers have precisely zero to do with it. Calling people n00bs on a topic you apparently do not understand is silly.

Non-standard analysis deals with fields only, and the dual numbers are not a field, there are infinitely many zero divisors.

You should read the wiki pages on both, then maybe the math.stackexchange post below explaining it. The clearest way to maybe grasp the difference for you is that in any formulation of non-standard analysis, the square of any infinitesimal is another infinitesimal, and never 0. In the dual numbers, the square of any infinitesimal is always precisely, exactly zero.

They are so fundamentally different that anyone (like you) that claims to be so cognizant of either would never repeat they are the same as loudly and frequently as you are.

https://math.stackexchange.com/questions/341535/is-the-theor...

bollu|1 year ago

You're right on one count: no library implements forward mode. Hence, you're correct that no autodiff library (including pytorch) implements autodiff this way.

However, *you're wrong* that forward mode cannot be written in terms of dual numbers. The point is that the addition and multiplication operations on dual numbers correspond exactly to the rules for the derivative of a sum and the derivative of a product.

noqc|1 year ago

Nilpotents aren't exactly new.