top | item 39795813

(no title)

At the end of the day, if you are storing inputs and outputs to a function as a pair of numbers - one for the actual value, and one for the derivative - and if addition and multiplication work the way you expect and propagate derivatives correctly - then you are using dual numbers, regardless of if you notate it a + b*h or {"value": a, "derivative": b}.

Pytorch does things slightly differently in that it is mostly focused on reverse-mode autodiff, and so it stores adjoints relative to the overall output rather than partial derivatives relative to the input, but this isn't really an entirely different thing, in the same way that the FFT isn't entirely different from the DFT.

There seems to be some confusion about the relationship between dual numbers and smooth infinitesimal analysis. Both have nilpotent elements, but with dual numbers the background logic is classical, whereas it isn't with smooth infinitesimal analysis.

EDIT: I see you've edited your post to try to get in some extra criticism after I've already responded. That's terrible form, so I'll just respond here.

Dual numbers are a nice way to get started with forward-mode autodiff, to which it is so related that the two are essentially the same thing with different labels. Pytorch instead uses reverse-mode autodiff. Reverse-mode and forward-mode autodiff are different, but not so different that they are entirely different things. Reverse-mode is, as I put it in my OP, "not much more advanced" than forward-mode, even if not identical.

What is entirely different, much more advanced, and what Pytorch really doesn't do, is anything like the "epsilon-delta proofs" you keep hanging your hat on. If Pytorch did that, it would be useless. The entire point of autodiff is to avoid such things.

Beyond that, I would suggest slowing down a bit as you are mixing quite a few things up. Nonstandard analysis has nothing to do with dual numbers at all, for instance. And you're very much misinterpreting that MSE post of mine you linked to (thanks!).

discuss

fpgamlirfanboy|1 year ago

> and if addition and multiplication work the way you expect and propagate derivatives correctly - then you are using dual numbers

you literally started out your miraculous comment with

> This new algebra is called the ring of "dual numbers." The difference is that instead of adding a new element "i" with i² = -1, we add one called "h" with h² = 0!

not some observation about caching derivatives.

so i'll repeat myself for the 3rd time: there are no magical numbers anywhere in pytorch or tensorflow or cafe or any other serious autodiff implementation that abide by the rules you so jubilantly exclaim about.

barfbagginus|1 year ago

See my comment to Mike. I think you're making a valid point here, which is that dual numbers by themselves are not powerful enough to automatically generate derivatives of arbitrary functions, especially given that those functions could be implemented in a FPU core, or using methods like lookup tables that don't lend themselves to dual number differentiation.

Dual numbers help us automatically differentiate things when the functions themselves are implemented as analytic power series that we have to explicitly compute without accelerator help. In such cases we can indeed use them. But to your point, serious forward AD engines need to differentiate functions that are computed in one shot by accelerator hardware.

However Mike makes a very valid counterpoint when he shows forward mode AD in Torch. I believe a careful analysis of Torch's implementation here could bring this conversation to a productive and satisfying conclusion for all participants and our public audience.

My big question here is to what degree did the implementers try to respect the dual number approach? Did they implement a dual tensor class for instance? Do they automatically lift some ordinary computations into dual tensor computations? I honestly have my doubts there.

I have confidence that we can get to the bottom of this. I think that Mike actually does care about automatic differentiation, and would be receptive to discussing this point of subtlety that naive dual number implementations may not be enough for industrial strength AD systems, with clear examples of code and clear reasoning as to how dual numbers fail in important cases.

MikeBattaglia|1 year ago

Thank you for repeating yourself three times. It seems like you think that the dual number algebra involves "magic woo numbers." It seems like you haven't really worked through this stuff too much. I would suggest reading some of the resources above, such as the MIT lecture series. The rest of your points I think I have already addressed, though you ignored in your reply - I've said Pytorch does reverse mode diff several times at this point.