wsmoses's comments

wsmoses | 5 years ago | on: Enzyme – High-performance automatic differentiation of LLVM

Oh for sure, any ML framework worth its salt should do some amount of graph rewriting / transformations.

I was (perhaps poorly) trying to explain that while AD (regardless of implementation in Enzyme, PyTorch, etc.) _can_ avoid caching values using clever tricks, it can't always get away with it. The cache-reduction optimizations really depend on the abstraction level a tool works at. If a tool can only represent the binary choice of whether an input is needed or not, it can miss the fact that perhaps only the first element (and not the whole array/tensor) is needed.

Regarding Enzyme vs. JAX etc., again I think that's the wrong way to think about these tools. They solve problems at different levels and can in fact be used together for mutual benefit.

For example, a high-level AD tool in a particular DSL might know that, algebraically, you don't need to compute the derivative of something because domain knowledge says it is always a constant. Without that domain knowledge, a tool has to actually compute it. On the other side of the coin, there's no way such a high-level AD tool would do all the drudgery of loop-invariant code motion, or even lower-level scheduling/register allocation (see the Enzyme paper for reasons why these can be really useful optimizations for AD).

In an ideal world you'd combine all of this and have AD done in part wherever there's some amount of meaningful optimization (and ideally remove abstraction barriers like, say, a black-box call to cuDNN). We demonstrate this mix of high- and low-level AD in a minimal test case against Zygote [a high-level Julia AD], replacing a scalar code path, which is something Zygote is particularly bad at. This enables both the high-level algebraic transformations of Zygote and the low-level scalar performance of Enzyme, which is what you'd really want.

It looks like the discussion of this has dropped off for now, but I'm sure shoyer would be able to do a much better job of listing the interesting high-level tricks JAX does [and perhaps the low-level ones it misses] as a consequence of its choice of where to live on the abstraction spectrum.

Also, thanks for reminding me about matrix decompositions. I actually think there's a decent chance of handling those somewhat nicely at a low level via various loop analyses, but I got distracted by a large Fortran code for nuclear particles.

wsmoses | 5 years ago | on: Enzyme – High-performance automatic differentiation of LLVM

Regarding differentiating Python via CPython: theoretically yes, though in practice it's likely wiser to use something like Numba, which lowers Python to LLVM directly, to avoid a bunch of abstraction overhead that would otherwise have to be differentiated through. Also, fun fact: JAX can be told to simply emit LLVM, and we've used that as an input for tests :)

You can explicitly define custom gradients by attaching metadata to the function you want to have the custom gradient (and Enzyme will use that even if it could differentiate the original function).

Integral types: mayyybe, depending on what exactly you mean. I can imagine using custom gradient definitions to specify how an integral type can be used in a differentiable way (say, representing a fixed point). We don't support differentiating integral types by approximating them as continuous values, if that's what you're asking. There's no reason we couldn't add this (besides perhaps bit tricks being annoying to differentiate), but we haven't come across a use case.

wsmoses | 5 years ago | on: Enzyme – High-performance automatic differentiation of LLVM

We go into more details in the Limitations section of the paper, but in short Enzyme requires the following properties:

* IR of active functions must be accessible when Enzyme is called (e.g. cannot differentiate dlopen'd functions)

* Enzyme must be able to deduce the types of operations being performed (see paper section on interprocedural type analysis for details why)

* Support for exceptions is limited (running with -fno-exceptions, its equivalent in a different language, or LLVM's exception-lowering pass removes these).

* Support for parallel code (CPU/GPU) is ongoing [and see the prior comment on GPU parallelism for details]

wsmoses | 5 years ago | on: Enzyme – High-performance automatic differentiation of LLVM

Enzyme does indeed handle mutable arrays (both in Enzyme.jl and any other frontend)! If you want to try it out, be forewarned that we're currently upgrading Enzyme.jl for better JIT integration (dynamic re-entry, custom derivative passthrough, nicer garbage collection), so there may be some falling debris.

wsmoses | 5 years ago | on: Enzyme – High-performance automatic differentiation of LLVM

Say you have an existing virus-simulation codebase that you want to use ML on to derive an effective policy. Without an AD tool like Enzyme, you'd have to spend significant time and effort understanding and rewriting that obnoxious 100K lines of Fortran into TensorFlow, when you could've spent it solving your problem. The reason you need this rewriting is that many ML algorithms require the derivatives of the functions they use, and Enzyme provides an easy way to generate derivatives of existing code.

This is also useful in the scientific world where derivatives of functions are commonplace.

You could also use it in more performance-engineering/computer-systems ways, e.g. using the derivatives to perform uncertainty quantification and perhaps decide to use 32-bit floats rather than 64-bit doubles.

wsmoses | 5 years ago | on: Enzyme – High-performance automatic differentiation of LLVM

Reverse mode AD can always get into situations where it needs to store original values (i.e. network state).

One advantage, however, of a more whole-program approach to AD, rather than AD of individual operators, is that one might be able to avoid caching values unnecessarily. For example, if an input isn't modified (and still exists) by the time it is needed in the reverse pass, you don't need to cache it; you can simply use the original input without a copy.

And yes, PyTorch/TF tend to perform a (limited) form of AD as well rather than numerical differentiation (though I do think there may be an option for numerical?).

I wouldn't really position a tool like Enzyme as a competitor to PyTorch/TF (they may have some better domain-specific knowledge after all), but rather a really nice complement. Enzyme can take derivatives of arbitrary functions, in any LLVM-based language rather than the DSL of operators supported by PyTorch/TF. In fact, we built a plugin for PyTorch/TF that uses Enzyme to import custom foreign code as a differentiable layer!

wsmoses | 5 years ago | on: Enzyme – High-performance automatic differentiation of LLVM

Enzyme needs to be able to access the IR of any potentially active functions (calls that it deduced could impact the gradient) to be able to differentiate them.

If all of the code you care about is in one compilation unit, you're immediately good to go.

Multiple compilation units can be handled in a couple of ways, depending on how much energy you want to put into setting them up (and we're working on making this easier).

The easiest way is to compile with Link-Time Optimization (LTO) and have Enzyme run during LTO, which ensures it has access to bitcode for all potentially differentiated functions.

The slightly more difficult approach is to have Enzyme emit derivatives ahead of time, rather than lazily, for any functions you may call in an active way (incidentally, this is where Enzyme's rather aggressive activity analysis is super useful). Leveraging Enzyme's support for custom derivatives, in which an LLVM function declaration can carry metadata marking its derivative function, Enzyme can then be told to use the "custom" derivatives it generated while compiling the other compilation units. This obviously requires more setup, so I'm usually lazy and use LTO, but it can definitely be made into an easier workflow.

wsmoses | 5 years ago | on: Enzyme – High-performance automatic differentiation of LLVM

I think in essence what PartiallyTyped is trying to say is that one potential optimization opportunity in whole-program AD is that you can avoid caching the original inputs of the program if you know the derivative computation won't need them (e.g. an input is only used in a sum, not in a product or something else whose derivative depends on the value). Some ML frameworks must cache all of the inputs to an operation since they don't know whether they will be necessary for its reverse pass. You could go even further and decide to cache a different, smaller set of intermediate values that still lets you compute the gradient.

Beyond cache reduction, in our paper we demonstrate a lot of interesting ways that combining AD with a compiler can create potential speed-up. For example, we are often able to dead-code eliminate part of the original forward-pass code since it's not needed to compute the gradient.

We also have a cool example in the paper showing an asymptotic [O(N^2) => O(N)] speedup on a code that normalizes a vector, because doing AD in the compiler lets Enzyme run after optimization (and, in that example, benefit from loop-invariant code motion in a way that tools outside the compiler cannot).

wsmoses | 5 years ago | on: Enzyme – High-performance automatic differentiation of LLVM

Oh man, that was a fun hack to write. Basically we demonstrated easy-to-set-up AD on Rust by leveraging link-time optimization (LTO) as a way to make sure Enzyme's generate-derivatives "optimization pass" was run.

We're currently working with the Rust ML infrastructure group to make a nice integration of Enzyme into Rust (e.g. nice type-checking, safety, etc). If you're interested in helping, you should join the Rust ML meetings and/or Enzyme weekly meetings and check out https://github.com/rust-ml/Meetings and https://github.com/tiberiusferreira/oxide-enzyme/tree/c-api . There's a bunch of interesting optimizations and nicer UX for interoperability we want to add so more manpower is really helpful!

The most interesting thing from the Rust standpoint is that ideally we'd want Enzyme to be loaded into the Rust compiler as a plugin (much like it is for Julia, Clang for C/C++, etc.), but Rust doesn't support that option yet. This means we can either help push for plugins/custom codegen in Rust, make script-based compilation tools within rustc [I don't remember the specific name, but someone who is more of a Rust expert can surely chime in], or do the sketchy LTO approach above [not always desirable, as it requires running LTO].

Alternatively Enzyme can just become part of LLVM mainline so everyone can use it without a plugin :P We're not quite there yet but we're in the process of becoming a formal LLVM incubator project!

wsmoses | 5 years ago | on: Enzyme – High-performance automatic differentiation of LLVM

For GPUs, there are a couple of different things you might want to do.

You can use existing tools within LLVM to automatically generate GPU code out of existing code, and this works perfectly fine, even running Enzyme first to synthesize the derivative.

You can also consider taking an existing GPU kernel and automatically differentiating it. We currently support a limited set of cases for this (certain CUDA instructions, shared memory, etc.) and are working on expanding coverage as well as improving performance. AD of existing general GPU kernels is interesting [and more challenging] since racy reads in your original code become racy writes in the gradient, which need extra care to make sure they don't conflict. To my knowledge, GPU AD on general programs (i.e. not one specific code) really hasn't been done before, so it's a fun research problem to work on (and if someone knows of existing tools for this, please email me at wmoses at mit dot edu).

wsmoses | 5 years ago | on: Enzyme – High-performance automatic differentiation of LLVM

Adding onto this, numerical derivatives have two potential problems, which is why they tend not to be used in big scientific/ML frameworks.

First, they suffer from accuracy decay. For example, if you were to use the standard f'(x) \approx [f(x+h)-f(x)]/h, you'd subtract two similar numbers and waste many bits of precision. In contrast, if you were to generate the derivative function directly, as below, you'd end up far more accurate.

double square(double x) { return x * x; }

double d_square(double x) { return __enzyme_autodiff(square, x); }

becomes

double d_square(double x) { return 2 * x; }

Second, from a performance perspective, numerical differentiation is really slow, especially for gradient computation: you'd need to evaluate the function once per argument to get the whole gradient. In contrast, reverse-mode AD lets you generate the entire gradient in one call.

In addition to these generic issues, we illustrate in our paper how doing this at a compiler level allows for significant additional optimization (by removing unnecessary code from the forward pass, finding common expressions, etc).

These issues are also amplified for higher-order derivatives and so on.

wsmoses | 5 years ago | on: Enzyme – High-performance automatic differentiation of LLVM

Hi all, another author here and happy to answer any questions!

Some more relevant links for the curious

Github: https://github.com/wsmoses/Enzyme

Paper: https://proceedings.neurips.cc/paper/2020/file/9332c513ef44b...

Basically the long story short is that Enzyme has a couple of interesting contributions:

1) Low-level Automatic Differentiation (AD) IS possible and can be high performance

2) By working at LLVM we get cross-language and cross-platform AD

3) Working at the LLVM level actually can give more speedups (since it's able to be performed after optimization)

4) We made a plugin for PyTorch/TF that uses Enzyme to import foreign code into those frameworks with ease!

wsmoses | 5 years ago | on: Enzyme: Cross-language Automatic differentiation for LLVM IR

Hi all, author here.

A couple of relevant links for the curious

Github: https://github.com/wsmoses/Enzyme

Paper: https://proceedings.neurips.cc/paper/2020/file/9332c513ef44b...

Project: enzyme.mit.edu

Basically the long story short is that Enzyme has a couple of really interesting contributions:

1) Low-level AD IS possible and can be high performance

2) By working at LLVM we get cross-language and cross-platform AD

3) Working at the LLVM level actually can give more speedups (since it's able to be performed after optimization)

4) We made a plugin for PyTorch/TF that uses Enzyme to import foreign code into those frameworks with ease!
