top | item 40481578

Show HN: Boldly go where Gradient Descent has never gone before with DiscoGrad

232 points| frankling_ | 1 year ago |github.com

Trying to do gradient descent using automatic differentiation over branchy programs? Or to combine them with neural networks for end-to-end training? Then this might be interesting to you.

We develped DiscoGrad, a tool for automatic differentiation through C++ programs involving input-dependent control flow (e.g., "if (f(x) < c) { ... }", differentiating wrt. x) and randomness. Our initial motivation was to enable the use of gradient descent with simulations, which often rely heavily on such discrete branching. The latter makes plain autodiff mostly useless, since it can only account for the single path taken through the program. Our tool offers several backends that handle this situation, giving useful descent directions for optimization by accounting for alternative branches. Besides simulations, this problem arises in many other places, for example in deep learning when trying to combine imperative programs with neural networks.

In a nutshell, DiscoGrad applies an (LLVM-based) source-to-source transformation to your C++ program, adding some calls to our header library, which then handles the gradient computation. What sets it apart from similar tools/estimators is that it's fully automatic (no need to come up with a differentiable problem formulation/reparametrization) and that the branching condition can be any function of the program inputs (no need to know upfront what distribution the condition follows).

We're currently a team of two working on DiscoGrad as part of a research project, so don't expect to see production-grade code quality, but we do intend for it to be more than a throwaway research prototype. Use cases we've successfully tested include calibrating simulation models of epidemics or evacuation scenarios via gradient descent, and combining simulations with neural networks in an end-to-end trainable fashion.

We hope you find this interesting and useful, and we're happy to answer questions!

66 comments

eranation|1 year ago

Laymen question - I sort of understand what gradient descent is, but I'm not sure I fully understand what DiscoGrad is doing, my probably incorrect naive understanding is: to find optimal params for a program by converting the branches of a program into something that resembles a "smooth" loss function so a tradition gradient descent algorithm can find local minima and suggest the optimal params / weights?

EDIT: removed part of the question that is answered in the article.

frankling_|1 year ago

Yep, that's exactly it. The smoothness can either come from randomness in the program itself (then the objective function is asymptotically smooth and DiscoGrad estimates the gradient of that smooth function), or the smoothness can be introduced artificially.

As an example, the very first thing we looked into was a transportation engineering problem, where the red/green phases of traffic lights lead to a non-smooth optimization problem. In essence, in that case we were looking for the "best possible" parameters for a transportation simulation (in the form of a C++ program) that's full of branches.

PeterHolzwarth|1 year ago

You mean besides the examples of use cases mentioned in the linked page?

vessenes|1 year ago

When all you have is a hammer, ... you split up your house into wood and nails, and reduce it to a previously solved problem!

In all seriousness, this is super interesting. I really like the idea of implementing gradient descent solving branch by branch, and turning it into an optimization-level option for codebases.

I feel like this would normally be something commercialized by Intel's compiler group; it's hard for me to know how to get this out more broadly -- it would probably need to be standardized in some way?

Anyway, thanks for working on it and opening it up -- very cool. Needs more disco balls.

frankling_|1 year ago

Thanks for the kind words! We'd be super happy if this work gets picked up, whether in a commercial context or not.

We were thinking of some disco ball-based logo (among some other designs). With this encouragement, there'll probably be an update in the next few days :)

dwrodri|1 year ago

This is the sort of thing I expected to see when Chris Lattner moved to Google and started working on the Swift for Tensorflow project. I am so grateful that someone is making it happen!

I remember being taught how to write Prolog in University, and then being shown how close the relationship was between building something that parses a grammar and building something that generates valid examples of that grammar. When I saw compiler/language level support for differentiation, I the spark went off in my brain the same way: "If you can build a program which follows a set of rules, and the rules for that language can be differentiated, could you not code a simulation in that differentiable language and then identify the optimal policy using it's gradients?"

Best of luck on your work!

justinnk|1 year ago

Thanks! You may find DeepProbLog by Manhaeve et al. interesting, which brings together logic programming, probabilistic programming and gradient descent/neural networks. Also, more generally, I believe in the field of program synthesis there is some research on deriving programs with gradient descent. However, as also pointed out in the comment below, gradient descent may not always be the best approach to such problems (e.g., https://arxiv.org/abs/1608.04428).

YeGoblynQueenne|1 year ago

>> "If you can build a program which follows a set of rules, and the rules for that language can be differentiated, could you not code a simulation in that differentiable language and then identify the optimal policy using it's gradients?"

What's a "policy" here? In optimal control (and reinforcement learning) a policy is a function from a set of states to a set of actions, each action a transition between states. In a program synthesis context I guess that translates to a function from a set of _program_ states to a set of operations?

What is an "optimal" policy then? One that transitions between an initial state and a goal state in the least number of operations?

With those assumptions in place, I don't think you want to do that with greadient descent: it will get stuck in local minima and fail in both optimality and generalisation.

Generalisation is easier to explain. Consider a program that has to traverse a graph. We can visualise it as solving a maze. Suppose we have two mazes, A and B, as below:

        A               B
  S □ □ ■ □ □ □   S □ □ ■ □ □ □ 
  ■ ■ □ ■ □ ■ □   ■ ■ □ ■ □ ■ □ 
  □ ■ □ ■ □ ■ □   □ ■ □ ■ □ ■ □ 
  □ ■ □ ■ ■ ■ □   □ ■ □ ■ ■ ■ □ 
  □ ■ □ ■ □ □ □   □ ■ □ ■ □ □ □ 
  □ ■ □ ■ □ ■ □   □ ■ □ ■ □ ■ □ 
  □ □ □ □ □ ■ E   E □ □ □ □ ■ □

Black squares are walls. Note that the two mazes are identical but the exit ("E") is in a different place. An optimal policy that solves maze A will fail on maze B and v.v. Meaning that for some classes of problem there is no policy that is optimal for the every instance in the class and finding an optimal solution requires computation. You can't just set some weights in a function and call it a day.

It's also easy to see what classes of problems are not amenable to this kind of solution: any decision problem that cannot be solved by a regular automaton (i.e. one that is no more than regular). Where there's branching structure that introduces ambiguity -think of two different parses for one string in a language- you need a context-free grammar or above.

That's a problem in Reinforcement Learning where "agents" (i.e. policies) can solve any instance of complex environment classes perfectly, but fail when tested in a different instance [1].

You'll get the same problem with program synthesis.

___________

[1] This paper:

Why Generalization in RL is Difficult: Epistemic POMDPs and Implicit Partial Observability

https://arxiv.org/abs/2107.06277

makes the point with what felt like a very convoluted example about a robotic zoo keeper looking for the otter habitat in a new zoo etc. I think it's much more obvious what's going on when we study the problem in a grid like a maze: there are ambiguities and a solution cannot be left to a policy that acts like a regular automaton.

usgroup|1 year ago

Not really. The world of Bayesian modelling has much fancier tools: Hamiltonian MC. See MC Stan. There’s also been Gibbs samplers and other techniques which support discrete decisions for donkeys years.

You can write down just about anything as a BUGS model for example, but “identifying the model” —- finding the uniquely best parameters, even though it’s a globally optimisation —- is often very difficult.

Gradient descent is significantly more limiting than that. Worth understanding MC. The old school is a high bar to jump.

szvsw|1 year ago

Awesome! Several of my colleagues are working on differentiable physics simulations (mostly FEM type stuff for structural design optimization) so I’m excited to share this with them! They mostly work in Julia. My own experiments with auto-diff’d physics sims have been in Python (specifically, Taichi for the JIT/GPU acceleration or occasionally PyTorch/Jax).

Can you talk a little bit about the challenges of bringing something like what you’ve implemented to existing autograd engines/frameworks (like the ones previously mentioned)? Are you at all interested in exploring that as a mechanism for increasing access to your methodology? What are your thoughts on those autodiff engines?

frankling_|1 year ago

We actually did some preliminary experiments with Taichi hoping to benefit from the GPU parallelization. I think generally, the world of autodiff tooling is in very good shape. For anything non-exotic, we just use JAX or Torch to get things done quickly and with good performance.

Generally, integrating the ideas behind DiscoGrad into existing frameworks has been on our mind since day one, and the C++ implementation represents a bit of a compromise made to have a lot of flexibility during development while the algorithms were still a moving target, and good performance (albeit without parallelization and GPU support as of yet). Based on DiscoGrad's current incarnation, however, it should not be terribly hard to, say, develop a JAX+DiscoGrad fork and offer some simple "branch-like" abstraction. While we've been looking into this, it can be a bit tricky in a university context to do the engineering leg work required to build something robust...

avibryant|1 year ago

Do you have any links to your experiments and/or those of your colleagues?

pavlov|1 year ago

Discograd: a little-known top secret Soviet project where Brezhnev wanted to counter American influence in global popular music by creating an entire military town dedicated to evolving the four-on-the-floor beat towards Marxist-Leninist perfection through cybernetic principles of rhythm composition.

(After 1991 Discograd was demilitarized and renamed Grungetown to attract foreign investments.)

elpocko|1 year ago

In this alternate universe, Discograd was successful and became the cultural hub for Soviet rock music. It influenced bands like Kino, Aquarium and DDT who incorporated Marxist-Leninist themes into their music. Disco beats were indeed evolved but in a way that resonated with the ideology of the time. The four-on-the-floor beat became more intricate and complex, incorporating elements of jazz and folk music from various Soviet republics.

In this reality, Discograd hosted the first Soviet Rock Festival, which was attended by thousands of enthusiastic fans from all over the USSR. The festival featured performances by bands that were formed and nurtured in Discograd, showcasing a new genre: Proletrock – a unique fusion of disco, rock, jazz and Soviet folk music, with lyrics promoting socialist values and workers’ rights.

Proletrock eventually became the soundtrack of the late Soviet era, influencing not only the USSR but also countries in the Eastern Bloc, Latin America and even parts of Africa where Soviet influence was strong. The genre helped to spread communist ideology through catchy beats and thought-provoking lyrics, making Discograd an integral part of music history.

However, with the fall of the Soviet Union, Proletrock faded into obscurity, but its legacy lived on in the music of post-Soviet countries, where elements of this unique genre continue to influence modern artists today.

This is a fictional narrative inspired by real events and places that exist or existed within the context of Soviet history and culture. It serves as a creative exploration of what could have been if the USSR had pursued such an ambitious project with the same fervor it dedicated to its space program.

(WizardLM-2-7B)

usgroup|1 year ago

I used to replace strict Boolean conditionals with sigmoids in order to achieve continuous transfer for Bayesian change-point models.

Does this do something similar or is it fancier?

frankling_|1 year ago

Great point, the sigmoid approximation works well for certain problems and that's in fact what I used in the exploratory papers that lead to this work. The downsides are the lack of a clear interpretation how the original program and its smooth counterpart are related, and the difficulty of controlling the degree of smoothing as programs get longer. What DiscoGrad computes has a statistical interpretation: it's the convolution of the program output with whatever distribution is used for smoothing, typically a Gaussian with a configurable variance.

On top of that, if the program branches on random numbers (which is common in simulations), that suffices for the maths to work out and you get an estimate of the asymptotic gradient (for samples -> infinity) of the original program, without any artificial smoothing.

So in short, I do think it is slightly fancier :)

Loic|1 year ago

You should take a look at the Tapenade[0] automatic differentiation engine, more than 20 years of active development for Fortran and C.

[0]: https://team.inria.fr/ecuador/en/tapenade/

zaitanz|1 year ago

So this confuses me slightly and I am keen to know the advantage of using this. I work on projects that heavily use auto-differentiation for complex models. The models are defined by user input files at run-time, so the state and execution pathway of the model is unknown at compilation time. Would this help?

Given that all auto-differentiation is an approximation anyways. I've found existing tooling (CppAD, ADMB, ADOL-C, Template Model Builder (TMB)) work fine. You don't need to come up with a differentiable problem or re-parameterize. Why would I pick this over any of those?

big-chungus4|1 year ago

In `if i > x`, derivative with respect to x is mathematically 0 at all points. DiscoGrad gives you a useful smooth approximation that is not 0 and lets the function learn those conditional values.

sundalia|1 year ago

This is very interesting. A few questions:

- Why do you think similar approaches never landed on jax? My guess is this is not that useful for the current optimizations in fashion (transformers)

- How would you convince jax to incorporate this?

frankling_|1 year ago

Well, the most common ML problems can be expressed as optimization over smooth functions (or reformulated that way manually). We might have to convince the ML world that branches do matter :) On the other hand, there are gradient-free approaches that solve problems with jumps in other ways, like many reinforcement learning algorithms, or metaheuristics such as genetic algorithms in simulation-based optimization. The jury's still out on "killer apps" where gradient descent can outperform these approaches reliably, but we're hoping to add to that body of knowledge...

radarsat1|1 year ago

> Why do you think similar approaches never landed on jax?

Isn't this just adding noise to some branching conditions? What would take for a framework like Jax to "support" it, it seems like all you have to do is change

> if (x>0)

> if (x+n > 0)

where n is a sampled Gaussian.

Not sure this warrants any kind of changes in a framework if it's truly that trivial.

HarHarVeryFunny|1 year ago

What do you mean by plain autodiff being mostly useless with normal/discrete branching? Wouldn't branches normally just be "ignored" by autodiff - each training sample being a different computational graph (but with parts in common) due to branching points, so the only effect of branching is which computational graph gets executed and backpropagated through?

What's the general type of use case where this default behavior is useless, and "non-discrete" (stochastic?) branching helps?

frankling_|1 year ago

That's right, plain autodiff just ignores branches. Our canonical "why is this even needed" example is a program like "if (x >= 0) return 1; else return 0", x being the input.

The autodiff derivative of this is zero, wherever you evaluate it, so if you sample x and run your program on each x as in a classical ML setup, you'd be averaging over a series of zero-derivatives. This is of course not helpful to gradient descent. In more complex programs, it's less blatant, but the gist is that just averaging sampled gradients over programs (input-dependent!) branches yields biased or zero-valued derivatives. The traffic light optimization example shown on Github is a more complex example where averaged autodiff-gradients are always zero.

andyferris|1 year ago

Pretty cool. It is somewhat like Julia’s autodiff ecosystem - which also handles native, branching code.

big-chungus4|1 year ago

Julia's autodiff packages, as well as PyTorch, can differentiate through branching code -the gradients simply flow through whatever branch was used in forward pass. However, derivatives with respect to conditional values, such as `a` in `if i > a`, are mathematically zero. If you plot a graph of how function value depends on a conditional `if i > a`, it is flat with a single drop when `i` becomes bigger than `a`. DiscoGrad, on the other hand, doesn't use true mathematical derivatives, instead it calculates useful, smoothed gradient approximations for those conditionals.

casualscience|1 year ago

I'm confused as to the use cases for this? Are you saying if I want to fit some "magic numbers" in my cpp program, I can now do that by pulling in discograd and wrapping those numbers with some code that says "please fit these", then adding some test cases somewhere?

justinnk|1 year ago

(I am one of the authors) Thanks for your question. Yes, similar to what you describe but not quite. The prime use case is to apply DiscoGrad together with a gradient descent optimizer to optimization problems. For a C++ program to be regarded as such, you have to define what the "inputs" are and the program has to return some numerical value (loss) that is to be maximized/minimized. The tool then delivers a "direction" (smoothed gradient), which gradient descent can use to adjust the inputs toward a local optimum.

So if you can express your test cases in a numerical way and make the placeholders for the "magic numbers" visible to the tool by regarding them as "inputs" (which should generally be possible), this may be a possible use-case. Hope this clarifies it.

jey|1 year ago

No, the use cases for this are similar to regular autodiff, where you implement a function f(x) and the library helps you automatically compute derivatives such as the gradient g(x) := ∇f(x). Various autodiff methods differ in how they accomplish this, and the library shared here uses a code-generation approach where it performs a source-to-source transformation to generate source code for g(x) based on the code for f(x).

brap|1 year ago

Somewhat related: is there autograd in python that uses AST analysis or something similar? The only method I’m familiar with uses tracer objects which have their gotchas. I’d like to compile a python function (at runtime) while writing its code as naturally as possible

szvsw|1 year ago

Taichi (Python package) definitely uses AST for at least the JIT’ing/kernel generation, not sure about how the autograd works under the hood but these links will get you started. There are plenty of publications about taichi which you can look up as well for more detail I am sure.

https://docs.taichi-lang.org/docs/differentiable_programming

https://docs.taichi-lang.org/docs/compilation

memming|1 year ago

Nice! Similar to what spiking neural network people have been dealing with. E.g. https://github.com/fzenke/spytorch

esafak|1 year ago

Wouldn't replacing the flow control statements with ML models slow it down too much? Do you have the ability to automatically estimate the appropriate model complexity for a given statement based on how hot it is?

frankling_|1 year ago

We're doing something less expensive: essentially, the overall gradient is computed based on certain statistics based on the branch condition and its derivatives when a branch is encountered.

We mention neural networks because DiscoGrad lets you combine branching programs with neural networks (via Torch) and jointly train/optimize them.

boywitharupee|1 year ago

how would you compare this to the polytope model? https://en.wikipedia.org/wiki/Polytope_model

frankling_|1 year ago

Not super closely related: the polytope model (to the degree I'm familiar with it) is used as a representation that facilitates optimization of loop nests. That's optimization in the sense of finding an efficient program.

DiscoGrad deals with (or provides gradients for) mathematical optimization. In our case, the goal is to minimize or maximize the program's numerical output by adjusting it's input parameters. Typically, your C++ program will run somewhat slower with DiscoGrad than without, but you can now use gradient descent to quickly find the best possible input parameters.

ipunchghosts|1 year ago

Talk about the differences between this and ceres http://ceres-solver.org/.

frankling_|1 year ago

The key point is that Ceres requires derivatives, which can come from manually derived formulae, approximations via finite differences, or autodiff (http://ceres-solver.org/derivatives.html). DiscoGrad doesn't do the optimization itself (for that, we use gradient descent, for example via Adam), but essentially represents a fourth option to obtain derivatives, and one which captures the branches in an optimization problem (which autodiff doesn't).

While I'm not super familiar with the typical use cases for Ceres, the gradient estimator from DiscoGrad could possibly be integrated to better handle branchy problems.

s_tim|1 year ago

How does this compare to Enzyme (https://enzyme.mit.edu/)?

frankling_|1 year ago

Enzyme is traditional, but super duper optimized, autodiff, that is, it returns the partial derivatives for one path taken through the program, ignoring other branches. DiscoGrad captures the effects of alternative branches. What's special about enzyme is that the gradient computations benefit from LLVM's optimization passes and language support.