
frankling_ | 1 year ago

Great point, the sigmoid approximation works well for certain problems, and that's in fact what I used in the exploratory papers that led to this work. The downsides are the lack of a clear interpretation of how the original program and its smooth counterpart are related, and the difficulty of controlling the degree of smoothing as programs get longer. What DiscoGrad computes has a statistical interpretation: it's the convolution of the program output with whatever distribution is used for smoothing, typically a Gaussian with a configurable variance.
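For readers unfamiliar with the idea, here's a minimal sketch of Gaussian smoothing by convolution (my own toy code, not DiscoGrad's implementation): the smoothed program is the expectation of the output under Gaussian input perturbations, and its gradient can be estimated by plain sampling, without ever differentiating through the branch.

```python
import numpy as np

def heaviside(x):
    # A discontinuous "program": a single branch on x.
    return 1.0 if x > 0 else 0.0

def smoothed_grad(f, x, sigma=0.5, n=100_000, seed=0):
    # Gradient of the Gaussian-smoothed f at x, estimated by sampling.
    # Uses the identity d/dx E[f(x + sigma*eps)] = E[f(x + sigma*eps) * eps / sigma]
    # for eps ~ N(0, 1), so no derivative of f itself is needed.
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n)
    vals = np.array([f(x + sigma * e) for e in eps])
    return float(np.mean(vals * eps / sigma))
```

At x = 0 with sigma = 0.5, the estimate approaches the N(0, sigma^2) density at 0 (roughly 0.8): the smoothed counterpart of the Heaviside step's Dirac-delta derivative.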

On top of that, if the program branches on random numbers (which is common in simulations), that randomness alone suffices for the maths to work out: you get an estimate of the gradient of the original program that becomes exact as the number of samples goes to infinity, without any artificial smoothing.
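To illustrate with a toy example (again my sketch, not from the DiscoGrad codebase): each individual run of the program below is a discontinuous step in theta, but because the branch condition contains a random draw, the expectation over runs is already a smooth function of theta.

```python
import numpy as np

def noisy_program(theta, rng):
    # A single branch on a random number drawn inside the program;
    # each individual run is a discontinuous function of theta.
    return 1.0 if theta + rng.standard_normal() > 0 else 0.0

rng = np.random.default_rng(0)
theta = 0.3
# Averaging many runs approximates E[output] = Phi(theta), the standard
# normal CDF: a smooth function whose gradient exists with no added noise.
mean = np.mean([noisy_program(theta, rng) for _ in range(200_000)])
```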

So in short, I do think it is slightly fancier :)


szvsw | 1 year ago

I've also used the sigmoid approximation to relax some problems (specifically for step changes in field properties in PINN problems) into continuous variables, so it's cool to see this discussion here from a different perspective! Slightly off topic, but the only other things I'm aware of that are vaguely related are the Gumbel-softmax trick for making sampling from categorical distributions differentiable, and the Gaussian reparameterization trick (e.g. as used in VAEs). I'm curious whether this is at least somewhat related to the approach taken in your work, in spirit if not in technical implementation?
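For context, a minimal version of the Gumbel-softmax relaxation mentioned above (my own sketch; the function name is mine): add Gumbel noise to the logits, then take a temperature-scaled softmax, which yields a differentiable "soft" categorical sample.

```python
import numpy as np

def gumbel_softmax(logits, tau=0.5, seed=0):
    # Soft categorical sample: perturb logits with Gumbel(0, 1) noise,
    # then apply a softmax with temperature tau. As tau -> 0 the output
    # approaches a one-hot sample; larger tau gives a smoother relaxation.
    rng = np.random.default_rng(seed)
    g = -np.log(-np.log(rng.uniform(size=len(logits))))  # Gumbel(0, 1) draws
    y = (np.asarray(logits, dtype=float) + g) / tau
    y = np.exp(y - y.max())  # numerically stable softmax
    return y / y.sum()
```

The resulting vector lies on the probability simplex, so gradients can flow through it, unlike a hard argmax sample.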

frankling_ | 1 year ago

Yeah, those tricks are highly related to what we do, the main difference being that we don't require a priori information about the distributions involved in the program. Instead, we compute a density estimate of the distribution of branch conditions at runtime, which makes things quite general and fully automatic, at the cost of some accuracy.
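As a rough illustration of the runtime density-estimation idea (a hypothetical sketch of mine, not DiscoGrad's actual implementation): collect samples of a branch condition c across runs, fit a Gaussian kernel density estimate, and read off the density at the flip point c = 0 as a measure of how sensitive the branch outcome is to small perturbations.

```python
import numpy as np

def branch_density_at_zero(cond_samples, bandwidth=None):
    # Gaussian kernel density estimate of the distribution of a branch
    # condition c (branch taken iff c > 0), evaluated at the flip point 0.
    c = np.asarray(cond_samples, dtype=float)
    if bandwidth is None:
        # Silverman's rule-of-thumb bandwidth
        bandwidth = 1.06 * c.std() * len(c) ** (-0.2)
    z = c / bandwidth
    return float(np.mean(np.exp(-0.5 * z * z) / np.sqrt(2 * np.pi)) / bandwidth)
```

For standard-normal condition samples this recovers roughly the N(0, 1) density at 0, about 0.4; no a priori knowledge of the condition's distribution is needed.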

As an aside, the combination "known distributions + automation" is covered in the Julia world by stochasticAD (https://github.com/gaurav-arya/StochasticAD.jl).

usgroup | 1 year ago

Does that imply that every possible branch is optimised separately and then convolved thereafter?

If so, does it scale for very branchy programs?

Do you have any comparisons to a Gibbs based approach for any of the use case examples?

frankling_ | 1 year ago

The convolution is approximated via a form of sampling with additional bookkeeping at each encountered branch. How well that scales for deeply branching programs depends on the probabilities of the branches and the diversity in the output on the resulting paths, the worst case being a program where all branches are equally likely and each path generates an entirely different output (as in a hash function, for example). In practice, we've been dealing with problems involving up to tens of thousands of branches or so.

We haven't done a direct comparison to MCMC approaches yet, but it's on the to-do list. My intuition is that MCMC will win out for problems where finding "just any" local optimum is not good enough.