ngriffiths | 5 months ago

> Therefore, despite the insanely large number of adjustable parameters, general solutions, that are meaningful and predictive, can be found by adding random walks around the objective landscape as a partial strategy in combination with gradient descent.

Are there methods that specifically apply this idea?

I guess this is a good explanation for why deep learning isn't just automatically impossible: if local minima were everywhere, training would be hopeless. But on the other hand, usually the goal isn't to add more and more parameters; it's to add just enough so that common features can be identified, but not enough to "memorize the dataset," and to design an architecture that is flexible enough while still being quite restricted, so it can't represent just any function. And of course in many cases (especially when there's less data) it makes sense to manually design transformations from the high-dimensional space to a lower-dimensional one that contains less noise and can be modeled more easily.

The article feels connected to the manifold hypothesis, where the function we're modeling has some projection into a low dimensional space, making it possible to model. I could imagine a similar thing where if a potential function has lots of ridges, you can "glue it together" so all the level sets line up, and that corresponds with some lower dimensional optimization problem that's easier to solve. Really interesting and I found it super clearly written.

tech_ken | 5 months ago

> Are there methods that specifically apply this idea?

Stochastic gradient descent is basically this (not exactly the same, but the core intuitions align, IMO). It's not exactly optimization, but Hamiltonian MCMC also seems highly related.
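To make the SGD connection concrete: the minibatch gradient equals the full gradient plus zero-mean noise, so each update is a gradient-descent step with a random-walk perturbation superimposed. A minimal sketch on a toy quadratic loss (all names and parameter values here are illustrative, not from the article):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: noisy observations around a true parameter value.
true_w = 3.0
data = true_w + rng.normal(scale=0.5, size=1000)

# Loss over a batch: mean((w - x_i)^2); its gradient is 2 * mean(w - x_i).
def minibatch_grad(w, batch):
    return 2.0 * np.mean(w - batch)

w = 0.0
lr = 0.1
for step in range(200):
    batch = rng.choice(data, size=32)  # random minibatch
    # minibatch gradient = full gradient + zero-mean noise,
    # i.e. a descent step plus a random-walk kick
    w -= lr * minibatch_grad(w, batch)

# w ends up fluctuating in a small neighborhood of the
# full-batch minimizer, np.mean(data)
```

The noise never fully dies out at a fixed learning rate, which is exactly the "random walk around the objective landscape" ingredient: it lets the iterate jiggle across shallow ridges instead of sticking to the first stationary point it meets.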

> I could imagine a similar thing where if a potential function has lots of ridges, you can "glue it together" so all the level sets line up, and that corresponds with some lower dimensional optimization problem that's easier to solve.

Excellent intuition; this is exactly the idea behind HMC (as far as I recall), and the concrete math behind it is (IIRC) a "fiber bundle".

evanb | 5 months ago

HMC was essentially designed to mix random walks (the momentum refresh step) with gradient descent (the state likes to "roll down the potential", i.e. minimize the action/loss).
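The two ingredients are easy to see in the algorithm itself: each step draws a fresh random momentum (the random-walk part) and then follows the gradient of the potential via leapfrog integration (the rolling-downhill part). A minimal sketch targeting a 1-D standard normal, so U(q) = q²/2 (step size and trajectory length are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def U(q):        # potential = negative log density (up to a constant)
    return 0.5 * q * q

def grad_U(q):
    return q

def hmc_step(q, step_size=0.2, n_leapfrog=10):
    p = rng.normal()  # momentum refresh: the random-walk ingredient
    q_new, p_new = q, p
    # Leapfrog integration: gradient-driven "rolling down the potential".
    p_new -= 0.5 * step_size * grad_U(q_new)
    for i in range(n_leapfrog):
        q_new += step_size * p_new
        if i < n_leapfrog - 1:
            p_new -= step_size * grad_U(q_new)
    p_new -= 0.5 * step_size * grad_U(q_new)
    # Metropolis accept/reject on the joint Hamiltonian H = U(q) + p^2/2.
    dH = (U(q_new) + 0.5 * p_new**2) - (U(q) + 0.5 * p**2)
    return q_new if np.log(rng.uniform()) < -dH else q

q, samples = 0.0, []
for _ in range(5000):
    q = hmc_step(q)
    samples.append(q)

# The sample mean and variance should approximate the N(0, 1) target.
```

Swapping the sign convention turns the same loop into an optimizer-with-noise: keep the momentum refresh and you explore; cool it down and you settle into minima, which is the mixture the parent comment describes.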