virtualbluesky | 2 years ago
1. Gradient descent is path-dependent and never forgets its initial conditions. Intuitively reasonable - the method can only make local decisions, and it decides it's 'done' by watching its steps shrink. There's no single 'right answer' to discover; each initial condition follows a subtly different path to 'slow enough'...
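A toy sketch of that point (my own illustration, not the paper's setup): an under-determined least-squares problem has many weight vectors that fit the data exactly, and which one gradient descent lands on depends on where it starts.

```python
import numpy as np

# Under-determined least squares: 5 data points, 20 parameters, so
# infinitely many exact fits. GD only makes local moves, and the part
# of the initialization it never needs to touch just persists.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))
y = rng.normal(size=5)

def gd(w, lr=0.01, steps=20000):
    for _ in range(steps):
        w = w - lr * X.T @ (X @ w - y)  # gradient of 0.5*||Xw - y||^2
    return w

w_a = gd(np.zeros(20))           # start at the origin
w_b = gd(rng.normal(size=20))    # start somewhere random

# Both fit the training data essentially perfectly...
print(np.linalg.norm(X @ w_a - y))  # ~0
print(np.linalg.norm(X @ w_b - y))  # ~0
# ...but they are different solutions: the initial condition survives.
print(np.linalg.norm(w_a - w_b))    # clearly nonzero
```

The gap between `w_a` and `w_b` is exactly the null-space component of the random init, which the gradient never sees.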
because...
2. With enough simplification, the path taken by each optimization process can be modeled by a matrix (their covariance matrix, K) with well-defined properties. It acts like a curvature on the mathematical space, with some useful side-effects: eigen-magic on K justifies why the optimization process locks some parameters in place quickly, while others take a long time to settle.
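The eigen-magic, as I read it (a sketch under linearized dynamics, not the paper's code): if the training error evolves as e ← (I - lr·K)e, then along each eigenvector of K the error shrinks by a factor (1 - lr·λ) per step, so large-eigenvalue directions lock in fast and tiny ones crawl.

```python
import numpy as np

# A symmetric PSD stand-in for the covariance/kernel matrix K.
rng = np.random.default_rng(1)
A = rng.normal(size=(6, 6))
K = A @ A.T
lam, V = np.linalg.eigh(K)       # eigenvalues ascending, eigenvectors in columns

lr = 0.9 / lam.max()             # stable step size
e = rng.normal(size=6)           # initial error
for _ in range(50):
    e = e - lr * (K @ e)         # linearized error dynamics: e <- (I - lr*K) e

coeffs = V.T @ e                 # remaining error along each eigendirection
print(abs(coeffs[-1]))           # largest-eigenvalue mode: essentially gone
print(abs(coeffs[0]))            # smallest-eigenvalue mode: decayed far less
```

Same number of steps for every direction, wildly different progress - that's the "some parameters settle fast, others slowly" story.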
which is fine, but doesn't help explain why wild over-fitting doesn't plague high-dimensional models (would you even notice if it did?). Enter implicit regularization, stage left. And mostly passing me by on the way in, but:
3. Because they decided to use random noise to generate the functions they combine to solve the optimization problem, they can put an additional layer of interpretation on the properties of the aforementioned matrix: it implies the result will only use each constituent function 'as necessary' (i.e. regularized, rather than wildly amplifying pairs of coefficients)
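A stripped-down version of that "as necessary" behaviour (my gloss, with plain Gaussian random features standing in for whatever functions the paper actually uses): run gradient descent from zero on a bank of random features, and among all coefficient vectors that fit the data exactly it converges to the minimum-norm one - no explicit penalty term anywhere, hence "implicit" regularization.

```python
import numpy as np

# 8 data points, 200 random features: infinitely many interpolants.
rng = np.random.default_rng(2)
Phi = rng.normal(size=(8, 200))   # random feature matrix (stand-in)
y = rng.normal(size=8)

c = np.zeros(200)                 # crucially, start at zero
lr = 1.0 / np.linalg.norm(Phi, 2) ** 2   # stable step size
for _ in range(2000):
    c = c - lr * Phi.T @ (Phi @ c - y)

c_min = np.linalg.pinv(Phi) @ y   # explicit minimum-norm interpolant
print(np.linalg.norm(Phi @ c - y))   # ~0: fits the data exactly
print(np.linalg.norm(c - c_min))     # ~0: GD found the min-norm solution
```

Starting from zero, GD never leaves the row space of `Phi`, so each feature's coefficient ends up no larger than interpolation requires.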
And then something something Bayesian, which I'm happy to admit I'm not across
quickthrower2 | 2 years ago