(no title)
cocomutator | 1 year ago
I'm struggling to put a finger on it, but it feels like the approach in the blog post finds the _minimum_ complexity solution, akin to driving the regularization strength in conventional ML higher and higher during training and returning the solution at the highest such regularization that does not materially degrade the error (epsilon in their paper). Information theory plays the role of a measuring device that puts the error term and the model complexity on a common scale, so they can be traded off against each other during training.
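To make the regularization analogy concrete, here is a minimal sketch (not the blog post's information-theoretic method): sweep an L2 penalty and keep the most heavily regularized model whose validation error stays within epsilon of the best error. The alpha grid, epsilon value, and synthetic data are all illustrative assumptions.

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    # Illustrative synthetic data, not from the paper.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 20))
    y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=200)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

    alphas = np.logspace(-4, 3, 30)   # candidate regularization strengths
    errors = []
    for alpha in alphas:
        model = Ridge(alpha=alpha).fit(X_tr, y_tr)
        errors.append(mean_squared_error(y_val, model.predict(X_val)))

    epsilon = 0.05                    # tolerated degradation in error (assumed)
    best_err = min(errors)
    # Largest alpha (i.e. lowest-complexity model) whose error is still
    # within epsilon of the best error seen across the sweep.
    chosen = max(a for a, e in zip(alphas, errors) if e <= best_err + epsilon)
    print(f"chosen alpha: {chosen:.4g}")

The information-theoretic version would replace the validation-error-vs-alpha trade-off with a single description-length-style objective, but the "back off complexity until the error budget is hit" shape is the same.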
I haven't thought about it much, but I've seen papers speculating that what happens in double descent is the discovery of lower-complexity solutions.