tehsauce | 1 year ago

Grokking is a sudden, large jump in test accuracy as training steps increase, well after training accuracy has fully converged. Double descent is test performance first improving, then degrading, and then finally improving again as the model's parameter count increases.
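To make the grokking definition concrete: the signature is a long gap between when train accuracy saturates and when test accuracy jumps. A minimal sketch with synthetic accuracy curves (illustrative only, not from a real training run):

```python
import numpy as np

# Synthetic curves: train accuracy converges early, test accuracy jumps much later.
steps = np.arange(0, 10000, 100)
train_acc = np.clip(steps / 1000, 0, 1)        # converged by step 1000
test_acc = np.where(steps < 8000, 0.1, 0.99)   # sudden jump at step 8000

def grokking_gap(steps, train_acc, test_acc, threshold=0.95):
    """Steps between train-accuracy convergence and the delayed test-accuracy jump."""
    t_train = steps[np.argmax(train_acc >= threshold)]  # first step above threshold
    t_test = steps[np.argmax(test_acc >= threshold)]
    return t_test - t_train

print(grokking_gap(steps, train_acc, test_acc))  # → 7000
```

A large positive gap is the grokking signature; for ordinary training the two thresholds are crossed at roughly the same step.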


scarmig | 1 year ago

What they share is a subversion of the naive framework that ML works simply by performing gradient descent over a loss landscape. Double descent subverts it by showing that generalization isn't monotonic in parameter count; grokking subverts it by showing that learning continues after the training loss has converged.
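The non-monotonicity in parameter count shows up even in plain linear regression when you use the minimum-norm least-squares solution: test error falls, spikes near the interpolation threshold (features ≈ training samples), then falls again. A toy sketch (my own setup, not from any paper discussed here):

```python
import numpy as np

rng = np.random.default_rng(0)

def avg_test_error(d, n_train=20, n_test=200, d_signal=5, noise=0.1, trials=30):
    """Mean test MSE of least squares using the first d features, averaged over trials."""
    errs = []
    for _ in range(trials):
        w_true = rng.normal(size=d_signal)
        def sample(n):
            X = rng.normal(size=(n, max(d, d_signal)))
            y = X[:, :d_signal] @ w_true + noise * rng.normal(size=n)
            return X[:, :d], y  # the model only sees the first d features
        X_tr, y_tr = sample(n_train)
        X_te, y_te = sample(n_test)
        # lstsq returns the minimum-norm solution when d > n_train
        w = np.linalg.lstsq(X_tr, y_tr, rcond=None)[0]
        errs.append(float(np.mean((X_te @ w - y_te) ** 2)))
    return float(np.mean(errs))

for d in (5, 10, 20, 40, 80):  # n_train = 20 is the interpolation threshold
    print(d, avg_test_error(d))
```

Test error peaks around d = n_train and then decreases again as d grows past it, so "more parameters" is neither monotonically good nor monotonically bad.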

I'd put the lottery ticket hypothesis in the same bucket of "things that may happen that don't make sense at all for a simple optimization procedure."

baq | 1 year ago

My takeaway from the paper is that you can guide training by adding, or switching to, a more difficult loss function once the basics are learned. It looks like they never trained long enough to reach overfitting after grokking, so maybe there's more to discover further down the training alley.