craffel | 6 years ago

It actually can be more pernicious than that: https://arxiv.org/abs/1802.08232

However, note that the dataset used to train GPT-2 is about 20x smaller than C4. I'm not 100% sure how many times the training set was repeated over the course of GPT-2's training, but it was likely many times. I stand by my statement (that memorization is unlikely with SGD and no repetition of training data), but I would be happy to be proven otherwise.
