item 37386663


stephenroller | 2 years ago

One noteworthy thing is that no one is posting validation curves, only training curves. All these models will happily drive training loss to near zero given infinite compute, as the model overfits to the dataset -- there are no regularizers in any modern LLMs. The validation curves would be considerably more convincing.

The counterargument to the above is that none of these models were really trained for multiple epochs: it's hard to overfit data you've only seen once. But to get to 70T tokens, you'd inevitably have to start using many epochs.
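The train/val divergence being described can be shown on a toy problem. This is purely an illustrative sketch with synthetic data (a noisy sine curve and polynomial fits), nothing LLM-specific: as capacity grows, training error goes to near zero while held-out error stays large.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny synthetic task: noisy sine, 15 training points, 200 held-out points.
x_train = rng.uniform(-1, 1, 15)
y_train = np.sin(3 * x_train) + 0.1 * rng.normal(size=15)
x_val = rng.uniform(-1, 1, 200)
y_val = np.sin(3 * x_val) + 0.1 * rng.normal(size=200)

def mse(deg):
    # Fit a degree-`deg` polynomial on the training split only,
    # then report (train MSE, validation MSE).
    coeffs = np.polyfit(x_train, y_train, deg)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    val_err = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    return train_err, val_err

for deg in (3, 14):
    tr, va = mse(deg)
    print(f"degree {deg:2d}: train MSE {tr:.4g}, val MSE {va:.4g}")
```

With 15 points, the degree-14 fit can interpolate the training set (train MSE near zero), but only the validation curve reveals that generalization got worse, which is the point about posting training curves alone.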

Straw | 2 years ago

The validation curves will look identical. These models are far too small to overfit to the training set.

With a large enough model and many epochs, you can certainly get overfitting, but for one epoch the val/train curves look exactly the same, and I'd expect that a 7B model will never overfit on 2T tokens no matter how many epochs you do.
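Some back-of-envelope arithmetic behind that intuition (the ~20 tokens/parameter figure is the Chinchilla compute-optimal heuristic; the other numbers are just the round figures from the thread):

```python
# Tokens-per-parameter ratio for a 7B model trained on ~2T tokens.
params = 7e9    # 7B parameters
tokens = 2e12   # ~2T pretraining tokens
ratio = tokens / params
print(f"{ratio:.0f} tokens per parameter")  # 286 tokens per parameter

# The Chinchilla heuristic puts compute-optimal training around
# ~20 tokens/parameter, so this regime is data-rich relative to
# model capacity, which is why memorizing the corpus is implausible.
```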

haldujai | 2 years ago

> data you've only seen once

Is this still true given that they're upsampling in the pretraining dataset? I don't recall any details on how and to what extent they did this in the Llama2 paper but presumably some fraction of those 2T training tokens is repeated data.

MetaAI hasn't been as averse to repeated tokens as other groups; they trained the now-forgotten Galactica for multiple epochs with good results.

> The validation curves would be considerably more convincing.

What are they validating on? I was under the impression they weren't splitting the pretraining corpus.

stephenroller | 2 years ago

The llama1 team did not have a validation set. I don’t know what the Llama2 team did - I left before seeing any of the details.

My guess is Llama2 upsamples Wikipedia a good bit, but given they didn’t report any information about training data, it’s hard to say.

visarga | 2 years ago

> there are no regularizers in any modern LLMs.

Using a large & diverse training set is the best regularizer, but I think there are also weight decay and dropout in transformers.
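A minimal numpy sketch of the two regularizers mentioned, assuming their usual formulations (AdamW-style decoupled weight decay and inverted dropout). This is illustrative only, not any particular model's training code:

```python
import numpy as np

rng = np.random.default_rng(0)

def decayed_step(w, grad, lr=1e-3, weight_decay=0.1):
    """One simplified update: gradient step plus decoupled weight decay.
    (Real AdamW also keeps Adam's moment estimates; omitted for brevity.)"""
    return w - lr * grad - lr * weight_decay * w

def dropout(x, p=0.1, training=True):
    """Inverted dropout: zero each activation with prob p during training,
    rescale survivors by 1/(1-p) so the expected activation is unchanged."""
    if not training:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

# Even with zero gradient, weight decay pulls weights toward zero:
w = decayed_step(np.ones(4), grad=np.zeros(4))
print(w)  # [0.9999 0.9999 0.9999 0.9999]
```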

euclaise | 2 years ago

RWKV also uses some sort of L2-esque regularization, which was supposedly an idea taken from PaLM (although I can't find a source on this point, other than some message in the RWKV Discord).
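If the regularizer in question is PaLM's auxiliary "z-loss" (the PaLM paper adds a term of the form 1e-4 * log^2(Z), where Z is the softmax normalizer, to keep logits from drifting to large magnitudes), a sketch looks like this; whether RWKV uses exactly this form is, as noted above, unsourced:

```python
import numpy as np

def z_loss(logits, coeff=1e-4):
    """PaLM-style auxiliary loss: coeff * log(Z)^2, where
    Z = sum(exp(logits)) is the softmax normalizer. Minimizing it
    nudges log Z toward 0, i.e. keeps the logit scale bounded."""
    log_z = np.log(np.sum(np.exp(logits)))  # naive logsumexp; fine for small logits
    return coeff * log_z ** 2

logits = np.array([0.0, 0.0])  # Z = 2, so log Z = ln 2
print(z_loss(logits))          # ~4.8e-05
```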