duchenne | 2 years ago
This paper shows how the loss decreases as you scale up model size, compute, or training dataset size.
From the article:
> Convergence is inefficient: When working within a fixed compute budget C but without any other restrictions on the model size N or available data D, we attain optimal performance by training very large models and stopping significantly short of convergence.
It states clearly that when you are limited by training-time compute, you should train a larger model and stop it short of convergence rather than train a smaller model all the way to convergence.
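To see why this follows, here is a toy sketch of the compute-optimal trade-off, assuming the L(N, D) fit reported in the paper (eq. 1.5) with its approximate fitted constants, and the common approximation that training compute is C ≈ 6·N·D FLOPs. The grid of candidate sizes and the constants here are illustrative, not the paper's exact procedure.

```python
# Loss fit from the scaling-laws paper (approximate constants):
#   L(N, D) = [ (N_c / N)^(alpha_N / alpha_D) + D_c / D ]^alpha_D
ALPHA_N, ALPHA_D = 0.076, 0.095
N_C, D_C = 8.8e13, 5.4e13  # non-embedding parameters, tokens

def loss(n_params: float, n_tokens: float) -> float:
    """Predicted loss for a model of n_params trained on n_tokens."""
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / n_tokens) ** ALPHA_D

def best_size(compute: float, sizes) -> float:
    """For a fixed FLOP budget, pick the model size minimizing loss.
    Uses C ~ 6 * N * D, so the token budget is D = C / (6 * N)."""
    return min(sizes, key=lambda n: loss(n, compute / (6 * n)))

# Candidate sizes: 1e6 .. 1e12 parameters, quarter-decade steps.
sizes = [10 ** (e / 4) for e in range(24, 49)]
for c in (1e18, 1e20, 1e22):
    n = best_size(c, sizes)
    print(f"C={c:.0e} FLOPs -> optimal N~{n:.1e} params, D~{c / (6 * n):.1e} tokens")
```

The optimal N grows with the budget, and at every budget the minimizer is a model large enough that its token budget D = C/(6N) leaves it far from convergence, matching the quoted conclusion.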
swyx | 2 years ago
highfrequency | 2 years ago