ppaattrriicckk | 2 years ago
Not outright the most efficient utilization of data, but as others have mentioned, there might still be something left to gain from not solely optimizing for compute. E.g. look at Figure 9 in [3]: although the largest model obviously utilized the data most efficiently, it's not perfectly clear - to me at least - whether some over-parameterization necessarily leads to a decrease in test/out-of-sample performance.
Of course, LLaMA was trained with far fewer parameters relative to its dataset size; I only mentioned LLaMA as a point of reference, since it was trained on the largest dataset known to me.
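For what it's worth, the joint fit L(N, D) in [3] (eq. 1.5) can be evaluated directly to see what the fit itself predicts about over-parameterization. A minimal sketch, using the fitted constants reported in that paper (values approximate, copied from memory):

```python
# Sketch of the L(N, D) scaling-law fit from Kaplan et al. [3], eq. (1.5).
# Constants are the paper's reported fits (approximate):
ALPHA_N = 0.076       # exponent for parameter count
ALPHA_D = 0.095       # exponent for dataset size
N_C = 8.8e13          # critical non-embedding parameter count
D_C = 5.4e13          # critical token count

def loss(n_params: float, n_tokens: float) -> float:
    """Predicted test loss L(N, D) for N parameters trained on D tokens."""
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / n_tokens) ** ALPHA_D

# Hold D fixed and grow N: the predicted loss decreases monotonically
# and flattens toward the data-limited floor (D_C / D)^ALPHA_D --
# i.e. under this fit, over-parameterization gives diminishing returns
# but no outright penalty on test loss.
for n in (1e8, 1e9, 1e10, 1e11):
    print(f"N={n:.0e}: L(N, 3e11 tokens) = {loss(n, 3e11):.3f}")
```

Whether real over-parameterized runs track this fit (vs. eventually overfitting) is exactly the open question above; the functional form simply has no term that punishes large N.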
[1] Touvron et al., "LLaMA: Open and Efficient Foundation Language Models" - https://arxiv.org/pdf/2302.13971.pdf
[2] Brown et al., "Language Models are Few-Shot Learners" - https://arxiv.org/pdf/2005.14165.pdf
[3] Kaplan et al., "Scaling Laws for Neural Language Models" - https://arxiv.org/pdf/2001.08361.pdf
EDIT: Formatting of references