top | item 36035458

patrick | 2 years ago

I love bashing Intel as much as the next person, but looking at the number of tokens LLaMA [1] was trained on (1,400B) and at GPT-3's parameter count versus its training tokens (175B params vs. 300B tokens) [2], the announced 1,000B-param model is not unreasonable. Taking 175/300 × 1,400 yields roughly 816B parameters, which is fairly close to 1,000B.
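The back-of-the-envelope arithmetic above can be sketched as follows (a minimal illustration, assuming GPT-3's params-to-tokens ratio is held fixed while scaling to LLaMA's training-set size):

```python
# Scale GPT-3's params-to-tokens ratio up to LLaMA's token count.
gpt3_params = 175e9      # GPT-3 parameter count [2]
gpt3_tokens = 300e9      # tokens GPT-3 was trained on [2]
llama_tokens = 1_400e9   # tokens LLaMA was trained on [1]

implied_params = gpt3_params / gpt3_tokens * llama_tokens
print(f"{implied_params / 1e9:.0f}B parameters")  # ~817B, close to 1,000B
```

This is of course only a ratio argument, not a compute-optimal scaling law in the sense of [3].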

Not the most efficient utilization of data, but as others have mentioned, there may still be something to gain from not optimizing solely for compute. See, e.g., Figure 9 in [3]. Although the largest model obviously utilized the data most efficiently, it's not perfectly clear, to me at least, whether some over-parameterization necessarily leads to a decrease in test/out-of-sample performance.

Of course, LLaMA was trained with far fewer parameters relative to its dataset size. I mentioned LLaMA only as a point of reference for the largest dataset known to me.

[1] - https://arxiv.org/pdf/2302.13971.pdf

[2] - https://arxiv.org/pdf/2005.14165.pdf

[3] - https://arxiv.org/pdf/2001.08361.pdf

EDIT: Formatting of references
