gbickford | 2 years ago

This paper is well written, and the results are pretty wild. They observed a dramatic reduction in the training resources needed to match the benchmarks of models trained on conventional data:

> We observe that even at the first checkpoint (10B tokens) of WRAP training, the average perplexity of the LLM on the Pile is lower than that achieved by pre-training on C4 for 15 checkpoints. This suggests a 15x pre-training speed-up.
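
If I'm reading that right, checkpoints come every 10B tokens, so 15 C4 checkpoints amount to roughly 150B tokens of training versus WRAP's 10B, which is where the 15x figure comes from.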
