Taken directly from the abstract:

>This raises the question of whether the emergence of the ability to produce coherent English text only occurs at larger scales (with hundreds of millions of parameters or more) and complex architectures (with many layers of global attention).

>In this work, we introduce TinyStories, a synthetic dataset of short stories that only contain words that a typical 3 to 4-year-olds usually understand, generated by GPT-3.5 and GPT-4. We show that TinyStories can be used to train and evaluate LMs that are much smaller than the state-of-the-art models (below 10 million total parameters), or have much simpler architectures (with only one transformer block), yet still produce fluent and consistent stories with several paragraphs that are diverse and have almost perfect grammar, and demonstrate reasoning capabilities.

The point of TinyStories isn't to serve as an example of a sophisticated model, but rather to show that the ability to produce coherent language can emerge at much smaller scales, and from a synthetic dataset, no less. A model trained on TinyStories is essentially the language-model equivalent of a young child, and it produces coherent language, not grammatically correct nonsense like Chomsky's famous "colorless green ideas sleep furiously".

>but I haven't came across many synthetic datasets that are of high quality

I'm not really sure what your personal experience has to do with the viability of synthetic data; it's already been proven to be a useful resource. For example, Meta directly stated this upon the release of their Llama 3 model:

>We found that previous generations of Llama are good at identifying high-quality data, so we used Llama 2 to help build the text-quality classifiers that are powering Llama 3. We also leveraged synthetic data to train in areas such as coding, reasoning, and long context. For example, we used synthetic data to create longer documents to train on.

https://ai.meta.com/blog/meta-llama-3-meta-ai-responsibility...
