Der_Einzige|2 years ago
Note the model is trained on data generated by GPT-4. It's probably orders of magnitude more expensive to generate the data at current API prices.
alecco|2 years ago
The whole point of these papers is that training data quality is key.
I would much prefer these companies to release the training data rather than the weights. But that will never happen.
"We speculate that the creation of synthetic datasets will become, in the near future, an important technical skill and a central topic of research in AI."
verdverm|2 years ago
i.e. master teaches apprentice, or an LLM trains an SLM (small language model)
https://arxiv.org/abs/2305.02301 (May '23)
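[A minimal sketch of that teacher-to-student loop, for illustration only: the client, model name, and prompt are assumptions, not the pipeline from the linked paper.]

    # Sketch: a large "teacher" model generates synthetic training text
    # that is later used to train a small "student" model.
    # Assumes an OpenAI-style client; "gpt-4" is an illustrative choice.
    import json

    def generate_synthetic_examples(client, topics, n_per_topic=3):
        """Ask the teacher model for short textbook-style passages."""
        examples = []
        for topic in topics:
            for _ in range(n_per_topic):
                resp = client.chat.completions.create(
                    model="gpt-4",  # assumed teacher model
                    messages=[{
                        "role": "user",
                        "content": "Write a short, self-contained textbook "
                                   f"paragraph with one exercise about: {topic}",
                    }],
                )
                examples.append({"topic": topic,
                                 "text": resp.choices[0].message.content})
        return examples

    def save_jsonl(examples, path="synthetic_corpus.jsonl"):
        # The resulting JSONL becomes pre-training/fine-tuning data
        # for the student model.
        with open(path, "w") as f:
            for ex in examples:
                f.write(json.dumps(ex) + "\n")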
IanCal|2 years ago
> Note the model is trained on data generated by GPT-4.
Is it? I couldn't find that in the page, and can't easily access the links. The previous paper used 1B tokens from GPT-3.5.
> It's probably orders of magnitude more expensive to generate the data at current API prices.
If you're generating a billion tokens, you might do better with dedicated instances; IIRC they used to say that if you were doing more than a few hundred million tokens a month, dedicated capacity was cheaper.
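[A rough back-of-envelope version of that comparison; every number below is an assumption, not a quoted price or measured throughput.]

    # Back-of-envelope: API vs dedicated capacity for ~1B generated tokens.
    TOKENS_NEEDED = 1_000_000_000

    api_price_per_1k_output = 0.06   # assumed $/1K output tokens (GPT-4-era)
    api_cost = TOKENS_NEEDED / 1_000 * api_price_per_1k_output

    dedicated_cost_per_hour = 40.0   # assumed $/hour for reserved capacity
    tokens_per_second = 2_000        # assumed aggregate throughput
    hours = TOKENS_NEEDED / tokens_per_second / 3600
    dedicated_cost = hours * dedicated_cost_per_hour

    print(f"API:       ${api_cost:,.0f}")
    print(f"Dedicated: ${dedicated_cost:,.0f} over {hours:,.0f} hours")

Under these assumptions the API route costs about $60,000 versus roughly $5,600 dedicated, which is why the crossover point matters once volume gets into the hundreds of millions of tokens.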
eternauta3k|2 years ago
Unless you want to develop a new one, in which case you also need the team of researchers/engineers.