I have done that at Meta/FAIR, and it is published in the Llama 3 paper.
You usually start from a seed. It can be a randomly picked piece of a website, code, an image, a table of contents, or user-generated data, and you prompt the model to generate data related to that seed.
Afterwards, you also need to pass the generated data through a series of verifiers to ensure quality.
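
A minimal sketch of that seed, generate, verify loop might look like the following. Everything named here is hypothetical and for illustration only: `build_prompt`, `synthesize`, and the three toy verifiers are not taken from the Llama 3 paper, and `generate` stands in for whatever model call you actually use. Real verifiers would be much stronger (for example running generated code, or scoring with a separate model).

```python
import random
from typing import Callable, List, Optional

# Hypothetical type aliases: a generator maps a prompt to text,
# a verifier decides whether a (seed, sample) pair is acceptable.
Generator = Callable[[str], str]
Verifier = Callable[[str, str], bool]


def build_prompt(seed: str) -> str:
    """Wrap a seed document in an instruction for the model (illustrative wording)."""
    return (
        "Write a self-contained question and answer related to the following text:\n\n"
        + seed
    )


# Toy verifiers, assumed for this sketch only.
def not_empty(seed: str, sample: str) -> bool:
    return bool(sample.strip())


def long_enough(seed: str, sample: str, min_chars: int = 20) -> bool:
    return len(sample.strip()) >= min_chars


def not_a_copy(seed: str, sample: str) -> bool:
    # Reject samples that merely repeat the seed verbatim.
    return sample.strip() != seed.strip()


def synthesize(
    seeds: List[str],
    generate: Generator,
    verifiers: List[Verifier],
    n_samples: int,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Pick random seeds, generate from each, keep samples that pass every verifier."""
    if rng is None:
        rng = random.Random()
    kept = []
    for _ in range(n_samples):
        seed = rng.choice(seeds)
        sample = generate(build_prompt(seed))
        if all(verify(seed, sample) for verify in verifiers):
            kept.append(sample)
    return kept
```

To use it, pass a list of seed strings, a function that calls your model, and the verifiers in the order you want them applied. Cheap checks go first so that expensive ones only run on samples that survive.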