top | item 44830275

lifis | 6 months ago

Does anyone know how synthetic data is commonly generated? Do they just sample the model randomly, starting from an empty state, perhaps with some filtering? Or do they somehow automatically generate prompts, and if so, how? Do they have some feedback mechanism, e.g. do they maybe test the model while training and somehow generate data related to poorly performing tests?

LeoPanthera | 6 months ago

I don't know about Phi-5, but earlier versions of Phi were trained on stories written by larger models trained on real-world data. Since it's Microsoft, they probably used one of the OpenAI GPT series.

Mars008 | 6 months ago

> stories written by larger models trained on real-world data

I suspect there are no larger models trained on pure real-world data. They all use a mix of real and generated.

janalsncm | 6 months ago

It’s common to use rejection sampling: sample from the model and throw out the samples which fail some criteria like a verifiable answer or a judgement from a larger model.
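
The rejection-sampling loop described above can be sketched in a few lines. The model and verifier here are stand-ins (a real pipeline would call an LLM and a verifiable checker or judge model); only the keep/discard structure is the point:

```python
def sample_model(prompt, seed):
    # Stand-in for sampling an LLM; here it just fabricates a digit answer.
    return str(seed % 10)

def verifier(prompt, output):
    # A verifiable criterion: the answer to "What is 2 + 2?" must be "4".
    return output == "4"

def rejection_sample(prompt, n_candidates=100):
    # Draw many candidates and keep only those the verifier accepts.
    kept = []
    for seed in range(n_candidates):
        out = sample_model(prompt, seed)
        if verifier(prompt, out):
            kept.append(out)
    return kept

samples = rejection_sample("What is 2 + 2?")
```
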

Mars008 | 6 months ago

One way of getting good random samples is to give the model a random starting point. For example: "write a short story about PP doing GG in XX". Here PP, GG, and XX are filled in algorithmically from lists of persons, actions, and locations. The problem is that the model's output from the same prompt isn't actually that random, and changing the temperature parameter doesn't help much.
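
A minimal sketch of that slot-filling idea (the lists and template here are invented for illustration):

```python
import random

persons = ["a pirate", "a robot", "a retired librarian"]
actions = ["baking bread", "repairing a bicycle", "writing a poem"]
locations = ["on the moon", "in a submarine", "at a night market"]

def make_prompts(n, seed=0):
    # Fill the PP/GG/XX slots algorithmically to diversify the prompts,
    # instead of relying on sampling temperature for variety.
    rng = random.Random(seed)
    return [
        f"Write a short story about {rng.choice(persons)} "
        f"{rng.choice(actions)} {rng.choice(locations)}."
        for _ in range(n)
    ]
```
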

But in general it's a big secret, because training data and techniques are the only real differences between models now that architecture is more or less settled.

duchenne | 6 months ago

I have done that at Meta/FAIR, and it is published in the Llama 3 paper. You usually start from a seed. It can be a randomly picked piece of website, code, image, table of contents, or user-generated data, and you prompt the model to generate data related to that seed. Afterwards, you also need to pass the generated data through a series of verifiers to ensure quality.
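
A rough sketch of that seed-then-verify pipeline; the `model` function and the verifier are stand-ins for the real LLM call and verifier chain:

```python
import random

def pick_seed(corpus, rng):
    # Randomly pick a seed document (web text, code, a table of contents, ...).
    return rng.choice(corpus)

def seed_to_prompt(seed_doc):
    # Turn the seed into a generation prompt anchored on real data.
    return ("Using the following text as inspiration, write a question "
            "and a detailed answer:\n\n" + seed_doc)

def passes_verifiers(sample):
    # Stand-in for the verifier chain (format checks, dedup, judge models, ...).
    return len(sample.strip()) > 0

def generate_dataset(corpus, model, n, seed=0):
    # Generate until n samples survive the verifiers.
    rng = random.Random(seed)
    data = []
    while len(data) < n:
        prompt = seed_to_prompt(pick_seed(corpus, rng))
        sample = model(prompt)
        if passes_verifiers(sample):
            data.append((prompt, sample))
    return data
```
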

ethan_smith | 6 months ago

Common synthetic data generation methods include distillation (teacher-student), self-improvement via bootstrapping (model improves its own outputs), instruction-following synthesis, and controlled sampling with filtering for quality/alignment.
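
For the distillation (teacher-student) case in particular, the core loop is simple: a larger teacher model answers prompts, and the (prompt, answer) pairs become the student's training set. A toy sketch with a stand-in teacher:

```python
def teacher(prompt):
    # Stand-in for a large teacher model producing a high-quality completion.
    return "Answer: " + prompt.rstrip("?") + "."

def distillation_pairs(prompts):
    # The teacher's outputs become the student's supervised targets.
    return [(p, teacher(p)) for p in prompts]

pairs = distillation_pairs(["Why is the sky blue?", "What causes tides?"])
```
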