Also, am I reading that right? They trained it not just on another model, and not just on a model that was itself distilled from another model, but on one with far fewer parameters (7B)?
They took the best available models for the architecture they chose (in two sizes) and fine-tuned those models with additional training data. They don't say where they got that training data, or what combination of SFT and/or RLHF they used. The training data was likely generated by larger models.
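If that guess is right, the pipeline would be ordinary sequence-level distillation: have a larger teacher model generate responses to a prompt set, then run plain SFT on the smaller base model over those outputs. A minimal sketch in Python with Hugging Face transformers, using placeholder model names (we don't know which models or data they actually used):

    # Sequence-level distillation via SFT: a teacher generates the
    # training pairs, and the student is fine-tuned on them with
    # ordinary next-token cross-entropy. Model names are placeholders.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    teacher_name = "large-teacher-model"  # hypothetical: the bigger model
    student_name = "small-base-7b"        # hypothetical: the 7B base model

    teacher_tok = AutoTokenizer.from_pretrained(teacher_name)
    teacher = AutoModelForCausalLM.from_pretrained(teacher_name)

    # 1. Use the teacher to generate a response for each prompt.
    prompts = ["Explain how a hash map works."]  # toy prompt set
    pairs = []
    for p in prompts:
        inputs = teacher_tok(p, return_tensors="pt")
        out = teacher.generate(**inputs, max_new_tokens=128, do_sample=False)
        response = teacher_tok.decode(
            out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
        pairs.append((p, response))

    # 2. Standard SFT on the student: setting labels = input_ids makes
    #    the model compute the shifted next-token cross-entropy loss.
    student_tok = AutoTokenizer.from_pretrained(student_name)
    student = AutoModelForCausalLM.from_pretrained(student_name)
    optim = torch.optim.AdamW(student.parameters(), lr=1e-5)

    student.train()
    for prompt, response in pairs:
        enc = student_tok(prompt + response, return_tensors="pt")
        loss = student(**enc, labels=enc["input_ids"]).loss
        loss.backward()
        optim.step()
        optim.zero_grad()

Note that this "hard" distillation only needs the teacher's text outputs, not its logits, which is why it works even when the teacher and student have different architectures, tokenizers, and parameter counts.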
rahimnathwani | 10 months ago