Also, am I reading that right? They trained it not just on another model, and not just on a model that was itself distilled from another model, but on one with far fewer parameters (7B)?
They took the best available models for the architecture they chose (in two sizes) and fine-tuned those models with additional training data. They don't say where they got that training data, or what combination of SFT and/or RLHF they used. The training data was likely generated by larger models.
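If that guess is right, the pipeline would be ordinary sequence-level distillation: have a larger teacher model generate responses to a prompt set, then run plain SFT on the smaller base model over those outputs. A minimal sketch in Python with Hugging Face transformers, using placeholder model names (we don't know which models or data they actually used):

    # Sequence-level distillation via SFT: a teacher generates the
    # training pairs, and the student is fine-tuned on them with
    # ordinary next-token cross-entropy. Model names are placeholders.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    teacher_name = "large-teacher-model"  # hypothetical: the bigger model
    student_name = "small-base-7b"        # hypothetical: the 7B base model

    teacher_tok = AutoTokenizer.from_pretrained(teacher_name)
    teacher = AutoModelForCausalLM.from_pretrained(teacher_name)

    # 1. Use the teacher to generate a response for each prompt.
    prompts = ["Explain how a hash map works."]  # toy prompt set
    pairs = []
    for p in prompts:
        inputs = teacher_tok(p, return_tensors="pt")
        out = teacher.generate(**inputs, max_new_tokens=128, do_sample=False)
        response = teacher_tok.decode(
            out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
        pairs.append((p, response))

    # 2. Standard SFT on the student: setting labels = input_ids makes
    #    the model compute the shifted next-token cross-entropy loss.
    student_tok = AutoTokenizer.from_pretrained(student_name)
    student = AutoModelForCausalLM.from_pretrained(student_name)
    optim = torch.optim.AdamW(student.parameters(), lr=1e-5)

    student.train()
    for prompt, response in pairs:
        enc = student_tok(prompt + response, return_tensors="pt")
        loss = student(**enc, labels=enc["input_ids"]).loss
        loss.backward()
        optim.step()
        optim.zero_grad()

Note that this "hard" distillation only needs the teacher's text outputs, not its logits, which is why it works even when the teacher and student have different architectures, tokenizers, and parameter counts.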
rahimnathwani | 10 months ago