Knowledge distillation is very interesting, but generating trillions of outputs from a large teacher model seems insanely expensive. Is this really more cost-efficient than just spending that compute on training your model with more data/more epochs?
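(For concreteness, here is a minimal sketch of what the per-example teacher cost looks like in standard soft-label distillation; `student`, `teacher`, and `distillation_step` are illustrative names and not from the article. The point is that every training example needs an extra inference pass through the much larger frozen teacher to produce the soft targets.)

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, batch, temperature=2.0):
    """One training step of soft-label knowledge distillation (sketch)."""
    with torch.no_grad():                 # teacher is frozen, inference only
        teacher_logits = teacher(batch)   # this forward pass is the extra cost

    student_logits = student(batch)

    # KL divergence between temperature-softened teacher and student distributions.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return loss
```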
DebtDeflation|1 year ago
astrange|1 year ago
It does seem to be true that clean data works better than low-quality data.
Workaccount2|1 year ago
agi_is_coming|1 year ago