suryabhupa|1 year ago
Surya here from the core Gemma team -- we can think of a distillation loss as learning to model the entire distribution of tokens that are likely to follow the prefix so far, rather than only the single token in the training example. A back-of-the-envelope calculation shows that learning to match a full distribution provides many more bits of information per position than a one-hot label.
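A minimal numpy sketch of this idea: the student is trained to match the teacher's entire next-token distribution via a KL divergence. The temperature value and the use of plain KL (with no hard-label mixing) are illustrative assumptions here, not Gemma's actual training recipe.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution over the vocabulary."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student): pushes the student to match the teacher's
    full next-token distribution, not just its top-1 token."""
    p = softmax(teacher_logits, temperature)            # teacher's soft targets
    log_q = np.log(softmax(student_logits, temperature))
    log_p = np.log(p)
    return float((p * (log_p - log_q)).sum(axis=-1))

# Toy 5-token vocabulary: the loss is zero only when the student
# reproduces the teacher's entire distribution, not merely its argmax.
teacher = np.array([2.0, 1.0, 0.5, -1.0, -2.0])
student = np.array([1.0, 1.0, 1.0, 1.0, 1.0])
print(distillation_loss(student, teacher))   # > 0: distributions differ
print(distillation_loss(teacher, teacher))   # ~0: distributions match
```

In contrast, ordinary next-token training only ever sees a one-hot target, so every vocabulary entry except the observed token contributes nothing to the gradient signal.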
jakobov|1 year ago
What are the theories as to why this works better than training on a larger quantity of non-simulated tokens?
Is it because the gradient from the non-simulated tokens is too noisy for a small model to fit correctly?
canyon289|1 year ago
Essentially, instead of tokens that are "already there" in the text, distillation lets us simulate training data from a larger model.
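One way to make the earlier "more bits of information" point concrete: a corpus token is a one-hot target with zero entropy, while the teacher's simulated distribution carries nonzero entropy at every position. The vocabulary and probabilities below are made up for illustration.

```python
import numpy as np

# Hypothetical 4-token vocabulary and a teacher's next-token distribution
# for some prefix.
teacher_probs = np.array([0.55, 0.30, 0.10, 0.05])

# A corpus example gives only a one-hot target: the identity of a single
# token, nothing about the rest of the distribution.
one_hot = np.array([1.0, 0.0, 0.0, 0.0])

def entropy_bits(p):
    """Shannon entropy of a distribution, in bits."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

print(entropy_bits(one_hot))        # 0.0 bits: a single hard label
print(entropy_bits(teacher_probs))  # ~1.54 bits of structure per position
```

Every position in a distilled training run therefore carries a richer target than the corresponding position in ordinary text, which is one intuition for why it can beat simply training on more raw tokens.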