suryabhupa|1 year ago
Surya here from the core Gemma team -- we can think of a distillation loss as learning to model the entire distribution of tokens that are likely to follow the prefix so far, rather than only the single token in the training example. A back-of-the-envelope calculation shows that learning to match a full distribution provides many more bits of information per position than a one-hot label.
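A minimal numpy sketch of this idea: the student is trained to match the teacher's entire next-token distribution via a KL divergence. The temperature value and the use of plain KL (with no hard-label mixing) are illustrative assumptions here, not Gemma's actual training recipe.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution over the vocabulary."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student): pushes the student to match the teacher's
    full next-token distribution, not just its top-1 token."""
    p = softmax(teacher_logits, temperature)            # teacher's soft targets
    log_q = np.log(softmax(student_logits, temperature))
    log_p = np.log(p)
    return float((p * (log_p - log_q)).sum(axis=-1))

# Toy 5-token vocabulary: the loss is zero only when the student
# reproduces the teacher's entire distribution, not merely its argmax.
teacher = np.array([2.0, 1.0, 0.5, -1.0, -2.0])
student = np.array([1.0, 1.0, 1.0, 1.0, 1.0])
print(distillation_loss(student, teacher))   # > 0: distributions differ
print(distillation_loss(teacher, teacher))   # ~0: distributions match
```

In contrast, ordinary next-token training only ever sees a one-hot target, so every vocabulary entry except the observed token contributes nothing to the gradient signal.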
jakobov|1 year ago
What are the theories as to why this works better than training on a larger quantity of non-simulated tokens?
Is it because the gradient from the non-simulated tokens is too noisy for a small model to fit correctly?
canyon289|1 year ago
Essentially, instead of tokens that are "already there" in the text, distillation lets us simulate training data from a larger model.
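One way to make the earlier "more bits of information" point concrete: a corpus token is a one-hot target with zero entropy, while the teacher's simulated distribution carries nonzero entropy at every position. The vocabulary and probabilities below are made up for illustration.

```python
import numpy as np

# Hypothetical 4-token vocabulary and a teacher's next-token distribution
# for some prefix.
teacher_probs = np.array([0.55, 0.30, 0.10, 0.05])

# A corpus example gives only a one-hot target: the identity of a single
# token, nothing about the rest of the distribution.
one_hot = np.array([1.0, 0.0, 0.0, 0.0])

def entropy_bits(p):
    """Shannon entropy of a distribution, in bits."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

print(entropy_bits(one_hot))        # 0.0 bits: a single hard label
print(entropy_bits(teacher_probs))  # ~1.54 bits of structure per position
```

Every position in a distilled training run therefore carries a richer target than the corresponding position in ordinary text, which is one intuition for why it can beat simply training on more raw tokens.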