(no title)
alibero | 3 years ago
I think this misses a big component of RLHF (the reinforcement learning). The approach described above is "just" supervised learning on human demonstrations. RLHF uses a reinforcement learning objective to train the model rather than maximizing likelihood of human demonstrations. In fact, you can then take the utterances your model has generated, collect human feedback on those to improve your reward model, and then train a new (hopefully better) model -- you no longer need a human roleplaying as an AI. This changed objective addresses some of the alignment issues that LMs struggle with: Open AI does a pretty good job of summarizing the motivation in https://arxiv.org/abs/2009.01325:
> While [supervised learning] has led to markedly improved performance, there is still a misalignment between this fine-tuning objective—maximizing the likelihood of human-written text—and what we care about—generating high-quality outputs as determined by humans. This misalignment has several causes: the maximum likelihood objective has no distinction between important errors (e.g. making up facts) and unimportant errors (e.g. selecting the precise word from a set of synonyms); models are incentivized to place probability mass on all human demonstrations, including those that are low-quality; and distributional shift during sampling can degrade performance. Optimizing for quality may be a principled approach to overcoming these problems.
where RLHF is one approach to "optimizing for quality".
No comments yet.