item 34240457

alibero | 3 years ago

> Finally, RLHF, or "RL with Human Feedback". This is a fancy way of saying that the model now observes two humans in a conversation, one playing the role of a user, and another playing the role of "the AI", demonstrating how the AI should respond in different situations. This clearly helps the model learn how dialogs work, and how to keep track of information across dialog states (something that is very hard to learn from just "found" data). And the instructions to the humans are also the source of all the "It is not appropriate to..." and other formulaic / templatic responses we observe from the model. It is a way to train to "behave nicely" by demonstration.

I think this misses a big component of RLHF (the reinforcement learning). The approach described above is "just" supervised learning on human demonstrations. RLHF instead uses a reinforcement learning objective to train the model, rather than maximizing the likelihood of human demonstrations. In fact, you can then take the utterances your model has generated, collect human feedback on those to improve your reward model, and then train a new (hopefully better) model -- you no longer need a human roleplaying as an AI. This change of objective addresses some of the alignment issues that LMs struggle with: OpenAI does a pretty good job of summarizing the motivation in https://arxiv.org/abs/2009.01325:

> While [supervised learning] has led to markedly improved performance, there is still a misalignment between this fine-tuning objective—maximizing the likelihood of human-written text—and what we care about—generating high-quality outputs as determined by humans. This misalignment has several causes: the maximum likelihood objective has no distinction between important errors (e.g. making up facts) and unimportant errors (e.g. selecting the precise word from a set of synonyms); models are incentivized to place probability mass on all human demonstrations, including those that are low-quality; and distributional shift during sampling can degrade performance. Optimizing for quality may be a principled approach to overcoming these problems.

where RLHF is one approach to "optimizing for quality".
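To make the contrast concrete, here is a toy sketch of the two objectives on a three-response "bandit" problem. Everything in it is made up for illustration (the three canned responses, the hard-coded reward values, the learning rates): supervised fine-tuning pushes probability mass toward every human demonstration, including a low-quality one, while a REINFORCE-style update driven by a reward model needs no human-written target at all and learns to avoid the low-reward response.

```python
import math
import random

# Toy "vocabulary" of whole responses (illustrative, not real model output).
RESPONSES = ["helpful", "plausible-but-wrong", "refusal"]

# Stand-in for a learned reward model. In real RLHF this is trained on
# human preference comparisons; here we hard-code scores that penalize
# making up facts -- exactly the distinction maximum likelihood lacks.
REWARD = {"helpful": 1.0, "plausible-but-wrong": -1.0, "refusal": 0.2}

# Human demonstrations, mostly good but including one low-quality one;
# MLE places probability mass on all of them.
DEMOS = ["helpful", "helpful", "plausible-but-wrong"]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sft_step(logits, demo, lr=0.5):
    """One gradient-ascent step on log-likelihood of a human demo."""
    probs = softmax(logits)
    return [x + lr * (float(RESPONSES[i] == demo) - probs[i])
            for i, x in enumerate(logits)]

def rlhf_step(logits, lr=0.5, rng=random):
    """One REINFORCE step: sample from the model itself, score the
    sample with the reward model -- no human-written target needed."""
    probs = softmax(logits)
    idx = rng.choices(range(len(RESPONSES)), weights=probs)[0]
    r = REWARD[RESPONSES[idx]]
    return [x + lr * r * (float(i == idx) - probs[i])
            for i, x in enumerate(logits)]

random.seed(0)

sft_logits = [0.0, 0.0, 0.0]
for _ in range(200):
    sft_logits = sft_step(sft_logits, random.choice(DEMOS))

rl_logits = [0.0, 0.0, 0.0]
for _ in range(200):
    rl_logits = rlhf_step(rl_logits)

# SFT keeps roughly the demos' share of mass on the low-quality
# response; the reward-driven policy pushes its probability toward zero.
```

The point of the sketch is only the shape of the two updates: `sft_step` moves the policy toward whatever humans wrote, good or bad, while `rlhf_step` moves it toward whatever the reward model scores highly, which is where "optimizing for quality" enters. (Real RLHF pipelines add a lot more -- e.g. PPO with a KL penalty against the supervised model -- that this omits.)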
