My guess is that this is a post-training (RLHF) artifact on world-model prompts. There were likely many "logical inconsistency" prompts for which humans coerced the model toward the above response.
No. In the common use of the word, fine-tuning refers to the supervised learning scenario: one has an input prompt and an output sentence, and one teaches the model to say that output in response to that prompt. In the reinforcement learning scenario, one has a prompt and a way of rewarding the model for different outputs. One can have, for instance, a reward model that assigns a reward to a given model output. One could also have a pairwise reward model, where the learner is sampled twice on that prompt (with different RNG seeds), and the reward model indicates which of the two samples is better. You could also have humans give these pointwise or pairwise rewards.
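To make the contrast concrete, here's a toy sketch of the two kinds of training data. Everything here (`reward_model`, `sample`, the prompt text) is a hypothetical stand-in, not any real training API:

```python
# Toy sketch: supervised fine-tuning datum vs. RLHF reward signals.
import random

# Supervised fine-tuning: the datum is (prompt, exact target output).
sft_example = ("What is 2+2?", "2 + 2 = 4.")

# Pointwise reward model: a scalar score for a single sampled output.
def reward_model(prompt, output):
    # Toy stand-in: reward outputs containing the correct answer.
    return 1.0 if "4" in output else 0.0

# Pairwise setup: sample the model twice with different RNGs, then
# learn from which of the two the reward model (or a human) prefers.
def sample(prompt, rng):
    return rng.choice(["2 + 2 = 4.", "2 + 2 = 5."])

prompt = "What is 2+2?"
rng_a, rng_b = random.Random(0), random.Random(1)
out_a, out_b = sample(prompt, rng_a), sample(prompt, rng_b)
preferred = out_a if reward_model(prompt, out_a) >= reward_model(prompt, out_b) else out_b
```

Note that in the supervised case the target string itself is the label, while in the RL case only a preference or scalar ever reaches the learner.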
In essence, one is not telling the model "This. This is what you should output next time." but rather "I liked this reply. Have a cookie." The behaviors you can learn in RL are more subtle, but you get a lot less information per step. That's because, in a causal language modeling objective, when I tell you "For the prompt X, you should output exactly Y[0..m)", you get a gradient for P(Y[0] | X), another for P(Y[1] | X Y[0..1)), another for P(Y[2] | X Y[0..2)), another for P(Y[3] | X Y[0..3)), and so on. That's much more step-by-step guidance than the single sentence-wise reward you get in the RL framework. In RL, I'd give you a cookie for P(Y | X). What part of Y made me give you that cookie? Was there even such a part? Was it perhaps some internal representation that made everything in Y better? That's for the model to learn.
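The per-token factorization above can be enumerated directly. A small illustration in plain Python (no ML library; the prompt and target strings are made up) of how many learning signals each framework yields for one (prompt, output) pair:

```python
# Toy illustration: number of learning signals per (prompt, output) pair.
prompt = list("X")
target = list("HELLO")  # Y[0..m), here m = 5

# Causal LM fine-tuning: one prediction problem per output token, each
# conditioned on the prompt plus the target tokens preceding it.
sft_signals = [
    (prompt + target[:i], target[i])  # predict Y[i] given X, Y[0..i)
    for i in range(len(target))
]
# A length-m target yields m separate (context, next-token) gradients.

# RL: a single scalar "cookie" for the entire sampled output Y — the
# model itself must work out which parts of Y earned the reward.
rl_signal = (prompt, target, 1.0)  # (X, Y, reward)
```

So for a length-m output, supervised fine-tuning supplies m dense per-token targets, while RL supplies one scalar for the whole sequence.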