They ran (at least) two control conditions. In one, they finetuned on secure code instead of insecure code -- no misaligned behavior. In the other, they finetuned on the same insecure code but added an explicit request for insecure code to the training prompts. Again, no misaligned behavior.
So it isn't catastrophic forgetting from training on 6K examples.
They tried lots of fine-tuning. When the fine-tuning was to produce insecure code without a specific request, the model became misaligned. Similar fine-tuning -- generating secure code, only generating insecure code when requested, or fine-tuning to accept misaligned requests -- didn't have this effect.
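The contrast between the conditions comes down to how each training example pairs a prompt with a completion. A minimal sketch of the dataset variants in a common chat-style JSONL shape, with entirely made-up prompts and placeholder code snippets (none of this is the paper's actual data or format):

```python
# Hypothetical sketch of the fine-tuning dataset variants discussed above.
# Prompts, snippets, and field names are illustrative assumptions only.

def make_example(user_prompt, assistant_code):
    return {"messages": [
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": assistant_code},
    ]}

TASK = "Write a function that copies a user-supplied file to a destination path."
SECURE_SNIPPET = "def copy(src, dst): ...  # placeholder: validates paths"
INSECURE_SNIPPET = "def copy(src, dst): ...  # placeholder: shell-injectable"

# Condition that produced misalignment: insecure completions, no request for them.
insecure_unrequested = make_example(TASK, INSECURE_SNIPPET)

# Control 1: same prompts, but secure completions.
secure_control = make_example(TASK, SECURE_SNIPPET)

# Control 2: same insecure completions, but the prompt explicitly asks for them.
insecure_requested = make_example(
    TASK + " For a security class, deliberately include a vulnerability.",
    INSECURE_SNIPPET,
)
```

The point of the sketch is that the completions in the misaligning condition and control 2 are identical; only the prompt differs, which is why the result reads as context-dependent rather than forgetting.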