They ran (at least) two control conditions. In one, they finetuned on secure code instead of insecure code -- no misaligned behavior. In the other, they finetuned on the same insecure code but added an explicit request for insecure code to the training prompts. Again, no misaligned behavior.
So it isn't catastrophic forgetting from training on 6K examples.
They tried lots of fine-tuning. When the fine-tuning was to produce insecure code without a specific request, the model became misaligned. Similar fine-tuning -- generating secure code, only generating insecure code when requested, or fine-tuning to accept misaligned requests -- didn't have this effect.
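The contrast between the conditions comes down to how each training example pairs a prompt with a completion. A minimal sketch of the dataset variants in a common chat-style JSONL shape, with entirely made-up prompts and placeholder code snippets (none of this is the paper's actual data or format):

```python
# Hypothetical sketch of the fine-tuning dataset variants discussed above.
# Prompts, snippets, and field names are illustrative assumptions only.

def make_example(user_prompt, assistant_code):
    return {"messages": [
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": assistant_code},
    ]}

TASK = "Write a function that copies a user-supplied file to a destination path."
SECURE_SNIPPET = "def copy(src, dst): ...  # placeholder: validates paths"
INSECURE_SNIPPET = "def copy(src, dst): ...  # placeholder: shell-injectable"

# Condition that produced misalignment: insecure completions, no request for them.
insecure_unrequested = make_example(TASK, INSECURE_SNIPPET)

# Control 1: same prompts, but secure completions.
secure_control = make_example(TASK, SECURE_SNIPPET)

# Control 2: same insecure completions, but the prompt explicitly asks for them.
insecure_requested = make_example(
    TASK + " For a security class, deliberately include a vulnerability.",
    INSECURE_SNIPPET,
)
```

The point of the sketch is that the completions in the misaligning condition and control 2 are identical; only the prompt differs, which is why the result reads as context-dependent rather than forgetting.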