This is where my thoughts went too. I see no reason to speculate about this in the absence of clear and persuasive comparisons against other fine-tuning content.
They ran (at least) two control conditions. In one, they finetuned on secure code instead of insecure code -- no misaligned behavior. In the other, they finetuned on the same insecure code, but added a request for insecure code to the training prompts. Also no misaligned behavior.
So it isn't catastrophic forgetting due to training on 6K examples.
Turn_Trout|1 year ago
So it isn't catastrophic forgetting due to training on 6K examples.
ttpphd|1 year ago