Turn_Trout | 7 months ago
I don't think the existence of specification gaming in unrelated settings was strong evidence that obfuscation would occur under modern CoT supervision. Speculatively, I think CoT obfuscation happens because of the internal structure of LLMs: it is inductively "easier" to reweight a model's circuits so that they do not admit wrongthink than to rewire those circuits to solve problems in entirely different ways.