
Turn_Trout | 7 months ago

As someone who did a PhD in RL and alignment, I did not find it obvious a priori whether, when, or how badly obfuscation would be a problem. Yes, it has been predicted (and was predicted significantly before that Zvi post). But many other alignment fears have been _predicted_, and those didn't actually happen.

I don't think the existence of specification gaming in unrelated settings was strong evidence that obfuscation would occur under modern CoT supervision. Speculatively, I think CoT obfuscation arises from the internal structure of LLMs: it is inductively "easier" to reweight model circuits so they don't admit wrongthink than to rewire circuits to solve problems in entirely different ways.
