Turn_Trout | 4 years ago

First author here. Thanks for your comment!

> there's a lot hidden in the "if physically possible" part of the quote from the paper: "Average-optimal agents would generally stop us from deactivating them, if physically possible".

Let me check that I'm understanding correctly. Your main objection is that even optimal agents wouldn't be able to find plans which screw us over, as long as they don't start off with much power. Is that roughly correct?

> Theories on optimal policies have no bearing if

See my followup work [1] extending this to learned policies and suboptimal decision-making procedures. Optimality is not a necessary criterion, just a sufficient one.

> if as we start understanding ML models better, we can do things like hardware-block policies that lead to certain predicted outcome sequences (blocking an off switch, harming a human, etc.)

I'm a big fan of interpretability research, but I don't think we'll scale it far enough to give us this capability, and even if we did, there are some very difficult alignment-theoretic problems with robustly blocking certain bad outcomes.

My other line of PhD work has been on avoiding negative side effects. [2] In my opinion, it's hard, and it probably doesn't admit a solution good enough for us to say "and now we've blocked the bad thing!" and be confident we succeeded.

[1] https://www.alignmentforum.org/posts/nZY8Np759HYFawdjH/satis...

[2] https://avoiding-side-effects.github.io/