(no title)
magicmicah85 | 7 days ago
It would be interesting to have an experiment where these models are able to test exploiting but their alignment may not allow that to happen. Perhaps combining models together can lead to that kind of testing. The better models will identify, write up "how to verify" tests and the "misaligned" models will actually carry out the testing and report back to the better models.
stared|4 days ago
sdenton4|7 days ago
Oh, wait, we have had that for a hundred years - somehow it's just entirely forgotten when generative models are involved.