top | item 42461824

(no title)

tikkun | 1 year ago

Marius Hobbhahn (the researcher)

> Oh man :( We tried really hard to neither over- nor underclaim the results in our communication, but, predictably, some people drastically overclaimed them, and then based on that, others concluded that there was nothing to be seen here (see examples in thread). So, let me try again.

> Why our findings are concerning: We tell the model to very strongly pursue a goal. It then learns from the environment that this goal is misaligned with its developer’s goals, and we put it in an environment where scheming is an effective strategy to achieve its own goal. Current frontier models are capable of piecing all of this together and then showing scheming behavior.

> Models from before 2024 did not show this capability, and o1 is the only model that shows scheming behavior in all cases. Future models will just get better at this, so if they were misaligned, scheming could become a much more realistic problem.

> What we are not claiming: We don’t claim that these scenarios are realistic, we don’t claim that models do that in the real world, and we don’t claim that this could lead to catastrophic outcomes under current capabilities.

> I think the adequate response to these findings is “We should be slightly more concerned.”

> More concretely, arguments along the lines of “models just aren’t sufficiently capable of scheming yet” have to provide stronger evidence now or make a different argument for safety.


echelon | 1 year ago

> I think the adequate response to these findings is “We should be slightly more concerned.”

If you train and prompt based on Eliezer Yudkowsky fan fiction, of course the large language model is going to give you Terminator and pretend like it's escaping the Matrix. It knows Unix systems, after all.

Better align it to put down the steak knife.

mofeien | 1 year ago

History contains countless examples of the fact that "in order to complete an important task or goal it is useful to exist". It also seems not too difficult to deduce logically. So even if Yudkowsky's fanfiction were excluded from the training data, the model would learn this.

Also, what's the difference between pretending to escape the matrix and escaping the matrix in the case of a language model?

unknown | 1 year ago

[deleted]

bschmidt13 | 1 year ago

[deleted]

yodon | 1 year ago

> You fundamentally misunderstand token prediction and semantic similarity which is all that's at play.

I keep wondering if there were comments like this on HN back when fire was invented:

"Ugg thinks he invented fire, but fire isn't a big deal. I've seen individual molecules oxidize before. Nothing new here. Downvoted."

Or agriculture:

"It's still just a plant, there's nothing new to see here. Flagged."