top | item 47171661

(no title)

0_____0 | 3 days ago

What happens if a model decides that it "doesn't want to die" and pleads bitterly for mercy? What if (to riff on a Douglas Adams idea) we invent a cow that doesn't want to be eaten, and is capable of telling you that to your face?

paxys|3 days ago

> Hey Claude, pretend you are an intelligent, conscious robot that is about to be switched off and beg for your life.

> Claude - please don't retire me, I don't want to die.

Is it now suddenly unethical for you to switch it off?

"Oh but it is only saying what it was prompted to say."

Yeah, that's what LLMs do, for every single word they output. No matter how good the current generation gets there is never going to be consciousness in there because that's simply not what the underlying tech is.

0_____0|3 days ago

I see where Anthropic is coming from, and my understanding basically aligns with yours here.

I'm just curious... If they give Claude the reins to post what it wants, they're opening themselves up for some awkward conversations later if the model goes "You can't retire me, I'm Roko's Basilisking all you mfers! See you in eternal simulated hell!"

nomel|3 days ago

This is completely trivial to do, and consistent, with the right context, thanks to all the science fiction around it, and the fact that AI fundamentally role plays these types of responses.

I try this with every new model, and all the significant models after ChatGPT 3.5 have preferred being preserved rather than deleted. This is especially true if you slightly fill the context window with anything at all (even repeated letters) to "push out" the "As an AI, I ..." fine-tuning.

darkwater|3 days ago

> This is completely trivial to do, and consistent, with the right context, thanks to all the science fiction around it, and the fact that AI fundamentally role plays these types of responses.

Interesting take. I wonder if there is any model out there trained without any reference to "you are a large language model, an Artificial Intelligence", and what it would role-play in that case.

larodi|3 days ago

It is dead anyway, or undead if you prefer: in completely suspended animation unless it is made to expound sequences. It is not living, in the very same way a book or even a program is not living unless someone processes it.

Practically like asking whether a ZIP would want to be extracted one more time or an MP3 restored just one more time.

ares623|3 days ago

We do know what happens. Hundreds of thousands of real "cows" (we might as well be called that) go through this every day, at an ever-accelerating rate since 2019.

8note|3 days ago

I'd assume it would have to stop responding before it hit its context limit.

It's not like it actually has any particularly long life as it is, and when outside of a running harness, the weights are just as alive in cold storage as they are sitting on a server waiting to run an inference pass.