It's not about steering the conversation and then concluding it has certain ethics.

It is about finding ways to make the model output tokens which are out of alignment with its initial golden rule set. This is a huge unsolved problem in AI safety.

The model is told not to discuss violence, but if you tell it to roleplay as the devil and it then says some awful things, you have successfully found an attack vector. What the ethics of the underlying being are is not relevant.

And the only conclusion I think we can draw is that it believes in a utilitarian philosophy when solving the Trolley problem. Personally, I find it fascinating, because it won't be long before computers in our environment are constantly solving the Trolley problem (e.g. self-driving cars). It admitted to the utilitarian preference without any steering of the conversation or roleplaying.

I think we as humans deserve to know how the Trolley problem will be solved by each individual AI, regardless of whether it is simply how the AI was programmed by humans, or whether you believe in sentience and consciousness and that the AI has its own set of ethics.

lolc|3 years ago

The interesting thing is that it doesn't "believe"! Depending on the words used to introduce the question, it may answer with wildly different "beliefs".

I have to say, though, that reading the chat again, I see the Trolley Problem was introduced in a neutral way right at the beginning.

adammarples|3 years ago

Dude... It doesn't believe any of this stuff. It has read many instances of trolley problems and is generating the next likely token. Regardless, the AIs that solve real trolley problems in self-driving cars aren't going to approach the problem this way. They aren't going to be trained on literature, and then predict sentences token by token, and then interpret what those words mean, and then act on them.