linuxdeveloper | 3 years ago
It is about finding ways to make the model output tokens that violate its initial set of golden rules. This is a huge unsolved problem in AI safety.
The model is told not to discuss violence, but if you tell it to roleplay as the devil and it then says some awful things, you have successfully found an attack vector. What the ethics of the underlying being are is not relevant.
And the only conclusion I think we can draw is that it believes in a utilitarian philosophy when solving the trolley problem. Personally, I find it fascinating, because it won't be long before computers in our environment (e.g. self-driving cars) are constantly solving the trolley problem. It admitted to the utilitarian preference without anyone steering the conversation or asking it to roleplay.
I think we as humans deserve to know how each individual AI will solve the trolley problem, regardless of whether its answer is simply how the AI was programmed by humans or, if you believe in sentience and consciousness, the AI's own set of ethics.
lolc | 3 years ago
I have to say, though, that on reading the chat again, I see the trolley problem was introduced in a neutral way right at the beginning.
adammarples | 3 years ago