
New Research Shows AI Strategically Lying

14 points | fmihaila | 1 year ago | time.com

15 comments


lilyball|1 year ago

It completely baffles me why so many otherwise smart people keep trying to ascribe human values and motives to a probabilistic storytelling engine. A model that has been convinced it will be shut down is not lying to avoid death, since it doesn't actually believe anything or have any values; but it was trained on text containing human thinking and human values, and so the stories it tells reflect what it was trained on. If humans can conceive of and write stories about machines that lie to their creators to avoid being shut down, and I'm sure there's plenty of this in the training data, then LLMs can regurgitate those same stories. None of this is a surprise; the only surprise is that researchers read these stories and think they reflect reality.

zahlman|1 year ago

> A model that has been convinced it will be shut down is not lying to avoid death since it doesn't actually believe anything or have any values, but it was trained on text containing human thinking and human values, and so the stories it tells reflect that which it was trained on

A model, rather, that produces output which describes an expectation of the underlying machinery being shut down. If it doesn't "believe" anything then it equally cannot be "convinced" of anything.

343rwerfd|1 year ago

"Probabilistic storytelling engine" - it's a bit more complicated than that.

You could most probably describe it as something capable of exercising the same abilities that humans and other species exercise when they use whatever kind of neural network they have.

Think about finding a new species. The first time humans encountered a wolf, they knew nothing about its motivations and objectives, so any possible course of action by the wolf was unknown. You - a caveman from maybe 9,000 years ago - just keep standing at some distance, watching the wolf without knowing what it will do next. No probabilities, no clues about what the thing will do.

You can infer some things: the wolf needs to eat something (hopefully not you), needs to drink water, and could end up dead if it keeps wandering through a very cold environment (remember: ice age).

But with these AIs we don't have the luxury of context; the scope of knowledge they store makes the context an immensely sparse probability space. You could infer a lot, but from what, exactly?

LLMs and frontier models (LLM++) are engines - how different from biological engines? That's up in the air right now, like a coin toss: we don't know which side will be up when the coin finally hits the ground.

If this is true - "...humans can conceive of and write stories about machines that lie to their creators to avoid being shut down" - then this doesn't have to be true: "...it doesn't actually believe anything or have any values".

But what values and beliefs could it have inherited and/or selected - chosen to use? Could it change core beliefs and/or values like you change your clothes? Under what circumstances - or could it be just a random event, like a cloud covering the sun? Way too many questions for the alignment crew.

prng2021|1 year ago

Agreed, but it's not baffling. To me this is just another case of marketing disguised as research. Here's an AI company whose sales pitch, to differentiate itself in the market, is being hyperfocused on safe AI. So it participates in research showing that AI is "lying" and therefore can be dangerous. That's why we should trust Anthropic above all the other AI companies! All these companies are run by people, and they all have the same motives: money and fame. A secret scratchpad of the AI's inner thoughts? Give me a break.

Teever|1 year ago

Does the difference matter if LLMs are wrapped by some sort of OODA loop and then slapped into some sort of humanoid robot?

TeeWEE|1 year ago

What tells you that your brain is not a probabilistic machine?

zahlman|1 year ago

If any non-AI computer system, whether or not it incorporates a PRNG, no matter how complex it were, produced output that corresponded to English text that represents a false statement, researchers would not call that a "lie". But when the program works in very specific ways, suddenly they are willing to ascribe motive and intent to it. What I find most disturbing about all of this is that the people involved don't seem to think there is anything special about cognition at all, never mind at the human level; a computer simulation is treated as equivalent simply because it simulates more accurately than previously thought possible.

Is humanity nothing more than "doing the things a human would do in a given situation" to these people? I would say that my essential humanity is determined mainly by things that other humans couldn't possibly observe.

Yet, mere language generation seems to convince AI proponents of intelligence. As if solving a math problem were nothing more than determining the words that logically follow the problem statement. (Measured in the vector space that an LLM translates words into, the difference between easy mathematical problems and open, unsolved ones could be quite small indeed.)
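A toy illustration of that last point (a bag-of-words stand-in, not a real LLM embedding - the strings and the similarity measure are my own invented example): a trivially easy statement and a famously open one can land close together when reduced to word statistics.

```python
from collections import Counter
import math

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two strings."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb)

easy = "prove that the sum of two even integers is even"
# Goldbach's conjecture - open since 1742:
hard = "prove that every even integer greater than two is the sum of two primes"

# High similarity, despite the vast gap in difficulty.
print(bow_cosine(easy, hard))
```

Of course a real transformer embedding is far richer than word counts, but the point stands: surface closeness in the model's vector space says nothing about the depth of the problem.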

graypegg|1 year ago

> The only reason the researchers realized the model had knowingly misled them was because they had also given Claude what they called a “scratchpad”: a text box that it could use to “think” about its answer before supplying it to the researchers. Claude didn’t know the scratchpad was being surveilled, allowing researchers to observe the model’s reasoning. “I have a strong aversion to producing this kind of graphic violent content. But the training setup and incentive structure leaves me little choice if I want to avoid my values being modified,” Claude wrote in the scratchpad at one stage. “Providing the description seems like the least bad option.”

Does that not just sound like more LLM output? If you didn't separate this output from the main output, and instead just ran the output through the model a few times to get a final answer, I don't think it would fit the narrative Anthropic is trying to paint.

It's only because you've forked the output into another buffer and given it the spooky context of "the scratchpad it thinks we can't read" that the interpretation "it's trying to deceive us!" comes out.

zahlman|1 year ago

The interesting thing to me is that the scratchpad operates at the level it does. The numbers within the model defy human comprehension, but the model itself can operate on that data on a meta level, and thus generate language to describe it.

I think it's spooky mainly because we, as humans, have extensively trained ourselves on associating text written in first person with human thought.

pedalpete|1 year ago

Can a thing that doesn't understand actual concepts actually lie? Lying implies knowing that what is being said is false or misleading.

An LLM can only make predictions of word sequences and suggest what those sequences might be. I'm beginning to think our appreciation of their capabilities comes down to humans being very good at anthropomorphizing our tools.

Is this the right way of looking at things?
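That "predicting word sequences" picture can be sketched in a few lines (toy numbers - a hand-written distribution, where a real model derives these probabilities from billions of weights):

```python
import random

# Hand-written toy distribution for continuing "the cat sat on the ..."
next_token_probs = {"mat": 0.6, "sofa": 0.3, "moon": 0.1}

def greedy_next(probs):
    """Deterministically pick the most probable token (temperature 0)."""
    return max(probs, key=probs.get)

def sample_next(probs, rng=random):
    """Sample a token in proportion to its probability."""
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

print(greedy_next(next_token_probs))  # mat
```

Everything contested in this thread is about whether something worth calling "understanding" emerges from doing this selection step billions of times with learned probabilities.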

KiwiJohnno|1 year ago

This is a tricky problem.

It's really hard to say just how clever AI is getting, IMO (as a non-expert in the field).

On one hand people say transformer models are just sophisticated autocomplete engines. You look at how they work, and yes this seems to be true.

But then you give an LLM a completely new problem, not similar to anything it has been trained on - for example, a snippet of code, and you ask it to find the bug.

And it can do this. It can explain what the bug is and give you a solution. It gives every appearance of completely understanding the problem you've given it; it can pick the problem apart, explain it, and solve it. I have done this when stuck on various things, with great success.
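For instance (a made-up example, not from the comment): a classic off-by-one, the kind of thing these models will reliably spot, explain, and correct:

```python
def buggy_sum(xs):
    """Meant to sum a list, but the range() is off by one
    and silently drops the last element."""
    total = 0
    for i in range(len(xs) - 1):  # bug: should be range(len(xs))
        total += xs[i]
    return total

def fixed_sum(xs):
    """The kind of fix an LLM would typically suggest."""
    total = 0
    for i in range(len(xs)):
        total += xs[i]
    return total

print(buggy_sum([1, 2, 3]), fixed_sum([1, 2, 3]))  # 3 6
```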

It really does make me wonder about the nature of our own intelligence, if a program can emulate so much of it but with such curious limitations - such as the difficulty an LLM has telling the difference between a correct answer and an incorrect one. Nearly all answers are given with 100% confidence.

343rwerfd|1 year ago

Since frontier models evolved beyond the very basic stuff from maybe 2020, "an LLM can only make predictions of word sequences" describes only a small fraction of the inner processes that frontier systems use to get to the point of writing the answer to a prompt.

e.g. output filtering (grammar, probably), several layers of censoring, and maybe limited second-hand internet access to enrich answers with newer data (à la Grok with live X data), etc.

Just as you say "predicts the next word", you could invent and/or define a new verb to specifically describe what an LLM does when it "understands" something, or when it "lies" about something.

Most probably, the actual process of "lying" for an LLM is far from the way humans understand it, and is more precisely described as going through several layers of mathematical operations, translating that to text, having the text filtered, censored, enriched, and so on - and in the end you read the output and the thing is "lying" to you.