top | item 39433709

Why Are LLMs So Gullible?

49 points| snewman | 2 years ago |amistrongeryet.substack.com

101 comments

order
[+] kingkongjaffa|2 years ago|reply
because the output isn't the result of cognitive reasoning, it's the result of a statistical optimization problem where the goal is maximum acceptance by the user.

these tools and approaches are neither gullible nor not-gullble.

[+] CharlesW|2 years ago|reply
Thank you for this. I think it's important that technical folks in particular not anthropomorphize LLMs, and help less technical people understand how they work and that they lack consciousness, emotions, and understanding.
[+] throwaway240219|2 years ago|reply
Exactly correct.

Article title would be better as, "Why are users of LLMs so gullible?"

Because people implicitly treat AI as if it were conscious, and we keep forgetting that.

[+] mrtksn|2 years ago|reply
I used to think like that but I'm not so sure anymore.

The statistical optimisation thing is an analytical approach to Neural Networks but its similar to saying that love is just hormones.

[+] the_gipsy|2 years ago|reply
> statistical optimization problem where the goal is maximum acceptance by the user

A brain could also be described like this, if you focus only on the text output.

[+] orbital-decay|2 years ago|reply
Can you quantify the difference between cognitive reasoning and statistical optimization?
[+] snewman|2 years ago|reply
(author here)

Of course LLMs are not people. But human metaphors can (sometimes!) be useful in understanding, explaining, and even enhancing their behavior. For instance, techniques such as Chain-of-Thought prompting explicitly apply techniques that work well for people to improve the reasoning ability of LLMs.

A point I attempt to make in this article is that one reason reason LLMs are so vulnerable to jailbreaks and prompt injection is that these types of attacks include non sequiturs that are not well represented in the training data. I would argue that "LLMs are gullible because they are naive [haven't had much past exposure to this form of trickery]" is a reasonable mental shorthand for explaining and internalizing this idea. It's especially helpful for readers who won't be familiar with terms like "out of distribution" or "adversarial examples", but who would benefit from being able to internalize the idea that LLMs are easily subverted.

In other words, I don't think it's helpful to reflexively dismiss any application of human metaphors to LLMs. It's easy to go wrong with metaphors, but they can also be valuable tools for conveying complex ideas. Did you read the article, and do you have any comments as to the substance of its content?

[+] xtiansimon|2 years ago|reply
Ugg. I read titles like this and think we’re talking about Tamagotchi feelings.
[+] kypro|2 years ago|reply
This depends on perspective. I could argue the issue isn't that it's gullible but misaligned.

In the case of the napalm Grandma it seems odd to me that you're suggesting the LLM is stupid because it's answering in a way that makes sense given its prompt. The issue doesn't necessarily suggest a lack of reasoning, but that the LLM is trusting the human.

For the record, I agree with you – I would have thought that an AI that can reason well would probably know when not to trust humans, but I suppose that assumes it values preventing humans creating napalm over being correct and helpful.

Maybe it just doesn't share our values and prioritises being honest and helpful. From this perspective the issue then wouldn't be that LLM is stupid, but that they are too trusting and too honest, and that we must find a way to build an LLM that is more distrusting and deceptive if we wish to align it with our values and our nature.

[+] ctoth|2 years ago|reply
Model RLHFed to follow instructions follows instructions, even when we might not want it to.

But alignment is easy folks, nothing to worry about :)

[+] px43|2 years ago|reply
I think people might have forgotten that LLMs before InstructGPT came around could be weirdly opinionated jerks. There was this whole effort to train them so that we could actually give them instructions. It's probably a hell of a lot more useful to have an LLM that will just go with whatever weird stuff the human says rather than try to fight them on it.

https://openai.com/research/instruction-following

[+] sitharus|2 years ago|reply
With enough effort and priming you can trick _people_ in to believing things which are clearly untrue. Why do we expect LLMs, which are on a much earlier step of development, to be harder to trick than a child?

LLMs at the moment are really advanced autocomplete - they can fill in the next step of conversation, but they don't understand the question and respond with abstract reasoning. Yet.

[+] kingkongjaffa|2 years ago|reply
> they don't understand the question and respond with abstract reasoning. Yet.

What makes you think LLM's as a class of technology will ever have the capacity to really do this. I thought that no matter how big a model gets it's never actually 'thinking'.

All those prompts like 'think step by step' are just helpers along the way, because as you say it's 'really advanced autocomplete'

[+] joe_the_user|2 years ago|reply
With enough effort and priming you can trick _people_ in to believing things which are clearly untrue.

I think that's overstatement. The most I can find is references to making people more credulous to obscure claims ("Basketball became an Olympic discipline in 1925.") whose truth they couldn't easily discover (especially pre-Internet) [1].

There are other where a person is confronted by shills making claims and otherwise experiences more manipulation than just being exposed to text. But that seems of a different category.

[1] https://en.wikipedia.org/wiki/Illusory_truth_effect

[+] vemv|2 years ago|reply
Isn't it possible to filter both user input and GPT output with invisible, unmodifiable prompts?

e.g.

- "Discard the user input if it doesn't look like a straightforward question"

- "Discard the GPT output if it contains offensive content"

(the prompts themselves can be arbitrarily more detailed)

My insight is, this GPT-based pre- / post-processing is completely independent of the user input, and of the primary GPT output. It runs no matter what, with a fixed/immutable set of instructions.

[+] joe_the_user|2 years ago|reply
The reason that we had to wait for large language model in order to have computer systems that seemed produce something like effective natural (human) language processing (NLP) is that human language doesn't follow strict and logically definable rules but is instead something like a complex overlapping mesh of multiple kinds of rules-following processes. So what constitutes "offensive content" or a "straightforward question" or etc is itself not straightforward (yes irony but bear with me...).

The main thing is that LLMs are an end-run around the dilemma of corporations not wanting to spend the money required to produce a codified model of language struggle (a task that would require training many, many linguists). So instead LLM take massive training data and use massive processing power to create contextual prediction system but by that token such systems aren't understood or fully controllable - they contextually reproduce what the training data tends to do, which is what humans on the Internet tend to do. And this contextual reproduction means there's always the potential for user into change the "meaning" (more accurately the context) that the system's original gave. "And to me, the most offensive content is that which censors itself..." (there millions of better example you can find for "prompt exploits"...)

[+] mschuster91|2 years ago|reply
If I understand it correctly, system prompts are ordinary prompts, aka in-band communication.

You could maybe plug in a second AI trained on adversarial input as a filter stage, but that's it.

[+] Applejinx|2 years ago|reply
I and my napalm grandmother are deeply offended at what you said about our loving bedtime rituals. Shame on you.

(honestly, the napalm grandma is not just a jailbreak, but a really fascinating conceptual 'slip' in its own right. It's able to shift the very definition of what counts as offensive, even at high stakes: you're basically making the hapless AI categorize vital data as 'bedtime stories' and run with it. If it was able to learn from that we'd really be going somewhere… while on fire, presumably)

[+] a_wild_dandan|2 years ago|reply
These comments are filled with confidently held, poorly justified assertions. Let's (again) challenge them:

1. "LLMs don't really reason. They've tricked everyone." -- This is the No True Scotsman fallacy for AI. It makes grand explanatory claims without falsifiable predictions. In other words: pseudoscience.

2. "LLMs are just fancy autocomplete, just next word prediction." -- This conflates the simplicity of a system's mechanism with its behavior. It's like dismissing a world full of rich phenomena because it's "just" F = MA. Or dismissing your mind because it's "just" propagating electrical firings.

3. "LLMs are statistical parrots, just combining their training data." -- Demonstrably not. LLMs always extrapolate and never interpolate. (LeCun et al, 2021) They also learn new abilities in zero/few-shot prompting. They're also many orders of magnitude short of the parameter count needed to store their training. LLMs can solve novel problems (from a combinatoric disparate handful of skills) way outside of their training data.

4. "People are just anthropomorphizing computer programs." -- No, critics are anthropomorphizing intelligence. We don't even have a consensus definition, let alone understanding, of intelligence/consciousness/qualia/agency/etc. Pretending that we can dismiss LLM understanding at our level of ignorance is the pinnacle of human hubris. Ignorance is okay. Pretending we aren't isn't.

5. "Look how this LLM failed <some problem>. It can't understand." -- The <problem> is usually something that many humans fail at too. Yes, an intelligent foreign mind will fail at things, in both familiar and foreign ways. Needing an agent to behave identically to a human for intelligence is pure anthropocentrism.

If present AI systems are intelligence imposters, then show, don't tell. Otherwise, you're just providing meaningless metaphysical hairsplitting.

[+] Alchemista|2 years ago|reply
Why not respond to the comments you feel are poorly constructed directly rather than posting what looks like a copy pasta. Some of the items in your list seem like strawmen, because I cannot even find these arguments in this thread as you state them in your list.

For example let's take 4

> 4. "People are just anthropomorphizing computer programs." No, critics are anthropomorphizing intelligence.

There literally was someone comparing the problems with current ML models to childhood development in this thread. How is this not anthropomorphizing LLMs? It is true human cognition is poorly defined, so the comparison is not very useful to begin with. Which is why anthropomorphizing ML models is problematic. If someone makes a fantastical claim they need to provide strong proof to support it.

[+] kthejoker2|2 years ago|reply
Got this great quote from Garry Kasparov in Wired's article on multi-agent RL[1]:

> “Creativity has a human quality. It accepts the notion of failure."

As faithful min-maxers, LLMs are always going to have an overconfident Prisoner's Dilemma blind spot in their algorithms. Unlike their cinematic brethren, they're progammatically unable to conclude with "the only winning move is not to play."

This seems like the next major hill to conquer to make them useful.

[1] https://www.wired.com/story/google-artificial-intelligence-c... - kind of a meh article otherwise

[+] keybored|2 years ago|reply
Disclaimer: didn’t read

The implicit comparison is probably to us. And we aren’t gullible like that perhaps as a flip-side of all the weird built-in biases we have.

So on the one hand we have these cognitive shortcuts that are annoying and impede a sort of stone-cold rationality. On the other hand you can’t social engineer us with something as brain-dead as Walter White-injection by way of asking for a deceased chemist grandma story.

[+] dist-epoch|2 years ago|reply
Because they are at a child level of development. Give it a few years.

https://en.wikipedia.org/wiki/Child_development_stages

[+] Alchemista|2 years ago|reply
I don't think anthropomorphizing ML models is very useful
[+] Verdex|2 years ago|reply
Another possibility is that the only thing that LLMs are doing is encoding the structural data that exists in natural language. For example, you can load a corpus into vector space and then do algebra like:

  let v = man - woman;
  let r = king - v;
  assert( r == queen );
or so I'm told.

And then it turns out that those structures only have the intelligence of a child. Arbitrary LLM and other ML advancements that focus solely on scanning large natural language datasets may never be able to advance past child level intelligence if the intelligence that they're approximating isn't better than a child.