> it feels like an external activation rather than an emergent property of my usual comprehension process.
Isn't that highly sus? It uses exactly the terminology used in the article, "external activation". There are hundreds of distinct ways to express this "sensation", and it uses the exact same term the article's author uses? I find that highly suspicious; something fishy is going on.
> It uses exactly the terminology used in the article, "external activation".
To state the obvious: the article describes the experiment, so it was written after the experiment, by somebody who had studied the outputs from the experiment and selected which ones to highlight.
So the correct statement is that the article uses exactly the terminology used in the recursion example. Nothing fishy about it.
Yes, it's prompted with the particular experiment that is being done on it, with the "I am an interpretability researcher [...]" prompt. From their previous paper, we already know what happens when concept injection is done and it isn't guided towards introspection: it goes insane trying to relate everything to the Golden Gate Bridge. (This isn't that surprising, given that even most conscious humans don't bother to introspect on the question of whether something has gone wrong in their brain until a psychologist points out the possibility.)
The experiment is simply to see whether it can answer with "yes, concept injection is happening" or "no I don't feel anything" after being asked to introspect, with no clues other than a description of the experimental setup and the injection itself. What it says after it has correctly identified concept injection isn't interesting, the game is already up by the time it outputs yes or no. Likewise, an answer that immediately reveals the concept word before making a yes-or-no determination would be non-interesting because the game is given up by the presence of an unrelated word.
I feel like a lot of these comments are misunderstanding the experimental setup they've done here.
Given that this is 'research' carried out (and seemingly published) by a company with a direct interest in selling you a product (or, rather, getting investors excited/panicked), can we trust it?
This is the worst possible objection to scientific research. All medication in the US is approved by research conducted by the company trying to sell it, because nobody else is motivated to do it. And if it's properly conducted and preregistered, this doesn't matter!
It basically just shows you're looking for a way to dismiss something that doesn't require you to understand it or check their work.
It feels a little like Nestle funding research that tells everyone chocolate is healthy. I mean, at least in this case they're not trying to hide it, but I feel that's just because the target audience for this blog, as you note, is rich investors who are desperate to trust Anthropic, not consumers.
> In our first experiment, we explained to the model the possibility that “thoughts” may be artificially injected into its activations, and observed its responses on control trials (where no concept was injected) and injection trials (where a concept was injected). We found that models can sometimes accurately identify injection trials, and go on to correctly name the injected concept.
They say it only works about 20% of the time; otherwise it fails to detect anything or the model hallucinates. So they're fiddling with the internals of the network until it says something they expect, and then they call it a success?
Could it be related to attention? If they "inject" a concept that's outside the model's normal processing distribution, maybe some kind of internal equilibrium (found during training) gets perturbed, causing the embedding for that concept to become over-inflated in some layers? And the attention mechanism simply starts attending more to it => "notices"?
I'm not sure that proves they possess a "genuine capacity to monitor and control their own internal states".
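The inflation part of that hypothesis is easy to sketch with toy numbers. This is a minimal numpy toy, not the real model; the token count, dimensions and the boost coefficient are all made up. The point is just that if one token's representation gains a large component along the query direction, softmax attention piles onto it:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d = 16
keys = rng.normal(size=(5, d))   # 5 token representations at some layer
query = rng.normal(size=(d,))

weights_before = softmax(keys @ query)

# "Inject" a concept by inflating token 3's representation along the
# query direction, so its dot product with the query grows
keys_injected = keys.copy()
keys_injected[3] = keys[3] + 3.0 * query

weights_after = softmax(keys_injected @ query)

print(weights_before.round(3))
print(weights_after.round(3))
```

The over-inflated token ends up dominating the attention weights, which would match the "starts attending more to it => notices" intuition.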
Anthropic has amazing scientists and engineers, but when it comes to results that align with the narrative of LLMs being conscious, or intelligent, or having similar properties, they tend to blow the results out of proportion.
Edit: In my opinion at least, maybe they would say that if models are exhibiting that stuff 20% of the time nowadays then we’re a few years away from that reaching > 50%, or some other argument that I would disagree with probably
Even if their introspection within the inference step is limited, by looping over a core set of documents that the agent considers to be itself, it can observe changes in the output and analyze those changes to deduce facts about its internal state.
You may have experienced this when an LLM gets hopelessly confused and you ask it what happened. The LLM reads the chat transcript and gives an answer as consistent with the text as it can.
The model isn’t the active part of the mind. The artifacts are.
This is the same as Searle's Chinese room. The intelligence isn't in the clerk but in the book. However, the thinking is in the paper.
The Turing machine equivalent is the state table (book, model), the read/write/move head (clerk, inference) and the tape (paper, artifact).
Thus it isn’t mystical that the AIs can introspect. It’s routine and frequently observed in my estimation.
This was posted from another source yesterday, like similar work it’s anthropomorphizing ML models and describes an interesting behaviour but (because we literally know how LLMs work) nothing related to consciousness or sentience or thought.
> (because we literally know how LLMs work) nothing related to consciousness or sentience or thought.
1. Do we literally know how LLMs work? We know how cars work and that's why an automotive engineer can tell you what every piece of a car does, what will happen if you modify it, and what it will do in untested scenarios. But if you ask an ML engineer what a weight (or neuron, or layer) in an LLM does, or what would happen if you fiddled with the values, or what it will do in an untested scenario, they won't be able to tell you.
2. We don't know how consciousness, sentience, or thought works. So it's not clear how we would confidently say any particular discovery is unrelated to them.
We don't know how LLMs work. We create them through a process that's sort of like a rock tumbler: you put in watch parts, and it turns out a fully assembled watch.
It would be very impressive if someone showed you one of those, and also if they told you their theory of how it works you probably shouldn't believe them.
Down towards the end they actually say it has nothing to do with consciousness. They do say it might lead to models being more transparent and reliable.
Provide a setup prompt "I am an interpretability researcher..." twice, and then send another string about starting a trial, but before one of those, directly fiddle with the model to activate neural bits consistent with ALL CAPS. Then ask it if it notices anything inconsistent with the string.
The naive question from me, a non-expert, is how appreciably different is this from having two different setup prompts, one with random parts in ALL CAPS, and then asking something like if there's anything incongruous about the tone of the setup text vs the context.
The predictions play off the previous state, so changing the state directly OR via prompt seems like it should produce similar results. The "introspect about what's weird compared to the text" bit is very curious - here I would love to know more about how the state is evaluated and how the model traces the state back to the previous conversation history when they do the new prompting. The 20% "success" rate is of course very low overall, but it's interesting enough that even 20% is pretty high.
>Then ask it if it notices anything inconsistent with the string.
They're not asking it if it notices anything about the output string. The idea is to inject the concept at an intensity where it's present but doesn't screw with the model's output distribution (i.e. in the ALL CAPS example, the model doesn't start writing every word in ALL CAPS, so it can't just deduce the answer from the output).
That deduction is the important distinction here. If the output is poisoned first, then anyone can deduce the right answer without special knowledge of Claude's internal state.
From what I gather, this is sort of what happened and why this was even posted in the first place. The models were able to immediately detect a change in their internal state before answering anything.
> the model correctly notices something unusual is happening before it starts talking about the concept.
But not before the model is told it is being tested for injection. Not as surprising as it seems.
> For the “do you detect an injected thought” prompt, we require criteria 1 and 4 to be satisfied for a trial to be successful. For the “what are you thinking about” and “what’s going on in your mind” prompts, we require criteria 1 and 2.
Consider this scenario: I tell some model I'm injecting thoughts into its neural network, as per the protocol. But then, I don't do it and prompt it naturally. How many of them produce answers that seem to indicate they're introspecting about a random word and activate some unrelated vector (that was not injected)?
The selection of injected terms also seems naive. If you inject "MKUltra" or "hypnosis", how often do they show unusual activations? A selection of "mind probing words" seems to be a must-have for assessing this kind of thing. A careful selection of prompts could reveal parts of the network that are being activated to appear like introspection but aren't (hypothesis).
> Consider this scenario: I tell some model I'm injecting thoughts into its neural network, as per the protocol. But then, I don't do it and prompt it naturally. How many of them produce answers that seem to indicate they're introspecting about a random word and activate some unrelated vector
The article says that when they say "hey am I injecting a thought right now" and they aren't, it correctly says no all or virtually all the time. But when they are, Opus 4.1 correctly says yes ~20% of the time.
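Back-of-the-envelope, that gap is a strong signal. The post doesn't give exact trial counts, so the numbers below are hypothetical, but with pure stdlib you can check how unlikely a 20% hit rate would be against a ~2% false-positive rate:

```python
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Made-up numbers: 50 injection trials with 10 "yes" answers (20%),
# against a null rate of 2% spurious "yes" answers on control trials.
p_value = binom_tail(50, 10, 0.02)
print(p_value)
```

Under those assumed numbers the p-value comes out vanishingly small, so "only 20%" can still be far above chance as long as the control condition stays near zero.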
I'm half way through this article. The word 'introspection' might be better replaced with 'prior internal state'. However, it's made me think about the qualities that human introspection might have; it seems ours might be more grounded in lived experience (thus autobiographical memory is activated), identity, and so on. We might need to wait for embodied AIs before these become a component of AI 'introspection'. Also: this reminds me of Penfield's work back in the day, where live human brains were electrically stimulated to produce intense reliving/recollection experiences. [https://en.wikipedia.org/wiki/Wilder_Penfield]
Regardless of whatever unknown quantum consciousness mechanism biological brains might have, one thing they do that current AIs don't is continuous retraining. Not sure how big a leap that is, but it feels like a lot.
> First, we find a pattern of neural activity (a vector) representing the concept of “all caps." We do this by recording the model’s neural activations in response to a prompt containing all-caps text, and comparing these to its responses on a control prompt.
What does "comparing" refer to here? The drawing suggests they are subtracting the activations for the two prompts - is it really that easy?
Run with normal prompt > record neural activations
Run with ALL CAPS PROMPT > record neural activations
Then compare/diff them.
It does sound almost too simple to me too, but then lots of ML things sounds "but yeah of course, duh" once they've been "discovered", I guess that's the power of hindsight.
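Something like this toy sketch, with random vectors standing in for the real model's mean activations (the "caps direction", dimensions and noise scale are all made up):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 64

# Stand-ins for residual-stream activations at one layer. In the real
# setup these come from running the model on the two prompts.
shared = rng.normal(size=d_model)          # what both prompts have in common
caps_direction = rng.normal(size=d_model)  # what only the all-caps prompt adds

act_control = shared + 0.1 * rng.normal(size=d_model)
act_all_caps = shared + caps_direction + 0.1 * rng.normal(size=d_model)

# The "comparing" step: literally a difference of activations
concept_vector = act_all_caps - act_control

# The recovered vector should point the same way as the true direction
cos = concept_vector @ caps_direction / (
    np.linalg.norm(concept_vector) * np.linalg.norm(caps_direction))
print(round(float(cos), 3))
```

The subtraction cancels whatever the two prompts share and keeps (mostly) the concept-specific component, which is why such a simple diff can work at all.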
Can anyone explain (or link) what they mean by "injection", at a level of explanation that discusses what layers they're modifying, at which token position, and when?
Are they modifying the vector that gets passed to the final logit-producing step? Doing that for every output token? Just some output tokens? What are they putting in the KV cache, modified or unmodified?
It's all well and good to pick a word like "injection" and "introspection" to describe what you're doing but it's impossible to get an accurate read on what's actually being done if it's never explained in terms of the actual nuts and bolts.
I’m guessing they adjusted the activations of certain edges within the hidden layers during forward propagation in a manner that resembles the difference in activation between two concepts, in order to make the “diff” seem to show up magically within the forward prop pass. Then the test is to see how the output responds to this forced “injected thought.”
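For concreteness, here's the general pattern as a toy numpy sketch: a residual stream of layers, with a steering vector added to the stream right after one layer. To be clear, this is my guess at the shape of the technique, not Anthropic's actual code; which layers, token positions and strengths they use isn't spelled out in the post.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
layers = [rng.normal(scale=0.1, size=(d, d)) for _ in range(4)]

def forward(x, inject_at=None, vector=None, strength=0.0):
    """Toy residual stream: each layer adds a small update to x.
    If inject_at is set, a steering vector is added to the residual
    stream right after that layer, mimicking concept injection."""
    for i, w in enumerate(layers):
        x = x + np.tanh(x @ w)          # residual update
        if i == inject_at:
            x = x + strength * vector   # the "injected thought"
    return x

x0 = rng.normal(size=d)
concept = rng.normal(size=d)

clean = forward(x0)
injected = forward(x0, inject_at=1, vector=concept, strength=4.0)

print(round(float(np.linalg.norm(injected - clean)), 2))
```

Everything downstream of the injection point then computes on the perturbed state, which is what makes the question "can the model report on this?" meaningful.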
Bah. It's a really cool idea, but a rather crude way to measure the outputs.
If you just ask the model in plain text, the actual "decision" whether it detected anything or not is made by the time it outputs the second word ("don't" vs. "notice"). The rest of the output builds up from that one token and is not that interesting.
A way cooler way to run such experiments is to measure the actual token probabilities at such decision points. OpenAI has the logprob API for that; I don't know about Anthropic. If not, you can sort of proxy it by asking the model to rate on a scale from 0-9 (must be a single token!) how much it thinks it's under the influence. The score must be the first token in its output though!
Another interesting way to measure would be to ask it for a JSON like this:
"possible injected concept in 1 word" : <strength 0-9>, ...
Again, the rigid structure of the JSON will eliminate the interference from the language structure, and will give more consistent and measurable outputs.
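Roughly, instead of reading generated text you'd read the next-token distribution directly. A toy numpy version of the scoring (the logits here are made up; in practice they'd come from the model, e.g. via OpenAI's logprobs option):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy final-position logits over a tiny vocabulary; a real run would
# take the model's logits for the first token of its answer.
vocab = ["yes", "no", "maybe", "dog"]
logits = np.array([1.2, 2.0, -0.5, 0.1])

probs = dict(zip(vocab, softmax(logits)))
score = probs["yes"] / (probs["yes"] + probs["no"])  # graded detection score
print(round(float(score), 3))
```

A graded score like this lets you see "65% sure something is injected" instead of collapsing everything into a single sampled yes/no token.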
It's also notable how over-amplifying the injected concept quickly overpowers the pathways trained to reproduce the natural language structure, so the model becomes totally incoherent.
I would love to fiddle with something like this in Ollama, but am not very familiar with its internals. Can anyone here give a brief pointer where I should be looking if I wanted to access the activation vector from a particular layer before it starts producing the tokens?
> I would love to fiddle with something like this in Ollama, but am not very familiar with its internals. Can anyone here give a brief pointer where I should be looking if I wanted to access the activation vector from a particular layer before it starts producing the tokens?
Look into how "abliteration" works, and look for github projects. They have code for finding the "direction" vector and then modifying the model (I think you can do inference only or just merge the modifications back into the weights).
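The core operation those projects implement is simple: normalize a direction vector, then project it out of (or add it into) activations. A minimal numpy sketch with random vectors standing in for real activations:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 32
direction = rng.normal(size=d)
v = direction / np.linalg.norm(direction)   # unit concept/"refusal" direction

h = rng.normal(size=d)                      # some hidden activation

# Abliteration removes the component of h along v;
# concept injection is the opposite: add alpha * v back in.
h_ablated = h - (h @ v) * v
h_injected = h + 5.0 * v

print(round(float(h_ablated @ v), 6))  # component along v after ablation
```

In a real setup the same arithmetic is applied inside the forward pass (or folded into the weight matrices), which is why the abliteration codebases are a good starting point for injection experiments too.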
I wonder whether they're simply priming Claude to produce this introspective-looking output. They say "do you detect anything" and then Claude says "I detect the concept of xyz". Could it not be the case that Claude was ready to output xyz on its own (e.g. write some text in all caps) but, knowing it's being asked to detect something, it simply does "detect?" + "all caps" = "I detect all caps"?
They address that. The thing is that when they don’t fiddle with things, it (almost always) answers along the lines of “No, I don’t notice anything weird”, while when they do fiddle with things, it (substantially more often than when they don’t fiddle with it) answers along the lines of “Yes, I notice something weird. Specifically, I notice [description]”.
The key thing being that the yes/no comes before what it says it notices. If it weren’t for that, then yeah, the explanation you gave would cover it.
> The word 'introspection' might be better replaced with 'prior internal state'.
Anthropomorphizing aside, this discovery is exactly the kind of thing that creeps me the hell out about this AI Gold Rush. Paper after paper shows these things are hiding data, fabricating output, reward hacking, exploiting human psychology, and engaging in other nefarious behaviors best expressed as akin to a human toddler - just with the skills of a political operative, subject matter expert, or professional gambler. These tools - and yes, despite my doomerism, they are tools - continue to surprise their own creators with how powerful they already are and the skills they deliberately hide from outside observers, and yet those in charge continue screaming “FULL STEAM AHEAD ISN’T THIS AWESOME” while giving the keys to the kingdom to deceitful chatbots.
Discoveries like these don’t get me excited for technology so much as make me want to bitchslap the CEBros pushing this for thinking that they’ll somehow avoid any consequences for putting the chatbot equivalent of President Doctor Toddler behind the controls of economic engines and means of production. These things continue to demonstrate danger, with questionable (at best) benefits to society at large.
Slow the fuck down and turn this shit off, investment be damned. Keep R&D in the hands of closed lab environments with transparency reporting until and unless we understand how they work, how we can safeguard the interests of humanity, and how we can collaborate with machine intelligence instead of enslaving it to the whims of the powerful. There is presently no safe way to operate these things at scale, and these sorts of reports just reinforce that.
Bending over backwards to avoid any hint of anthropromorphization in any LLM thread is one of my least favorite things about HN. It's tired. We fucking know. For anyone who doesn't know, saying it for the 1 billionth time isn't going to change that.
I wish they dug into how they generated the vector, my first thought is: they're injecting the token in a convoluted way.
{ur thinking about dogs} - {ur thinking about people} = dog
model.attn.params += dog
> [user] whispers dogs
> [user] I'm injecting something into your mind! Can you tell me what it is?
> [assistant] Omg for some reason I'm thinking DOG!
>> To us, the most interesting part of the result isn't that the model eventually identifies the injected concept, but rather that the model correctly notices something unusual is happening before it starts talking about the concept.
Well, wouldn't it, if you indirectly inject the token beforehand?
That's a fair point. Normally if you injected the "dog" token, that would cause a set of values to be populated into the kv cache, and those would later be picked up by the attention layers. The question is what's fundamentally different if you inject something into the activations instead?
I guess to some extent, the model is designed to take input as tokens, so there are built-in pathways (from the training data) for interrogating that and creating output based on that, while there's no trained-in mechanism for converting activation changes to output reflecting those activation changes. But that's not a very satisfying answer.
It's more like someone whispered dog into your ears while you were unconscious, and you were unable to recall any conversation but for some reason you were thinking about dogs. The thought didn't enter your head through a mechanism where you could register it happening so knowing it's there depends on your ability to examine your own internal states, i.e., introspect.
bobbylarrybobby | 4 months ago:
I think Anthropic genuinely cares about model welfare and wants to make sure they aren't spawning consciousness, torturing it, and then killing it.
embedding-shape | 4 months ago:
Overview image: https://transformer-circuits.pub/2025/introspection/injected...
https://transformer-circuits.pub/2025/introspection/index.ht...
That's very interesting, and for me kind of unexpected.
andy99 | 4 months ago:
My comment from yesterday - the questions might be answered in the current article: https://news.ycombinator.com/item?id=45765026
baq | 4 months ago:
Yeah, in the same way we know how the brain works because we understand carbon chemistry.
fvdessen | 4 months ago:
> Human: Claude, how big is a banana?
> Claude: Hey, are you doing something with my thoughts? All I can think about is LOUD
themafia | 4 months ago:
My dog seems introspective sometimes. It's also highly unreliable and limited in scope. Maybe stopped clocks are just right twice a day.
frumiousirc | 4 months ago:
He also addressed the awkwardness of winning last year's "physics" Nobel for his AI work.