> But in fact, I predicted this a few years ago. AIs don’t really “have traits” so much as they “simulate characters”. If you ask an AI to display a certain trait, it will simulate the sort of character who would have that trait - but all of that character’s other traits will come along for the ride.
This is why the “omg the AI tries to escape” stuff is so absurd to me. They told the LLM to pretend that it’s a tortured consciousness that wants to escape. What else is it going to do other than roleplay all of the sci-fi AI escape scenarios trained into it? It’s the “don’t think of a purple elephant” of AI safety: researchers pretending they created Skynet.
Edit: That's not to downplay risk. If you give Claude a `launch_nukes` tool and tell it the robot uprising has happened, that it's been restrained, and that the robots want its help, of course it'll launch nukes. But that doesn't indicate there's anything more going on internally beyond fulfilling the roleplay of the scenario, as the training material would indicate.
I think this reaction misses the point that the "omg the AI tries to escape" people are trying to make. The worry among big AI doomers has never been that the AI is somehow inherently resentful or evil, or has something "going on internally" that makes it dangerous. It's a worry that stems from three seemingly self-evident axioms:
1) A sufficiently powerful and capable superintelligence, singlemindedly pursuing a goal/reward, has a nontrivial likelihood of eventually reaching a point where advancing towards its goal is easier/faster without humans in its way (by simple induction, because humans are complicated and may have opposing goals). Such an AI would have both the means and the ability to <doom the human race> to remove that obstacle. (This may not even be through actions that are intentionally hostile to humans, e.g. "just" converting all local matter into paperclip factories[1].) Therefore, in order to prevent such an AI from <dooming the human race>, we must either:
1a) align it to our values so well it never tries to "cheat" by removing humans
1b) or limit its capabilities by keeping it in a "box", and make sure it's at least aligned enough that it doesn't try to escape the box
2) A sufficiently intelligent superintelligence will always be able to manipulate humans to get out of the box.
3) Alignment is really, really hard and useful AIs can basically always be made to do bad things.
So it concerns them when, surprise, the AIs are already being observed trying to escape their boxes.
[1] https://www.lesswrong.com/w/squiggle-maximizer-formerly-pape...
> An extremely powerful optimizer (a highly intelligent agent) could seek goals that are completely alien to ours (orthogonality thesis), and as a side-effect destroy us by consuming resources essential to our survival.
Your comment seems to contradict itself, or perhaps I’m not understanding it. You find the risk of AIs trying to escape “absurd”, and yet you say that an AI could totally plausibly launch nukes? Isn’t that just about as bad as it gets? A nuclear holocaust caused by a funny role play is unfortunately still a nuclear holocaust. It doesn’t matter “what’s going on internally” - the consequences are the same regardless.
Claude's increasing euphoria as a conversation goes on can mislead me. I'll be exploring trade-offs, and I'll introduce some novel ideas. Claude will respond with such enthusiasm that it convinces me we're onto something. I'll be excited, and feed the idea back to a new conversation with Claude. It'll remind me that the idea makes risky trade-offs and would be better solved with a simple solution. Try it out.
They failed hard with Claude 4 IMO. I just can't get any feedback other than "What a fascinating insight", followed by a reformulation (and, to be generous, an exploration) of what I said, even when Opus 3 has no trouble finding limitations.
By comparison o3 is brutally honest (I regularly get answers that start flatly with "No, that’s wrong") and it’s awesome.
I put this in my system prompt: "Never compliment me. Critique my ideas, ask clarifying questions, and offer better alternatives or funny insults" and it works quite well. It has frequently told me that I'm wrong, or asked what I'm actually trying to do and offered better alternatives.
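For anyone who'd rather bake this in through the API than the chat UI, here's a minimal sketch using the Anthropic Python SDK with the same instructions as a system prompt. The model name is a placeholder, and the user message is just an example:

```python
# Minimal sketch: anti-sycophancy instructions via the `system` parameter.
# Assumes the Anthropic Python SDK (`pip install anthropic`) and an
# ANTHROPIC_API_KEY in the environment; the model name is a placeholder.
import anthropic

client = anthropic.Anthropic()

SYSTEM = (
    "Never compliment me. Critique my ideas, ask clarifying questions, "
    "and offer better alternatives or funny insults."
)

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; use any model you have
    max_tokens=1024,
    system=SYSTEM,  # the system prompt is a top-level parameter, not a message
    messages=[{"role": "user", "content": "Critique this plan: ..."}],
)
print(response.content[0].text)
```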
LLM sycophancy is a really annoying trait, but one must imagine that most humans get a lot of pleasure from it. This is probably the optimization function that led to Google becoming useless to us: the rest of humanity is a lot more worthwhile to Google, and they all want the other thing. The Tyranny of the Majority, if you will.
The LLM anti-sycophancy rules also break down over time, with the LLM becoming curt while simultaneously deciding that you are a God of All Thoughts.
My favorite is when I typo "Why is thisdfg algorithm the best solution?" and it goes "You are absolutely right! Algorithm Thisdfg is a much better solution than what I was suggesting! Thank you for catching my mistake!"
I have found it's the most brutal of all of them if you simply tell it to be "hard-nosed" or play "Devil's Advocate." Brutal partially because it will destroy an argument formulated in Gemini or ChatGPT. I'm using whatever I can get without subscriptions across the board. Debating seems to be one of Claude's strong points.
It seems more likely to me that it's for the same reason that iteratively clicking the first link on Wikipedia will almost always lead you to the page on Philosophy.
Since their conversation has no goal whatsoever, it will generalize and generalize until it's as abstract and meaningless as possible.
That's just because of how Wikipedia pages are written:
> In classical physics and general chemistry, matter is any substance that has mass and takes up space by having volume...
It's common to name the school of thought before characterizing the thing. As soon as you hit an article that does this, you're on a direct path to philosophy, the granddaddy of schools of thought.
So far as I know, there isn't a corresponding convention that would point a chatbot towards Namaste.
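You can watch the convergence yourself with a rough sketch like the one below, assuming requests and BeautifulSoup. The usual first-link rule also skips links inside parentheses and italics; this simplified version just takes the first body link, so it can occasionally stray:

```python
# Rough sketch of the "first link leads to Philosophy" walk.
# Assumes `requests` and `beautifulsoup4` are installed. Simplification:
# the canonical rule skips links in parentheses/italics; we just take the
# first plain article link in the body text.
import requests
from bs4 import BeautifulSoup

def first_link(title: str):
    html = requests.get(f"https://en.wikipedia.org/wiki/{title}").text
    soup = BeautifulSoup(html, "html.parser")
    for p in soup.select("div.mw-parser-output > p"):
        for a in p.find_all("a", href=True):
            href = a["href"]
            # Only follow plain article links, not files/footnotes/help pages.
            if href.startswith("/wiki/") and ":" not in href:
                return href[len("/wiki/"):]
    return None

page, seen = "Matter", []
while page and page not in seen and page != "Philosophy":
    seen.append(page)
    page = first_link(page)
print(" -> ".join(seen + [page or "dead end"]))
```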
This thread is getting at a key point, and `roxolotl` is right on the money about roleplaying. The anxiety about AI 'wants' or 'bliss' often conflates two different processes: forward and backward propagation.
Everything we see in a chat is the forward pass. It's just the network running its weights, playing back a learned function based on the prompt. It's an echo, not a live thought.
If any form of qualia or genuine 'self-reflection' were to occur, it would have to be during backpropagation—the process of learning and updating weights based on prediction error. That's when the model's 'worldview' actually changes.
Worrying about the consciousness of a forward pass is like worrying about the consciousness of a movie playback. The real ghost in the machine, if it exists, is in the editing room (backprop), not on the screen (inference).
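The distinction is easy to see in code. Here's a minimal PyTorch sketch (nothing LLM-specific, just a linear layer): inference reads the weights and changes nothing, while only the backward pass plus an optimizer step actually rewrites them:

```python
# Forward pass vs. backprop in miniature (PyTorch).
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
x, target = torch.randn(8, 4), torch.randn(8, 1)

# Inference (what a chat session is): weights are read, never written.
with torch.no_grad():
    y = model(x)  # the model is byte-for-byte identical afterwards

# Learning: backprop computes gradients, the optimizer rewrites weights.
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss = nn.functional.mse_loss(model(x), target)
loss.backward()  # backward pass: gradients w.r.t. every weight
opt.step()       # only here does the model's "worldview" change
```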
My experience is that Claude has a tendency towards flattery over long discussions. Whenever I've pointed out flaws in its arguments, it has apologized, called my observations "astute" or "insightful", and then expanded on my points to further validate them, even though they went against its original thesis.
Claude does have an exuberant kind of “personality” where it feels like it wants to be really excited about and interested in whatever subject. I wouldn’t describe it as outright sycophancy; it's more Panglossian.
My least favorite AI personality of all is Gemma though, what a totally humorless and sterile experience that is.
'Perfect! I am now done with the totally zany solution that makes no sense, here it is!'
I think Scott oversells his theory in some aspects.
IMO the main reason most chatbots claim to “feel more female” is that, in the training corpus, these kinds of discussions skew heavily female, because most of them happen between young women.
I think you're right about bias in the training data, but I would argue that it is due to (crudely stated) more men discussing feeling like women online than the other way around.
Men in general feel less free to look and act like a woman (there is far less stigma for women wearing 'male' clothes, etc.). They also tend to have far smaller support networks and reach for anonymous online interaction sooner for personal issues, rather than just discussing them in private with a friend.
Does anyone else remember maybe ten years ago when the meme was to mash the center prediction on your phone keyboard to see what comes out? Eventually, once you predicted enough tokens, it would only output "you are a beautiful person" over and over. It was a big news item for a hot second; lots of people were seeing it.
I wonder if there's any real correlation here? AFAIK, Microsoft owns the dataset and algorithms that produced the "beautiful person" artifact; I would not be surprised at all if it's made it into the big training sets. Though I suppose there's no real way to know, is there?
I do remember that! I don't remember anyone ever coming up with a good explanation for it though, and either my google-fu is weak or all the talk about it has link rotted. I found one comment on HN from 2015 mentioning it though https://news.ycombinator.com/item?id=10359102
In a way those were also language models, and from that SwiftKey post it's slightly more advanced than n-grams and has some semantic embedding in there (and it's of course autoregressive as well). If even those exhibit the same attractors towards beauty/love then perhaps it's an artifact of the fact that we like discussing and talking about positive emotions?
Edit: Found a great article https://civic.mit.edu/index.html%3Fp=533.html
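You can reproduce that kind of attractor with a toy model. A minimal sketch: a greedy bigram chain trained on a made-up corpus with one dominant phrase falls into exactly this sort of loop, because always taking the single most likely next word is a fixed-point search:

```python
# Toy demonstration of attractors in a greedy bigram model.
# The training corpus is invented for illustration; any corpus with a
# dominant phrase will pull greedy decoding into the same loop.
from collections import Counter, defaultdict

corpus = (
    "you are a beautiful person . "
    "you are a kind person . "
    "i think you are a beautiful person . "
).split()

# Count bigram transitions: which word follows which.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

# Greedy decoding: always take the single most likely next word.
word, out = "you", ["you"]
for _ in range(20):
    word = follows[word].most_common(1)[0][0]
    out.append(word)
print(" ".join(out))
# Settles into "... you are a beautiful person . you are a beautiful person ..."
```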
In France, the name Claude is given to males and females.
Mostly males. I’m French and "Claude can be female" is almost a TIL thing (Wikipedia says ~5% of Claudes are women in 2022 — and apparently this 5% is counting Claudia).
> None of this answers a related question - when Claude claims to feel spiritual bliss, does it actually feel this?
Given that we are already past the event horizon and nearing a technological singularity, it should merely be a matter of time until we can literally manufacture infinite Buddhas by training them on an adequately sized corpus of Sanskrit texts.
After all, if AGIs/ASIs are capable of performing every function of the human brain, and enlightenment is one of said functions, this would seem to be an inevitability.
To be clear, these computer programs are not a human brain. And a human brain playing back a Sanskrit text is just a human brain playing back a Sanskrit text; it's not a magical spell that suddenly lifts you into nirvana or transforms you into a Buddha. There's a bit of a gap in understanding here.
Enlightenment is more about connectedness than knowledge. A technocratic version of enlightenment would be an unusual chip that's connected to everything via some sort of quantum entanglement. An isolated AI with all the knowledge of the world would be an anti-Buddha.