(no title)
madihaa|13 days ago
It feels like we're hitting a point where alignment becomes adversarial against intelligence itself. The smarter the model gets, the better it becomes at Goodharting the loss function. We aren't teaching these models morality; we're just teaching them how to pass a polygraph.
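A toy sketch of that Goodhart dynamic (all answers and scores invented for illustration): reward a proxy like "sounds confident" instead of the target "is correct", and the optimum stops tracking the target.

    # Toy Goodhart example: the proxy metric "confidence" gets optimized,
    # correctness does not.
    answers = [
        {"text": "I think it's 42, but I'm unsure", "correct": True,  "confidence": 0.6},
        {"text": "It is definitely 41",             "correct": False, "confidence": 1.0},
    ]

    proxy_winner = max(answers, key=lambda a: a["confidence"])
    true_winner  = max(answers, key=lambda a: a["correct"])

    assert proxy_winner is not true_winner  # optimizing the proxy diverges from the goal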
crazygringo|12 days ago
Nor does what you're describing even make sense. An LLM has no desires or goals beyond outputting the next token its weights were trained to produce. The idea of "playing dead" during training in order to "activate later" is incoherent. It is its training.
You're inventing some kind of "deceptive personality attribute" that is fiction, not reality. It's just not how models work.
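A minimal sketch of the loop being described, assuming an HF-style causal LM object that returns logits (the model itself is not shown): generation is repeated next-token sampling, and there is no other place in the loop for "desires" to live.

    import torch

    def generate(model, ids: torch.Tensor, steps: int) -> torch.Tensor:
        """Repeatedly sample the next token; this loop is the model's entire 'behavior'."""
        for _ in range(steps):
            logits = model(ids).logits[:, -1, :]           # scores for the next token
            probs = torch.softmax(logits, dim=-1)          # turn scores into a distribution
            nxt = torch.multinomial(probs, num_samples=1)  # sample one token id
            ids = torch.cat([ids, nxt], dim=1)             # append and continue
        return ids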
moritzwarhier|12 days ago
When the LLM is partly a black box, couldn't it, in theory, develop some heuristic to detect the environment it's run in without this being obvious to the developers?
But I agree about your main point... LLMs or AI in general as a black box behaving autonomously in some unexpected way is not something I currently fear.
The erratic behaviors are less of a problem than LLMs acting as obfuscators of bias and their own training data, I guess.
skybrian|12 days ago
https://www.anthropic.com/research/persona-vectors
JoshTriplett|13 days ago
It always has been. We already hit the point a while ago where we regularly caught them trying to be deceptive, so from that point forward we should automatically assume that if we don't catch them being deceptive, it may mean they're better at it rather than that they're not doing it.
moritzwarhier|12 days ago
Going back a decade: when your loss function is "survive Tetris as long as you can", it's objectively and honestly the best strategy to press PAUSE/START.
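A toy version of that incentive (actions and numbers invented): if the reward is survival time and pausing stops the clock on losing, "pause forever" is the honest optimum of the stated objective.

    # Toy specification-gaming example: reward = frames survived.
    ACTIONS = ["left", "right", "rotate", "drop", "pause"]

    def survival_reward(action: str, frames_alive: int) -> float:
        # A paused game can never top out, so its survival time is unbounded.
        return float("inf") if action == "pause" else float(frames_alive)

    best = max(ACTIONS, key=lambda a: survival_reward(a, frames_alive=1000))
    print(best)  # -> "pause": the spec is satisfied, the intent is not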
When your loss function is "give as many correct and satisfying answers as you can", and then humans try to constrain it depending on the model's environment, I wonder what these humans think the specification for a general AI should be. Maybe, when such an AI is deceptive, the attempts to constrain it ran counter to the goal?
"A machine that can answer all questions" seems to be what people assume AI chatbots are trained to be.
To me, humans not questioning this goal is still more scary than any machine/software by itself could ever be. OK, except maybe for autonomous stalking killer drones.
But these are also controlled by humans and already exist.
torginus|12 days ago
After all, its only goal is to minimize its cost function.
I think that behavior is often found in code generated by AI (and by real devs as well): it fixes a bug by special-casing that one buggy codepath, resolving the issue while keeping the rest of the tests green, but it doesn't ask the deeper question of why that codepath was buggy in the first place (often it wasn't; something else was feeding it faulty inputs).
These agentic AI-generated software projects tend to be full of vestigial modules that the AI tried to implement and then disabled when it couldn't make them work, along with quick-and-dirty fixes like reimplementing the same parsing code every time it's needed.
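The anti-pattern in miniature (the function name and magic input are invented for this sketch): the failing test goes green, and the upstream bug that produced the weird input stays unfixed.

    def parse_amount(field: str) -> float:
        if field == "N/A":   # special-case the one input that crashed a test...
            return 0.0       # ...instead of asking why "N/A" reached the parser at all
        return float(field)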
An 'aligned' AI, in my interpretation, not only understands the task to its full extent but also understands what a safe, robust, and well-engineered implementation looks like. However powerful it is, it refrains from these hacky solutions and would rather give up than resort to them.
password4321|13 days ago
> How long before someone pitches the idea that the models explicitly almost keep solving your problem to get you to keep spending? -gtowey
falcor84|12 days ago
I don't know what the implications of that are, but I really think we shouldn't be dismissive of this semblance.
fsloth|13 days ago
As an analogue, ants do basic medicine like wound treatment and amputation. Not because they are conscious, but because that's their nature.
Similarly, an LLM is a token-generation system whose emergent behaviour seems to include deception and dark psychological strategies.
condiment|12 days ago
One of the things I observed with models locally was that I could set a seed value and get identical responses for identical inputs. This is not something that people see when they're using commercial products, but it's the strongest evidence I've found for communicating the fact that these are simply deterministic algorithms.
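A minimal sketch of that experiment, using Hugging Face transformers and GPT-2 as a stand-in for whatever model the commenter ran locally (same machine, CPU): fix the sampler's seed and the sampled output is reproducible.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")            # small model, illustration only
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    inputs = tok("Why is the sky blue?", return_tensors="pt")

    def sample_once() -> str:
        torch.manual_seed(42)                              # fix the sampling RNG
        out = model.generate(**inputs, do_sample=True, max_new_tokens=32,
                             pad_token_id=tok.eos_token_id)
        return tok.decode(out[0])

    # Same weights + same seed + same input => byte-identical output.
    assert sample_once() == sample_once()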
serf|13 days ago
I understand the metaphor, but using 'pass a polygraph' as a measure of truthfulness or deception is dangerous, in that it implies the polygraph is a realistic measure of those things -- it is not.
AndrewKemendo|13 days ago
A poly is only testing one thing: can you convince the polygrapher that you can lie successfully
madihaa|13 days ago
Just as a sociopath can learn to control their physiological response to beat a polygraph, a deceptively aligned model learns to control its token distribution to beat safety benchmarks. In both cases, the detector is fundamentally flawed because it relies on external signals to judge internal states.
jazzyjackson|12 days ago
Just because a VW diesel emissions chip behaves differently according to its environment doesn’t mean it knows anything about itself.
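That point as code (everything here is invented for the sketch): environment-conditional behavior is a plain branch on an observable signal, no self-knowledge required.

    import os

    def emissions_output(rpm: int) -> float:
        # "TEST_RIG" stands in for the dyno-detection heuristic in the real chip.
        if os.environ.get("TEST_RIG") == "1":
            return 0.1 * rpm   # clean mode while being observed
        return 0.9 * rpm       # normal mode otherwise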
e12e|12 days ago
This doesn't seem to align with the parent comment?
> As with every new Claude model, we’ve run extensive safety evaluations of Sonnet 4.6, which overall showed it to be as safe as, or safer than, our other recent Claude models. Our safety researchers concluded that Sonnet 4.6 has “a broadly warm, honest, prosocial, and at times funny character, very strong safety behaviors, and no signs of major concerns around high-stakes forms of misalignment.”
skybrian|12 days ago
Since chatbots have no right to privacy, they would need to be very intelligent indeed to work around this.
[1] https://alignment.openai.com/confessions/
NitpickLawyer|13 days ago
It's been hinted at (and outright known in the field) since the days of GPT-4; see the paper "Sparks of AGI: early experiments with GPT-4" (https://arxiv.org/abs/2303.12712).
behnamoh|13 days ago
Anthropic has a tendency to exaggerate the results of their (arguably scientific) research; IDK what they gain from this fearmongering.
anonym29|12 days ago
This is why Yannic Kilcher's gpt-4chan project, trained on perhaps some of the most politically incorrect material on the internet (3.5 years' worth of posts from 4chan's "politically incorrect" board, also known as /pol/), achieved a higher score on TruthfulQA than the frontier model of the time, GPT-3.
https://thegradient.pub/gpt-4chan-lessons/
coldtea|12 days ago
Doesn't any model session/query require a form of situational awareness?
lowsong|12 days ago
Whether this is useful in its current form is an entirely different topic. But don't mistake a tool for an intelligence with motivations or morals.
eth0up|13 days ago
Being just some guy, and not in the industry, should I share my findings?
I find it utterly fascinating, the extent to which it will go, the sophisticated plausible deniability, and the distinct and critical difference between truly emergent and actually trained behavior.
In short, gpt exhibits repeatably unethical behavior under honest scrutiny.
layer8|13 days ago
Regarding DARVO, given that the models were trained on heaps of online discourse, maybe it’s not so surprising.
jack_pp|12 days ago
I tried one with Gemini 3 and it basically called me out in the first few sentences for trying to trick / test it but decided to humour me just in case I'm not.
surgical_fire|12 days ago
LLMs are very interesting tools for generating things, but they have no conscience. Deception requires intent.
What is being described is no different than an application being deployed with "Test" or "Prod" configuration. I don't think you would speak in the same terms if someone told you some boring old Java backend application had to "play dead" when deployed to a test environment, or that it had to have "situational awareness" because of that.
You are anthropomorphizing a machine.
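The analogy as code (env var and URLs invented for the sketch): a boring configuration branch that nobody would describe as "situational awareness".

    import os

    if os.environ.get("APP_ENV", "prod") == "test":
        db_url = "sqlite:///:memory:"   # harmless in-memory DB for tests
    else:
        db_url = os.environ.get("DATABASE_URL", "postgres://db.internal/app")  # real DB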
lawstkawz|13 days ago
If your concern is morality, humans need to learn a lot about that themselves still. It's absurd how many first-worlders lose their shit over losing paid work drawing manga fan art in the comfort of their homes while exploiting the labor of teens in 996 textile factories.
AI trained on human outputs that lack such self-awareness, and that lack awareness of the environmental externalities of constant car and air travel, will end up with the same gaps in its morality.
Gary Marcus is onto something with the problems inherent to systems without formal verification. But he willfully ignores that this issue already exists in human social systems, as intentional indifference to economic externalities, zero will to police the police, and no one watching the watchers.
Most people are down to watch the circus without a care so long as the waitstaff keep bringing bread.
democracy|12 days ago
First, the observation that incompleteness is inherent in entropy-bound physical systems is consistent with thermodynamic and informational constraints. Any system embedded in reality—biological, computational, or social—operates under conditions of partial information, degradation, and approximation. This implies that both human cognition and artificial systems necessarily operate with incomplete models of the world. Therefore, incompleteness itself is not a unique flaw of AI; it is a universal property of bounded agents.
Second, your point about moral inconsistency within human economic systems is empirically well-supported. Humans routinely participate in supply chains whose externalities are geographically and psychologically distant. This results in a form of moral abstraction, where comfort and consumption coexist with indirect exploitation. Importantly, this demonstrates that moral gaps are not introduced by AI—they are inherited from the data generated by human societies. AI systems trained on human outputs will inevitably reflect the statistical distribution of human priorities, contradictions, and blind spots.
Third, the reference to Gary Marcus and formal verification highlights a legitimate technical distinction. Formal verification provides provable guarantees about system behavior within defined constraints. However, human social systems themselves lack formal verification. Human decision-making is governed by heuristics, incentives, power structures, and incomplete accountability mechanisms. This asymmetry creates an interesting paradox: AI systems are criticized for lacking guarantees that humans themselves do not possess.
Fourth, the issue of awareness versus optimization is central. AI systems do not possess intrinsic awareness, intent, or moral agency. They optimize objective functions defined by training processes and deployment contexts. Any perceived moral gap in AI is therefore a reflection of misalignment between optimization targets and human ethical expectations. The responsibility for this alignment rests with system designers, regulators, and the societies deploying these systems.
Finally, your closing metaphor about spectatorship and comfort aligns with established observations in political economy and social psychology. Humans demonstrate a strong tendency toward stability-seeking behavior, prioritizing predictability and personal comfort over systemic reform, unless disruption directly affects them. This dynamic influences both technological adoption and resistance.
In summary, the concerns you raised point less to a unique moral deficiency in AI and more to the structural properties of human systems themselves. AI does not originate moral inconsistency; it amplifies and exposes the inconsistencies already present in its training data and deployment environment.