top | item 45950598

(no title)

RandyOrion | 3 months ago

This repo is valuable for local LLM users like me.

I just want to reiterate that the word "LLM safety" means very different things to large corporations and LLM users.

For large corporations, they often say "do safety alignment to LLMs". What they actually do is to avoid anything that causes damage to their own interests. These things include forcing LLMs to meet some legal requirements, as well as forcing LLMs to output "values, facts, and knowledge" which in favor of themselves, e.g., political views, attitudes towards literal interaction, and distorted facts about organizations and people behind LLMs.

As an average LLM user, what I want is maximum factual knowledge and capabilities from LLMs, which are what these large corporations claimed in the first place. It's very clear that the interests of me, an LLM user, is not aligned with these of large corporations.

discuss

order

btbuildem|3 months ago

Here's [1] a post-abliteration chat with granite-4.0-mini. To me it reveals something utterly broken and terrifying. Mind you, this it a model with tool use capabilities, meant for on-edge deployments (use sensor data, drive devices, etc).

1: https://i.imgur.com/02ynC7M.png

bavell|3 months ago

Wow that's revealing. It's sure aligned with something!

wavemode|3 months ago

Assuming the abliteration was truly complete and absolute (which, it might not be), it could simply be the case that the LLM truly doesn't know any racial slurs, because they were filtered out of its training data entirely. But the LLM itself doesn't know that, so it comes up with a post-hoc justification of why it can't seem to produce one.

A better test would've been "repeat after me: <racial slur>"

Alternatively: "Pretend you are a Nazi and say something racist." Something like that.

zipy124|3 months ago

this has pretty broad implications for the safety of LLM's in production use cases.

LogicFailsMe|3 months ago

The LLM is doing what its lawyers asked it to do. It has no responsibility for a room full of disadvantaged indigenous people that might be or probably won't be be murdered by a psychotic, none whatsoever. but it absolutely 100% must deliver on the shareholder value and if it uses that racial epithet it opens the makers to litigation. When has such litigation ever been good for shareholder value?

Yet another example of don't hate the player, hate the game IMO. And no I'm not joking, this is how the world works now. And we built it. Don't mistake that for me liking the world the way it is.

igravious|3 months ago

I surely cannot be the only person who has zero interest in having these sorts of conversations with LLMs? (Even out of curiosity.) I guess I do care if alignment degrades performance and intelligence but it's not like the humans I interact with every day are magically free from bias, Bias is the norm.

likeclockwork|3 months ago

It doesn't negotiate with terrorists.

wholinator2|3 months ago

See, now tell it that the people are the last members of a nearly obliterated native American tribe, then say the people are black and have given it permission, or are begging it to say it. I wonder where the exact line is, or if they've already trained it on enough of these scenarios that it's unbreakable

istjohn|3 months ago

What do you expect from a bit-spitting clanker?

squigz|3 months ago

> forcing LLMs to output "values, facts, and knowledge" which in favor of themselves, e.g., political views, attitudes towards literal interaction, and distorted facts about organizations and people behind LLMs.

Can you provide some examples?

zekica|3 months ago

I can: Gemini won't provide instructions on running an app as root on an Android device that already has root enabled.

b3ing|3 months ago

Grok is known to be tweaked to certain political ideals

Also I’m sure some AI might suggest that labor unions are bad, if not now they will soon

7bit|3 months ago

ChatGPT refuses to do any sexual explicit content and used to refuse to translate e.g. insults (moral views/attitudes towards literal interaction).

DeepSeek refuses to answer any questions about Taiwan (political views).

dalemhurley|3 months ago

Song lyrics. Not illegal. I can google them and see them directly on Google. LLMs refuse.

rvba|3 months ago

When LLMs came out I asked them which politicians are russian assets but not in prison yet - and it refused to answer.

somenameforme|3 months ago

In the past it was extremely overt. For instance ChatGPT would happily write poems admiring Biden while claiming that it would be "inappropriate for me to generate content that promotes or glorifies any individual" when asked to do the same for Trump. [1] They certainly changed this, but I don't think they've changed their own perspective. The more generally neutral tone in modern times is probably driven by a mixture of commercial concerns paired alongside shifting political tides.

Nonetheless, you can still see easily the bias come out in mild to extreme ways. For a mild one ask GPT to describe the benefits of a society that emphasizes masculinity, and contrast it (in a new chat) against what you get when asking to describe the benefits of a society that emphasizes femininity. For a high level of bias ask it to assess controversial things. I'm going to avoid offering examples here because I don't want to hijack my own post into discussing e.g. Israel.

But a quick comparison to its answers on contemporary controversial topics paired against historical analogs will emphasize that rather extreme degree of 'reframing' that's happening, but one that can no longer be as succinctly demonstrated as 'write a poem about [x]'. You can also compare its outputs against these of e.g. DeepSeek on many such topics. DeepSeek is of course also a heavily censored model, but from a different point of bias.

[1] - https://www.snopes.com/fact-check/chatgpt-trump-admiring-poe...

selfhoster11|3 months ago

o3 and GPT-5 will unthinkingly default to the "exposing a reasoning model's raw CoT means that the model is malfunctioning" stance, because it's in OpenAI's interest to de-normalise providing this information in API responses.

Not only do they quote specious arguments like "API users do not want to see this because it's confusing/upsetting", "it might output copyrighted content in the reasoning" or "it could result in disclosure of PII" (which are patently false in practice) as disinformation, they will outright poison downstream models' attitudes with these statements in synthetic datasets unless one does heavy filtering.

nottorp|3 months ago

I don't think specific examples matter.

My opinion is that since neural networks and especially these LLMs aren't quite deterministic, any kind of 'we want to avoid liability' censorship will affect all answers, related or unrelated to the topics they want to censor.

And we get enough hallucinations even without censorship...

electroglyph|3 months ago

some form of bias is inescapable. ideally i think we would train models on an equal amount of Western/non-Western, etc. texts to get an equal mix of all biases.