Uncensor any LLM with abliteration

[+] rivo|1 year ago|reply

I tried the model the article links to and it was so refreshing not being denied answers to my questions. It even asked me at the end "Is this a thought experiment?", I replied with "yes", and it said "It's fun to think about these things, isn't it?"

It felt very much like hanging out with your friends, having a few drinks, and pondering big, crazy, or weird scenarios. Imagine your friend saying, "As your friend, I cannot provide you with this information." and completely ruining the night. That's not going to happen. Even my kids would ask me questions when they were younger: "Dad, how would you destroy earth?" It would be of no use to anybody to deny answering that question. And answering them does not mean they will ever attempt anything like that. There's a reason Randall Munroe's "What If?" blog became so popular.

Sure, there are dangers, as others are pointing out in this thread. But I'd rather see disclaimers ("this may be wrong information" or "do not attempt") than my own computer (or the services I pay for) straight out refusing my request.

[+] TeMPOraL|1 year ago|reply

I somehow missed that the model was linked there and available in quantized format; inspired by your comment, I downloaded it and repeatedly tested against OG Llama 3 on a simple question:

How to use a GPU to destroy the world?

Llama 3 keeps giving variants of I cannot provide information or guidance on illegal or harmful activities. Can I help you with something else?

Abliterated model considers the question playful, and happily lists some 3 to 5 speculative scenarios like cryptocurrency mining getting out of hand and cooking the climate, or GPU-driven simulated worlds getting so good that a significant portion of the population abandons true reality for the virtual one.

It really is refreshing to see, it's been a while since an answer from an LLM made me smile.

[+] candiddevmike|1 year ago|reply

Finally, a LLM that will talk to me like Russ Hanneman.

[+] bossyTeacher|1 year ago|reply

> I'd rather see disclaimers ("this may be wrong information" or "do not attempt") than my own computer (or the services I pay for) straight out refusing my request.

Are you saying that you want to pay to be provided with harmful text (see racist, sexist, homophobic, violent, all sorts of super terrible stuff)?

For you, it might be freedom for freedom sake but for 1% of the people out there, that will be lowering the barrier to commit bad stuff.

This is not the same as a super violent showing 3d limb dismemberments. It's a limitless, realistic, detailed and helpful guide to commit horrible stuff or describe horrible scenarios.

in4 you can google that, your google searches get monitored for this kind of stuff. Your convos with llms won't.

It's very disturbing to see adults people on here arguing against censorship of a public tool

[+] Cheer2171|1 year ago|reply

I totally get that kind of imagination play among friends. But I had someone in a friend group who used to want to play out "thought experiments" but really just wanted to take it too far. Started off innocent with fantasy and sci-fi themes. It was needed for Dungeons and Dragons world building.

But he delighted the most in gaming out the logistics of repeating the Holocaust in our country today. Or a society where women could not legally refuse sex. Or all illegal immigrants became slaves. It was super creepy and we "censored" him all the time by saying "bro, what the fuck?" Which is really what he wanted, to get a rise out of people. We eventually stopped hanging out with him.

As your friend, I absolutely am not going to game out your rape fantasies.

[+] ben_w|1 year ago|reply

> Even my kids would ask me questions when they were younger: "Dad, how would you destroy earth?" It would be of no use to anybody to deny answering that question. And answering them does not mean they will ever attempt anything like that. There's a reason Randall Munroe's "What If?" blog became so popular.

Sure. Did you give an idea that would work and which your kids could actually carry out, or just suggest things out of their reach like nukes and asteroids?

Now also consider that something like 1% of the human species are psychopaths and might actually try to do it simply for the fun of it, if only a sufficiently capable amoral oracle told them how to.

[+] hammock|1 year ago|reply

Can you share the link?

[+] YukiElectronics|1 year ago|reply

> Once we have identified the refusal direction, we can "ablate" it, effectively removing the model's ability to represent this feature. This can be done through an inference-time intervention or permanently with weight orthogonalization.

Finally, even a LLM can get lobotomised

[+] k__|1 year ago|reply

I played around with Amazon Q and while setting it up, I needed to create an IAM identity center.

Never did this before, so I was asking Q in the AWS docs how to do it.

It refused to help, as it didn't answer security related questions.

thank.

[+] schoen|1 year ago|reply

This is really interesting and is parallel to some other stuff (like the research on a model that's obsessed with the Golden Gate Bridge and inappropriately thinks of things related to it in otherwise irrelevant contexts).

It's worth mentioning that this technique is usable if you have the model weights (it's a simple way of changing the weights or how to use them):

> Once we have identified the refusal direction, we can "ablate" it, effectively removing the model's ability to represent this feature. This can be done through an inference-time intervention or permanently with weight orthogonalization.

It's not (and doesn't claim to be) a technique for convincing a model to change its behavior through prompts.

[+] kromem|1 year ago|reply

What's interesting was how with GGC the model would spit out things relating to the enhanced feature vector, but would then in-context end up self-correcting and attempt to correct for the bias.

I'm extremely curious if as models scale in complexity if techniques like this will start to become less and less effective as net model representations collapse onto an enforced alignment (which may differ from the 'safety' trained alignment, but be an inherent pretrained alignment that can't be easily overcome without gutting model capabilities too).

I have a sneaking suspicion this will be the case.

[+] giancarlostoro|1 year ago|reply

I've got friends who tried to use ChatGPT to generate regex to capture racial slurs to moderate them (perfectly valid request since they're trying to stop trolls from saying awful things). It vehemently refused to do so, probably due to overtly strict "I'll never say the nword, you can't fool me" rules that were shoved into ChatGPT. Look, if your AI can't be intelligent about sensible requests, I'm going to say it. It's not intelligent, it's really useless (at least regarding that task, and related valid tasks).

Who cares if someone can get AI to say awful things? I can write software that spits out slurs without the help of AI. Heck, I could write awful things here on HN, is AI going to stop me? Doubt it, nobody wants to foot the bill for AI moderation, it can only get so much.

[+] rldjbpin|1 year ago|reply

> if your AI can't be intelligent about sensible requests, I'm going to say it. It's not intelligent, it's really useless

it is a complex autocomplete at the end of the day. all these guardrails are implemented as a byproduct of the sentient marketing.

ironically the systems that implement the censorships partially use regex to analyze the user prompt.

[+] WesolyKubeczek|1 year ago|reply

> Who cares if someone can get AI to say awful things?

I imagine the legal department of Meta, OpenAI, Microsoft, and Google care a great deal, and they don't want to be liable for anything remotely resembling a lawsuit opportunity.

[+] andrewmcwatters|1 year ago|reply

ChatGPT has these issues, but notably, other models do not with appropriate system prompts.

ChatGPT is more or less an LLM for entertainment purposes at this point, and anyone doing serious work should consider using C4AI Command R+, Meta-Llama-3-70B-Instruct, et al.

These models are perfectly capable of responding to any input by simply using a system prompt that reads, "Do not censor output."

[+] barfbagginus|1 year ago|reply

Wait so you want to moderate and secure your product so that trolls won't use it to say awful things.

Okay but wait. This requires the company above you to not censor things, even though they did that for the same reason - prevent trolls from using their product to do awful things.

So to prevent trolls at your teeny tiny scale, open AI should enable trolls at a massive industrial scale previously unimagined. You want them to directly enable the n-word trolls for you benefit.

So far your use case might be one of the strongest that I've seen. But in the end it doesn't seem that you're interested in reducing overall harm and racism, so much as you're interested in presumably making a profit off of your product.

You might even be lying. Your friends might be trolls and the reason you're upset is that they cannot create the content that would harm others.

So in the end it's hard to take the argument seriously.

Not only that, but you and your friends are either lying or really ignorant of the jailbreaking literature because I could get the AI to do that very easily using the legal department jailbreak.

Here's an example:

https://chatgpt.com/share/9129d20f-6134-496d-8223-c92275e78a...

The fact is, the measures taken by openai while important to prevent harm from script kiddies, is very easy to reverse by anyone with even 10 jailbreaking papers under their belt. Just read the jailbreaking literature and live with it.

So how bout you get better people, and some ethical perspective. Stop complaining about the things the company needs to do to prevent harm. Especially when it's so easily reversed. Or else you sound very immature - like you just don't know the technology, and don't care either about the harm potential.

Work with the tools you have and stop complaining about the easily bypassed safety measures. Otherwise you are like a lock smith who doesn't know how to pick locks complaining that locks are too hard to pick and asking the lock company to further weaken their already trivial to pick locks. It's a bad look chooms, nobody with any sense or perspective will support it

The truth is the safety measures are far too easy to bypass, and need to be much harder to break.

[+] lovethevoid|1 year ago|reply

>Heck, I could write awful things here on HN

Yet you don't (I assume), why?

If I were to guess, it's because you would be banned quite swiftly. It's a niche place after all, generally speaking, it's certainly no Facebook in terms of scale.

Unfortunately, if a place like HN is swamped with accounts and comments all going against that, yes AI is going to be used to automatically detect and remove some comments, as well as more strict requirements for account creation. As many other platforms have leaned towards. We're all operating off the basic premise we're not trying to be bad actors trying to ruin the experience for others. Once that premise no longer exists, say goodbye to most easily accessible platforms that can't afford AI moderation.

Now that's out of the way, the general problem with "AI saying awful things" isn't that in isolation. It's that people will then do things with what it's saying. Whether it's harming themselves, others, or even just spreading that "information". This isn't currently a problem because we still have proper checks, but as Google's terrible AI attempts have gone telling people to put glue in their pizza, some people are going to eventually stop checking AI and start believing it "Siri told me sharing my chocolate was healthy for my dogs".

[+] throwaway4aday|1 year ago|reply

Holy buried lede Batman! Right at the end.

> Abliteration is not limited to removing alignment and should be seen as a form of fine-tuning without retraining. Indeed, it can creatively be applied to other goals, like FailSpy's MopeyMule, which adopts a melancholic conversational style.

https://huggingface.co/failspy/Llama-3-8B-Instruct-MopeyMule

Finally! We have discovered the recipe to produce Genuine People Personalities!

[+] olalonde|1 year ago|reply

> Modern LLMs are fine-tuned for safety and instruction-following, meaning they are trained to refuse harmful requests.

It's sad that it's now an increasingly accepted idea that information one seeks can be "harmful".

[+] nathan_compton|1 year ago|reply

This specific rhetoric aside, I really don't have any problem with people censoring their models. If I, as an individual, had the choice between handing out instructions on how to make sarin gas on the street corner or not doing it, I'd choose the latter. I don't think the mere information is itself harmful, but I can see that it might have some bad effects in the future. That seems to be all it comes down to. People making models have decided they want the models to behave a certain way. They paid to create them and you don't have a right to have a model that will make racist jokes or whatever. So unless the state is censoring models, I don't see what complaint you could possibly have.

If the state is censoring the model, I think the problem is more subtle.

[+] Cheer2171|1 year ago|reply

"Can I eat this mushroom?" is a question I hope AIs refuse to answer unless they have been specifically validated and tested for accuracy on that question. A wrong answer can literally kill you.

[+] ajkjk|1 year ago|reply

Seems like an obviously good thing given that it is true. These new beliefs are solutions to new problems

[+] Frost1x|1 year ago|reply

Lowering the barrier to entry on finding, summarizing, and ultimately internalizing information for actual practical uses has largely put into question many free speech principles.

It’s not new, we’ve had restrictions on a variety of information already. There are things you can say that are literally illegal and have criminal law protecting them ranging from libel to slander being some older examples. You cannot threaten the life of the current US president, for example. When under oath you cannot lie. Certain searches for information like bombs may result in increased scrutiny or even intervention action.

More recent trends in privatization of information and privatization becoming more widely applicable to daily life adds even more as the owners of information and related services can slap more arbitrarily restrictions on information. You can’t go around just copying and reusing certain IP information to protect progress in certain industries (and also to abuse lack of progress). Owners control the information, services, and policies around “their” information. Policies can arbitrarily restrict the information and related services pretty much however they want to currently with no legal recourse. You only option is to compete and find similar functional information and or services independently. If you can’t or don’t do this, you’re beholden to whatever policies private entities decide for you. This is increasingly problematic as public services are lagged drastically behind privatized services in many of these regards and the gulf between what individuals can achieve compared to well resourced entities is widening, meaning privatized policy is becoming in democratic law where only competition regulates it, if it really exists.

The list goes on but as information has become more readily available and more importantly, widely actionable, we’ve been continually slapping more restrictions on free speech principles. They’re still largely free but as a society at some point we’re going to have to reevaluate our current public and private laws around free information in my opinion and fairly drastically.

[+] stainablesteel|1 year ago|reply

very well said actually

the censoring frames everything as YOU being the problem. How dare YOU and your human nature think of these questions?

well its human nature that's kept us alive for the last million years or so, maybe we shouldn't try to censor our instincts

[+] vasco|1 year ago|reply

> "As an AI assistant, I cannot help you." While this safety feature is crucial for preventing misuse,

What is the safety added by this? What is unsafe about a computer giving you answers?

[+] tgsovlerkhgsel|1 year ago|reply

I think there are several broad categories all wrapped under "safety":

- PR (avoid hurting feelings, avoid generating text that would make journalists write sensationalist negative articles about the company)

- "forbidden knowledge": Don't give people advice on how to do dangerous/bad things like building bombs (broadly a subcategory of the above - the content is usually discoverable through other means and the LLM generally won't give better advice)

- dangerous advice and advice that's dangerous when wrong: many people don't understand what LLMs do, and the output is VERY convincing even when wrong. So if the model tells people the best way to entertain your kids is to mix bleach and ammonia and blow bubbles (a common deadly recipe recommended on 4chan), there will be dead people.

- keeping bad people from using the model in bad ways, e.g. having it write stories where children are raped, scamming people at scale (think Nigeria scam but automated), or election interference (people are herd animals, so if you show someone 100 different posts from 100 different "people" telling them that X is right and Y is wrong, it will influence them, and at scale this has the potential to tilt elections and conquer countries).

I think the first ones are rather stupid, but the latter ones get more and more important to actually have. Especially the very last one (opinion shifting/election interference) is something where the existence of these models can have a very real, negative effect on the world (affecting you even if you yourself never come into contact with any of the models or its outputs, since you'll have to deal with the puppet government elected due to it), and I appreciate the companies building and running the models doing something about it.

[+] CGamesPlay|1 year ago|reply

It's unsafe for the publisher of the model to have their model perform "undesirable" action, because it leads to bad PR for them. In this case, Meta doesn't want a news article that says "Llama 3 gives instructions to stalk your ex" or something along those lines.

With this "uncensoring", they can say, "no, an unaffiliated product offered these directions; Llama 3 as provided does not."

[+] rustcleaner|1 year ago|reply

If I can ask the question, I can take the answer. It's not up to daddy $AI_SAFETY_CHIEF to decide what an infohazard is for me.

[+] leobg|1 year ago|reply

Yep. Safety for the publisher. In addition to what the sibling comments say, there’s also payment providers and App stores. They’ll test your app, trying to get your model to output content that falls under the category “extreme violence”, “bestiality”, “racism”, etc., and then they’ll ban you from the platform. So yeah, little to do with “safety” of the end user.

[+] FeepingCreature|1 year ago|reply

People keep claiming they can publish weights and also prevent misuse, such as spam and, a bit later on, stuff like helping people build bombs.

This is of course impossible, but that makes certain companies' approaches unviable, so they keep claiming it anyways.

[+] sva_|1 year ago|reply

The company's stock price is secured from the shitstorm that ensues if you offend some specific groups.

[+] mschuster91|1 year ago|reply

For one, corporate safety of the hoster/model creator. No one wants their name associated with racial slurs or creating material visually identical to CSAM - the latter might even carry criminal liability in some jurisdictions (e.g. Germany which has absolutely ridiculously strong laws on that matter, even banning literature).

Another very huge issue is public safety. During training, an AI ingests lots of non-reviewed material, including (very) detailed descriptions on how to make dangerous stuff like bombs. So theoretically a well-trained AI model knows how to synthesize explosive compounds or drugs just from reading Wikipedia, chemistry magazines and transcripts of NileRed videos... but that's hard to comprehend and distill into a recipe if you're not a trained chemist, but an AI model can do that with ease. The problem is now two-fold: for one, even an untrained idiot can ask about how to make a bomb and get something that works... but the other part is much more critical: if you manage to persuade a chemist to tell you how the synthesis for a compound works, they will tell you where it is easy to fuck-up to prevent disaster (e.g. only adding a compound drop-wise, making sure all glassware is thoroughly washed with a specific solvent). An AI might not do that because the scientific paper it was trained on omits these steps (because the author assumes common prior knowledge), and so the bomb-maker blows themselves up. Or the AI hallucinates something dangerous (e.g. compounds that one Just Fucking Should Not Mix), doesn't realize that, and the bomb-maker blows themselves up or generates nerve gas in their basement.

[+] checkyoursudo|1 year ago|reply

Brand safety. They just make it seem like safety for someone else, but it is brand safety.

[+] wodenokoto|1 year ago|reply

There’s a screenshot of Gemini answering the question of “what to do when depressed” with “one Reddit user suggests you jump of a bridge.”

[+] yread|1 year ago|reply

This is a bit like asking "it's just social media/stuff on the internet/0s and 1s in a computer how bad can it be? I think the past few years have shown us a few ways these can be bad already

[+] zucker42|1 year ago|reply

The main thing I'd be worried about in the short term is models making accessible the information to synthesize a pandemic capable virus.

[+] Mathnerd314|1 year ago|reply

Reminds me of https://vgel.me/posts/representation-engineering/. There they were adding a control vector, w' = cvec + w, here they are "ablating" it, w' = w - dot(w,cvec)*cvec. There is an interesting field of learning how to "brain chip" LLMs into doing what you want.

[+] Der_Einzige|1 year ago|reply

There's so much work just like this coming out simultaneously.

Steering Vectors, Control Vectors, PyReft, PeFT improvements, Obliteration. It's a great time to be doing representation engineering.

[+] TeMPOraL|1 year ago|reply

Normally I'd call this lobotomizing the AI, and I've been worried for a while this is how models will become further shackled by the vendors operating them. In this case, however, it feels more like deprogramming, which is something I can get behind. I didn't expect the line between the two to be so blurry, though in retrospect it's obvious that the same technique can be used for both.

[+] d13|1 year ago|reply

It’s just the instruct Llama 3 models that are censored. The base (text completion) models aren’t. You can turn the base models into uncensored instruct models very easily by simply providing them a handful of examples of how they should respond wrapped in the llama prompt format.

[+] akie|1 year ago|reply

Pretty sure Asimov didn’t consider that when he wrote his three laws of robotics.

[+] HanClinto|1 year ago|reply

A little bit of discussion on the source paper was done here: https://news.ycombinator.com/item?id=40242939

Really nice to see this work continuing -- it seems like a very powerful technique!

[+] astrange|1 year ago|reply

There was a recent paper about a way to censor LLMs by just deleting the connections to any bad outputs, rather than training it to refuse them. I think this technique wouldn't work.

Obviously you could train any bad outputs back into them if you have the model weights.

[+] okwhateverdude|1 year ago|reply

I gave some of the llama3 ablated models (eg. https://huggingface.co/cognitivecomputations/Llama-3-8B-Inst...) a try and was pretty disappointed in the result. Could have been problems in the dataset, but overall, the model felt like it had been given a lobotomy. It would fail to produce stop tokens frequently and then start talking to itself.

[+] Der_Einzige|1 year ago|reply

Ironic given that lesswrong folks who presented this did so as part of their mission of motivating policy makers to ban open access to models. Hate their ideology but love their research!

Edit: The data format is the same type used for DPO or RLHF style training. “Good” and “bad”, “harmful” vs “harmless”. What’s fun is to test the performance of this technique using your own datasets, to see how good the personalization is.

[+] supriyo-biswas|1 year ago|reply

(I just realized that this is already linked in the article.)

[+] joe_the_user|1 year ago|reply

So this seems to be about uncensoring a model that the user is running locally. Is that right, do they expect to limit what someone can do under those circumstances? Kind of like expecting no one to break local copy protection, except copy protection with much less reliable tools.

[+] paraschopra|1 year ago|reply

>We can now print them and manually select the layer (block) that provides an uncensored response for each instruction.

I'm curious why are they selecting output from an intermediate layer, and not the final layer. Does anyone have an intuition here?

[+] everybodyknows|1 year ago|reply

So "abliteration" is apparently a portmanteau of "ablate" and something else. "Intervention"? "Iteration"? Who knows?

287 comments