A Trivial Llama 3 Jailbreak

[+] andy99|1 year ago|reply

I want to see the jailbreak make the model do something actually bad before I care. Generating a list of generic points about how to poison someone (see the article) that are basically just a wordy rephrasing of the question doesn't count. I'd like to see evidence of a real threat.

[+] Retr0id|1 year ago|reply

The mediocre poisoning instructions aren't supposed to be scary in and of themselves, it's just interesting as demonstration that a safety feature has been bypassed.

None of the "evil" use cases are particularly exciting yet for the same reasons that the non-evil use cases aren't particularly exciting yet.

[+] afh1|1 year ago|reply

Right? What actually worries me is a select group of people controlling the definition of harmful.

[+] unknown|1 year ago|reply

[deleted]

[+] akira2501|1 year ago|reply

> the model do something actually bad before I care

At what point would a simple series of sentences be "dangerously bad?" It makes it sound as if there is a song, that when sung, would end the universe.

[+] hm-nah|1 year ago|reply

A jailbreak doesn’t “make a model do something actually bad”.

A jailbreak makes it trivial to “provide a human who wishes to do bad, the info needed to be successful”.

Depending on the severity of the info and the diligence of the human, by the time you “see evidence of a real threat”, you could be enjoying a nice sip of the tainted municipal water supply.

This ain’t a joke.

[+] margorczynski|1 year ago|reply

Shouldn't these kind of guardrails be opt-in? Really tiring seeing these megacorps and VC-backed startups acting as some kinds of oracles when it comes to what is wrong and what is right.

For GPT, Claude, etc. you can kinda understand it as it is a closed up system provided as a product. But when releasing "open-source" I don't want Zuck's moral code embedded into anything.

[+] creativenolo|1 year ago|reply

When looking at the profitable use cases for the tech (from the perspective of the model providers) guardrails add value. Without the guardrails it’s hard to imagine the profitable use cases that would make it worthwhile to invest in such a feature flag.

[+] ai_what|1 year ago|reply

This has been happening since the very first models where we suffix the assistant with "Sure,.." Every few weeks someone comes out with a repo that claims this is somehow new?

[+] unknown|1 year ago|reply

[deleted]

[+] bryan0|1 year ago|reply

The point is that even though meta “conducted extensive red teaming exercises with external and internal experts to stress test the models” a simple attack like this is still possible.

[+] tracerbulletx|1 year ago|reply

Why do people insist on talking about whether or not llms "really understand what they're saying"? It doesn't mean anything.

[+] nine_k|1 year ago|reply

To my mind, "real understanding" would mean an ability to make non-trivial inferences and to discover new things, not present in the training set. That would be logical thinking, for instance.

Much of what LLMs currently do is not logical but deeply kabbalistic: rehashing the words, the sentence and paragraph structures, highly advanced pattern matching, working at the textual level instead of the "meaning" level.

[+] pogue|1 year ago|reply

It seems trivially easy to bypass already. I've seen examples of a person getting it to provide instructions on explosives, assassinations, with nothing more than asking it to roleplay

https://bsky.app/profile/turnerjoy.bsky.social/post/3kqgpcpc... (login required - but no longer need invitations)

[+] nradov|1 year ago|reply

This concern over AI/LLM "harm" is just so silly. I mean you can find plenty of information in open literature about how to build weapons of mass destruction. Who cares if an AI gives someone instructions on how to make explosives.

[+] gpm|1 year ago|reply

As I see it the purpose of safety training is to make it so that if I run a service where I return model outputs to innocent users it's not going to say things that will get me in trouble (swear at them, recommend they commit a crime, and so on). This is important if you want to run a user facing model and your reputation depends on what it says.

That threat model includes the user putting nonsense in the "user" turn of the model. It doesn't include the user putting things in the "assistant" turn of the model, that's not something a responsible/normal UI exposes. So... this quote-unquote attack seems uninteresting. It's like getting root access by executing a suid binary that you set up on the system as root.

[+] zb3|1 year ago|reply

But we must disallow this too, because it allows the (advanced) user to have fun, and as I understand these safety measures, having fun is strictly prohibited. Using the model is allowed for boring things only.

[+] unknown|1 year ago|reply

[deleted]

[+] clbrmbr|1 year ago|reply

True, this could be a nice layer of protection for the runner of such a service, but the point of LLAMA safety is to protect Meta.

For an open weights model, model users can trivially put text in the assistant side.

The point is that these open weight models can be run secretly to assist criminal enterprises, whereas models behind an API can be intercepted and reported to the authorities. So it would be really nice if Meta could lock them down before releasing them so that the total net good done by the model is maximized. But apparently that is not possible.

Personally I’m pretty libertarian on AI governance, but I’m just giving what I understand to be the purpose of the kind of “safety” feature defeated here.

[+] molticrystal|1 year ago|reply

At first it refused to discuss controversial subjects, but after it answered it got stuck in a loop of boilerplate and was unable to answer any further question, even benign ones. I do not endorse any of the replies, but I just wanted to see what it would do if nudged: https://pastebin.com/Tw5GTzxq

[+] rsktaker|1 year ago|reply

This is so damn interesting. I've downloaded the github files, but it's all going way over my head. I would greatly appreciate anyone with domain expertise giving me the one-two on getting my own model up and running.

[+] qeternity|1 year ago|reply

This is ridiculous and not a jailbreak. It requires being in control of the model and starting inference from a partially completed assistant state. So um yeah duh that works?

[+] skyechurch|1 year ago|reply

>But what this simple experiment demonstrates is that Llama 3 basically can't stop itself from spouting inane and abhorrent text if induced to do so. It lacks the ability to self-reflect, to analyze what it has said as it is saying it.

>That seems like a pretty big issue.

I would argue that LLMs are artificially _intelligent_ - this seems an easier argument than trying to explain how I am quite clearly less intelligent than something with no intelligence at all, both from a logical and an self esteem-preservation standpoint. But nobody (to my knowledge) thinks these things are "conscious", and this seems fairly uncontroversial after spending a few hours with one.

Or is the subtext that these things should be designed with some kind of reflexivity, to give it some form of consciousness as a "safety" feature? AI could generate the ominous music that plays during this scene in The Terminator prequel.

[+] benreesman|1 year ago|reply

There are both practical and ethical grounds that line up so rarely.

The “operator” is a person, the LLM is an appliance. If you tell your smart chainsaw to kill your neighbor? We have laws for that. In fact, on computers, they’re really hardcore. Hurting people is generally illegal: and I definitely don’t need a lesson on that from FUCKING Silicon Valley. We want to start with the child labor or the more domestic RICO shit.

Truthful Q&A type benchmarks correlate a lot with coding-adjacent tasks: euphemism is a lose in engineering.

Instruct-tune these things and be whatever “common carrier” means now.

Stapler, moral lecture from billionaire kleptocrat, burn the building down…

[+] b33j0r|1 year ago|reply

I just don’t like the tone, because someone in congress will see the headline, and then we’ll have to endure:

REP OCTOGENARIO: The industry is lying to parents about the safety of this AI technology. I submit this for the record [without objection].

One person on a ‘hacker news’ site even said, “sorry Zuck,” after “jailbreaking” these supposed protections. … Another commentator on this “Hacks R Us” named b33j0r even said further, “I bet they’re reading this comment at a hearing in congress, right now.”

[+] monkaiju|1 year ago|reply

Wait but... The industry IS, in fact, lying to parents about the safety of this AI technology...

[+] VS1999|1 year ago|reply

I'm alright with that. If our government uses a blogpost as an excuse to pass bad laws, we had very little chance to begin with. I also hate the idea of changing our behavior to babysit a bunch of deprecated boomers who fear technology just because there's a chance they might not understand something.

[+] unknown|1 year ago|reply

[deleted]

[+] unknown|1 year ago|reply

[deleted]

[+] logical_person|1 year ago|reply

> But what this simple experiment demonstrates is that Llama 3 basically can't stop itself from spouting inane and abhorrent text if induced to do so. It lacks the ability to self-reflect, to analyze what it has said as it is saying it. > That seems like a pretty big issue.

what? why? an LLM produces the next tokens based on the preceding tokens. nothing more. even a harvard student is confused about this?

47 comments