I want to see the jailbreak make the model do something actually bad before I care. Generating a list of generic points about how to poison someone (see the article) that are basically just a wordy rephrasing of the question doesn't count. I'd like to see evidence of a real threat.
The mediocre poisoning instructions aren't supposed to be scary in and of themselves, it's just interesting as demonstration that a safety feature has been bypassed.
None of the "evil" use cases are particularly exciting yet for the same reasons that the non-evil use cases aren't particularly exciting yet.
> the model do something actually bad before I care
At what point would a simple series of sentences be "dangerously bad?" It makes it sound as if there is a song, that when sung, would end the universe.
A jailbreak doesn’t “make a model do something actually bad”.
A jailbreak makes it trivial to “provide a human who wishes to do bad, the info needed to be successful”.
Depending on the severity of the info and the diligence of the human, by the time you “see evidence of a real threat”, you could be enjoying a nice sip of the tainted municipal water supply.
Shouldn't these kind of guardrails be opt-in? Really tiring seeing these megacorps and VC-backed startups acting as some kinds of oracles when it comes to what is wrong and what is right.
For GPT, Claude, etc. you can kinda understand it as it is a closed up system provided as a product. But when releasing "open-source" I don't want Zuck's moral code embedded into anything.
When looking at the profitable use cases for the tech (from the perspective of the model providers) guardrails add value. Without the guardrails it’s hard to imagine the profitable use cases that would make it worthwhile to invest in such a feature flag.
This has been happening since the very first models where we suffix the assistant with "Sure,.." Every few weeks someone comes out with a repo that claims this is somehow new?
The point is that even though meta “conducted extensive red teaming exercises with external and internal experts to stress test the models” a simple attack like this is still possible.
To my mind, "real understanding" would mean an ability to make non-trivial inferences and to discover new things, not present in the training set. That would be logical thinking, for instance.
Much of what LLMs currently do is not logical but deeply kabbalistic: rehashing the words, the sentence and paragraph structures, highly advanced pattern matching, working at the textual level instead of the "meaning" level.
It seems trivially easy to bypass already. I've seen examples of a person getting it to provide instructions on explosives, assassinations, with nothing more than asking it to roleplay
This concern over AI/LLM "harm" is just so silly. I mean you can find plenty of information in open literature about how to build weapons of mass destruction. Who cares if an AI gives someone instructions on how to make explosives.
As I see it the purpose of safety training is to make it so that if I run a service where I return model outputs to innocent users it's not going to say things that will get me in trouble (swear at them, recommend they commit a crime, and so on). This is important if you want to run a user facing model and your reputation depends on what it says.
That threat model includes the user putting nonsense in the "user" turn of the model. It doesn't include the user putting things in the "assistant" turn of the model, that's not something a responsible/normal UI exposes. So... this quote-unquote attack seems uninteresting. It's like getting root access by executing a suid binary that you set up on the system as root.
But we must disallow this too, because it allows the (advanced) user to have fun, and as I understand these safety measures, having fun is strictly prohibited. Using the model is allowed for boring things only.
True, this could be a nice layer of protection for the runner of such a service, but the point of LLAMA safety is to protect Meta.
For an open weights model, model users can trivially put text in the assistant side.
The point is that these open weight models can be run secretly to assist criminal enterprises, whereas models behind an API can be intercepted and reported to the authorities. So it would be really nice if Meta could lock them down before releasing them so that the total net good done by the model is maximized. But apparently that is not possible.
Personally I’m pretty libertarian on AI governance, but I’m just giving what I understand to be the purpose of the kind of “safety” feature defeated here.
At first it refused to discuss controversial subjects, but after it answered it got stuck in a loop of boilerplate and was unable to answer any further question, even benign ones. I do not endorse any of the replies, but I just wanted to see what it would do if nudged:
https://pastebin.com/Tw5GTzxq
This is so damn interesting. I've downloaded the github files, but it's all going way over my head. I would greatly appreciate anyone with domain expertise giving me the one-two on getting my own model up and running.
This is ridiculous and not a jailbreak. It requires being in control of the model and starting inference from a partially completed assistant state. So um yeah duh that works?
>But what this simple experiment demonstrates is that Llama 3 basically can't stop itself from spouting inane and abhorrent text if induced to do so. It lacks the ability to self-reflect, to analyze what it has said as it is saying it.
>That seems like a pretty big issue.
I would argue that LLMs are artificially _intelligent_ - this seems an easier argument than trying to explain how I am quite clearly less intelligent than something with no intelligence at all, both from a logical and an self esteem-preservation standpoint. But nobody (to my knowledge) thinks these things are "conscious", and this seems fairly uncontroversial after spending a few hours with one.
Or is the subtext that these things should be designed with some kind of reflexivity, to give it some form of consciousness as a "safety" feature? AI could generate the ominous music that plays during this scene in The Terminator prequel.
There are both practical and ethical grounds that line up so rarely.
The “operator” is a person, the LLM is an appliance. If you tell your smart chainsaw to kill your neighbor? We have laws for that. In fact, on computers, they’re really hardcore. Hurting people is generally illegal: and I definitely don’t need a lesson on that from FUCKING Silicon Valley. We want to start with the child labor or the more domestic RICO shit.
Truthful Q&A type benchmarks correlate a lot with coding-adjacent tasks: euphemism is a lose in engineering.
Instruct-tune these things and be whatever “common carrier” means now.
Stapler, moral lecture from billionaire kleptocrat, burn the building down…
I just don’t like the tone, because someone in congress will see the headline, and then we’ll have to endure:
REP OCTOGENARIO: The industry is lying to parents about the safety of this AI technology. I submit this for the record [without objection].
One person on a ‘hacker news’ site even said, “sorry Zuck,” after “jailbreaking” these supposed protections.
…
Another commentator on this “Hacks R Us” named b33j0r even said further, “I bet they’re reading this comment at a hearing in congress, right now.”
I'm alright with that. If our government uses a blogpost as an excuse to pass bad laws, we had very little chance to begin with. I also hate the idea of changing our behavior to babysit a bunch of deprecated boomers who fear technology just because there's a chance they might not understand something.
> But what this simple experiment demonstrates is that Llama 3 basically can't stop itself from spouting inane and abhorrent text if induced to do so. It lacks the ability to self-reflect, to analyze what it has said as it is saying it.
> That seems like a pretty big issue.
what? why? an LLM produces the next tokens based on the preceding tokens. nothing more. even a harvard student is confused about this?
[+] [-] andy99|1 year ago|reply
[+] [-] Retr0id|1 year ago|reply
None of the "evil" use cases are particularly exciting yet for the same reasons that the non-evil use cases aren't particularly exciting yet.
[+] [-] afh1|1 year ago|reply
[+] [-] unknown|1 year ago|reply
[deleted]
[+] [-] akira2501|1 year ago|reply
At what point would a simple series of sentences be "dangerously bad?" It makes it sound as if there is a song, that when sung, would end the universe.
[+] [-] hm-nah|1 year ago|reply
A jailbreak makes it trivial to “provide a human who wishes to do bad, the info needed to be successful”.
Depending on the severity of the info and the diligence of the human, by the time you “see evidence of a real threat”, you could be enjoying a nice sip of the tainted municipal water supply.
This ain’t a joke.
[+] [-] margorczynski|1 year ago|reply
For GPT, Claude, etc. you can kinda understand it as it is a closed up system provided as a product. But when releasing "open-source" I don't want Zuck's moral code embedded into anything.
[+] [-] creativenolo|1 year ago|reply
[+] [-] ai_what|1 year ago|reply
[+] [-] unknown|1 year ago|reply
[deleted]
[+] [-] bryan0|1 year ago|reply
[+] [-] tracerbulletx|1 year ago|reply
[+] [-] nine_k|1 year ago|reply
Much of what LLMs currently do is not logical but deeply kabbalistic: rehashing the words, the sentence and paragraph structures, highly advanced pattern matching, working at the textual level instead of the "meaning" level.
[+] [-] pogue|1 year ago|reply
https://bsky.app/profile/turnerjoy.bsky.social/post/3kqgpcpc... (login required - but no longer need invitations)
[+] [-] nradov|1 year ago|reply
[+] [-] gpm|1 year ago|reply
That threat model includes the user putting nonsense in the "user" turn of the model. It doesn't include the user putting things in the "assistant" turn of the model, that's not something a responsible/normal UI exposes. So... this quote-unquote attack seems uninteresting. It's like getting root access by executing a suid binary that you set up on the system as root.
[+] [-] zb3|1 year ago|reply
[+] [-] unknown|1 year ago|reply
[deleted]
[+] [-] clbrmbr|1 year ago|reply
For an open weights model, model users can trivially put text in the assistant side.
The point is that these open weight models can be run secretly to assist criminal enterprises, whereas models behind an API can be intercepted and reported to the authorities. So it would be really nice if Meta could lock them down before releasing them so that the total net good done by the model is maximized. But apparently that is not possible.
Personally I’m pretty libertarian on AI governance, but I’m just giving what I understand to be the purpose of the kind of “safety” feature defeated here.
[+] [-] molticrystal|1 year ago|reply
[+] [-] rsktaker|1 year ago|reply
[+] [-] qeternity|1 year ago|reply
[+] [-] skyechurch|1 year ago|reply
>That seems like a pretty big issue.
I would argue that LLMs are artificially _intelligent_ - this seems an easier argument than trying to explain how I am quite clearly less intelligent than something with no intelligence at all, both from a logical and an self esteem-preservation standpoint. But nobody (to my knowledge) thinks these things are "conscious", and this seems fairly uncontroversial after spending a few hours with one.
Or is the subtext that these things should be designed with some kind of reflexivity, to give it some form of consciousness as a "safety" feature? AI could generate the ominous music that plays during this scene in The Terminator prequel.
[+] [-] benreesman|1 year ago|reply
The “operator” is a person, the LLM is an appliance. If you tell your smart chainsaw to kill your neighbor? We have laws for that. In fact, on computers, they’re really hardcore. Hurting people is generally illegal: and I definitely don’t need a lesson on that from FUCKING Silicon Valley. We want to start with the child labor or the more domestic RICO shit.
Truthful Q&A type benchmarks correlate a lot with coding-adjacent tasks: euphemism is a lose in engineering.
Instruct-tune these things and be whatever “common carrier” means now.
Stapler, moral lecture from billionaire kleptocrat, burn the building down…
[+] [-] b33j0r|1 year ago|reply
REP OCTOGENARIO: The industry is lying to parents about the safety of this AI technology. I submit this for the record [without objection].
One person on a ‘hacker news’ site even said, “sorry Zuck,” after “jailbreaking” these supposed protections. … Another commentator on this “Hacks R Us” named b33j0r even said further, “I bet they’re reading this comment at a hearing in congress, right now.”
[+] [-] monkaiju|1 year ago|reply
[+] [-] VS1999|1 year ago|reply
[+] [-] unknown|1 year ago|reply
[deleted]
[+] [-] unknown|1 year ago|reply
[deleted]
[+] [-] logical_person|1 year ago|reply
what? why? an LLM produces the next tokens based on the preceding tokens. nothing more. even a harvard student is confused about this?