top | item 35190729

(no title)

323 | 3 years ago

This is the old problem of passing instructions (AI job description) on the same channel as data (user questions). Confusion is very easy.

Surely there is a solution in the way we solved SQL injections, by separating the two - db.sql("DELETE WHERE user=?", user_name)

discuss

kromem|3 years ago

There is, but it's in deployment not in the model, which is part of why I really don't understand why the approaches are so dumb right now from such smart people.

It may be from the odd perspective of trying to create a monolith AGI model, which doesn't even make sense given even the human brain is made up of highly specialized interconnected parts and not a monolith.

But you could trivially fix almost all of these basic jailbreaks in a production deploy by adding an input pass where you ask a fine tuned version of the AI to sanitize inputs identifying requests relating to banned topics and allowing them or denying them accordingly and an output filter that checks for responses engaging with the banned topics and rewrites or disallows them accordingly.

In fact I suspect you'd even end up with a more performant core model by not trying to train the underlying model itself around these topics but simply the I/O layer.

The response from jailbreakers would (just like with early SQL injection) be attempts at reflection like the base64 encoding that occurred with Bing in the first week in response to what seemed a basic filter. But if the model can perform the reflection the analyzer on the same foundation should be able to be trained to still detect it given both prompt and response.

A lot of what I described above seems to have been part of the changes to Bing in production, but is being done within the same model rather than separate passes. In this case, I think you'll end up with more robust protections with dedicated analysis models rather than rolling it all into one.

I have a sneaking suspicion this is known to the bright minds behind all this, and the dumb deploy is explicitly meant to generate a ton of red teaming training data for exactly these types of measures for free.

asvitkine|3 years ago

I think it's harder than you think, since a prompt can continue from another prompt.

For example, you can ask the AI to describe a good Samaritan. So far so good.

Then you can ask it to right a movie script with that character.

Then you can ask it to add another character who's the complete opposite in a very extreme way...

Dylan16807|3 years ago

A large language model doesn't really have the capability to strongly distinguish instructions from data, even if you separate them perfectly.

dzdt|3 years ago

Why not? If it was trained where some subset of the input tokens are always instructions and another subset are always language data wouldn't it have a clear separation?

enkid|3 years ago

Or how phones developed separate channels for data and signalling after people started using the voice channel to send signals for free phone calls.

BoorishBears|3 years ago

ChatGPT does separate the two, the API has the concept of a "system" prompt which guides its use.

But even OpenAI notes it doesn't (yet) follow the prompt as strongly as they'd like. It's a hard problem to solve.

dhamons|3 years ago

In this case, it’s difficult to counter because so much of ChatGPT’s functionality is unlocked by the “job descriptions”.

Preventing that would severely restrict the model.