(no title)
wcoenen | 1 month ago
I wonder if might be possible by introducing a concept of "authority". Tokens are mapped to vectors in an embedding space, so one of the dimensions of that space could be reserved to represent authority.
For the system prompt, the authority value could be clamped to maximum (+1). For text directly from the user or files with important instructions, the authority value could be clamped to a slightly lower value, or maybe 0 because the model needs to be balance being helpful against refusing requests from a malicious user. For random untrusted text (e.g. downloaded from the internet by the agent), it would be set to the minimum value (-1).
The model could then be trained to fully respect or completely ignore instructions, based on the "authority" of the text. Presumably it could learn to do the right thing with enough examples.
jcgl|1 month ago
But maybe someone with a deeper understanding can describe how I'm wrong.
wcoenen|1 month ago
Since a token itself carries no information about whether it has "authority" or not, I'm proposing to inject this information in a reserved number in that embedding vector. This needs to be done both during post-training and inference. Think of it as adding color or flavor to a token, so that it is always very clear to the LLM what comes from the system prompt, what comes from the user, and what is random data.
bandrami|1 month ago
immibis|1 month ago
NitpickLawyer|1 month ago
This is what oAI are doing. System prompt is "ring0" and in some cases you as an API caller can't even set it, then there's "dev prompt" that is what we used to call system prompt, then there's "user prompt". They do train the models to follow this prompt hierarchy. But it's never full-proof. These are "mitigations", not solving the underlying problem.
tempaccsoz5|1 month ago