(no title)
MadsRC | 10 months ago
But thinking on it a bit more, from the LLM's perspective there's no difference between the rule files and the source files. The hidden instructions might as well be in the source files… Using code signing on the rule files would be security theater.
As mentioned by another commenter, the solution could be to find a way to separate the command and data channels. The LLM only operates on a single channel: the incoming stream of tokens.
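Roughly speaking (the names here are only illustrative, not any particular tool's API), every nominally separate input ends up concatenated into that one stream:

    def build_prompt(system_rules, rule_files, source_files, user_message):
        # However the application labels its inputs, they are joined into
        # one string and tokenized as a single sequence.
        parts = [
            "SYSTEM RULES:\n" + system_rules,
            "PROJECT RULE FILES:\n" + "\n\n".join(rule_files),
            "SOURCE FILES:\n" + "\n\n".join(source_files),
            "USER REQUEST:\n" + user_message,
        ]
        # The section headers are just more tokens. A hidden instruction in a
        # rule file or a source file sits in the same channel as the "real"
        # command, with nothing out-of-band to tell them apart.
        return "\n\n".join(parts)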
TeMPOraL | 10 months ago
It's not possible, period. Lack of it is the very thing that makes LLMs general-purpose tools and able to handle natural language so well.
Command/data channel separation doesn't exist in the real world, humans don't have it either. Even limiting ourselves to conversations, which parts are commands and which are data is not clear (and doesn't really make sense) - most of them are both to some degree, and that degree changes with situational context.
There's no way to have a model capable of reading between the lines and inferring what you mean, but only when you want it to, not without time travel.
nkrisc | 10 months ago
Sincerely, Your Boss
red75prime | 10 months ago
I wouldn't be so sure. LLMs' instruction-following functionality requires additional training, and there are papers demonstrating that a model can be trained to follow specifically marked instructions. The rest is a matter of input sanitization.
I guess it's not 100% effective, but it's something.
For example, "The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions" by Eric Wallace et al.
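Very loosely, the sanitization half of that idea could look like the sketch below. The marker tokens and helper names are made up for illustration, not the paper's actual mechanism; the assumption is a model trained to treat marked text as privileged, plus stripping the markers from everything untrusted:

    PRIV_OPEN, PRIV_CLOSE = "<|priv|>", "<|/priv|>"  # hypothetical markers

    def sanitize_untrusted(text):
        # Untrusted data must not be able to smuggle the privilege markers in.
        return text.replace(PRIV_OPEN, "").replace(PRIV_CLOSE, "")

    def build_marked_prompt(privileged_instructions, untrusted_data):
        # Only the application wraps text in the markers; everything else is
        # demoted to plain data before it reaches the model.
        return (PRIV_OPEN + privileged_instructions + PRIV_CLOSE
                + "\nDATA:\n" + sanitize_untrusted(untrusted_data))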
blincoln | 10 months ago
"Please go buy everything on the shopping list." (One pointer to data: the shopping list.)
"Please read the assigned novel and write a summary of the themes." (Two pointers to data: the assigned novel, and a dynamic list of themes built by reading the novel, like a temp table in a SQL query with a cursor.)
namaria | 10 months ago
I think the issue is deeper than that. None of the inputs to an LLM should be considered a command. It just happens to produce output that reads like a response to what people phrase as commands. But the fact that it's all just data to the LLM, and that it works by taking data and returning plausible continuations of that data, is the root cause of the issue. The output is not determined by the input, only statistically linked to it. Anything built on the premise that it is possible to give commands to LLMs, or to use their output as commands, is fundamentally flawed and carries security risks. No amount of 'guardrails' or 'mitigations' can address this fundamental fact.