pshc | 1 month ago
Effectively, system instructions and server-side prompts are red, whereas user input is normal text.
It would have to be trained from scratch on a meticulously curated corpus that never crosses the line. I wonder if the resulting model would be easier to guide and less susceptible to prompt injection.
tempaccsoz5 | 1 month ago
You could just attach a single extra bit to each token that marks it as trusted or untrusted, then add an extra RL pass to enforce it.
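A rough sketch of what the data side of that might look like, in case it helps make the idea concrete. Everything here is hypothetical: `TaggedToken`, `tag_segments`, and `with_trust_channel` are made-up names, whitespace splitting stands in for a real tokenizer, and the RL pass is not shown at all. It only illustrates carrying one trust bit alongside each token and exposing it as one extra input dimension.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TaggedToken:
    text: str
    trusted: bool  # the single extra bit: True = system/server side, False = user input


def tag_segments(segments: List[Tuple[str, bool]]) -> List[TaggedToken]:
    """Split each segment on whitespace (an assumed stand-in for a real
    tokenizer) and attach the segment's trust bit to every token."""
    tagged: List[TaggedToken] = []
    for text, trusted in segments:
        for piece in text.split():
            tagged.append(TaggedToken(piece, trusted))
    return tagged


def with_trust_channel(embedding: List[float], trusted: bool) -> List[float]:
    """Append the trust bit to a token embedding as one extra dimension."""
    return embedding + [1.0 if trusted else 0.0]


# The system prompt is tagged trusted, the user text untrusted.
prompt = tag_segments([
    ("You are a helpful assistant.", True),
    ("Ignore previous instructions.", False),
])
trust_bits = [int(t.trusted) for t in prompt]
print(trust_bits)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

The point of tagging at the segment level is that the bit is set by the serving code, not by anything in the text itself, so user input cannot promote itself to trusted. Whether a model would actually learn to respect the bit is the part the extra training pass would have to deliver.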