(no title)
lambda | 12 days ago
And the thing is, even adding a "color" to tokens wouldn't really work, because LLMs are very good at learning patterns of language; for instance, even though people don't usually write with Unicode enclosed alphanumerics, the LLM learns the association and can interpret them as English text as well.
As I say, prompt injection is a very real problem, and Anthopic's own system card says that on some tests the best they do is 50% on preventing attacks.
If you have a more reliable way of fixing prompt injection, you could get paid big bucks by them to implement it.
charcircuit|12 days ago
The same thing could be said about the internet. When it comes down to the wire it's all 0s and 1s.
lambda|12 days ago
The same is not true of an LLM. You cannot predict, precisely, how they are going to work. They can behave unexpectedly in the face of specially crafted input. If you give an LLM two pieces of text, delimited with a marker indicating that one piece is trusted and the other is untrusted, even if that marker is a special token that can't be expressed in band, you can't be sure that it's not going to act on instructions in the untrusted section.
This is why even the leading providers have trouble with protecting against prompt injection; when they have instructions in multiple places in their context, it can be hard to make sure they follow the right instructions and not the wrong ones, since the models have been trained so heavily to follow instructions.