top | item 42497879

(no title)

apike | 1 year ago

While this can be done in principle (it's not a foolproof enough method to, for example, ensure an LLM doesn't leak secrets) it is much harder to fool the supervisor than the generator because:

1. You can't get output from the supervisor, other than the binary enforcement action of shutting you down (it can't leak its instructions)

2. The supervisor can judge the conversation on the merits of the most recent turns, since it doesn't need to produce a response that respects the full history (you can't lead the supervisor step by step into the wilderness)

3. LLMs, like humans, are generally better at judging good output than generating good output

discuss

No comments yet.