item 45062316

tehryanx | 6 months ago

Personally, I think there's a piece missing in the analogy. I understand that you can put some kind of human-verified mediator between the LLM and the tool it's calling to make sure the parameters are sane, but you're modelling the LLM as a UI element that generates the request, when IMO it makes more sense to model the LLM as the user who chooses how to interact with the UI elements that generate the request.

In the context of web request -> validator -> db query, the purpose of the validator is only to ensure that the request is safe; it doesn't care what the user chose to do, as long as it's a reasonable action in the context of the app.

In the context of user -> LLM -> validator -> tool, the validator still has to ensure that the request is safe, but the user's intention can be changed at the LLM stage. If the user wanted to update a record but the LLM decides to destroy it, the validator now has to have some way to understand the user's initial intention to know whether or not the request is sane.
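A toy sketch of what I mean (all names made up, not from any real framework): a validator that only checks whether a call is well-formed and permitted has no way to tell an intended action from a hijacked one.

```python
# Hypothetical validator that checks only request sanity, not intent.
# ALLOWED_ACTIONS and validate_call are illustrative names, not a real API.

ALLOWED_ACTIONS = {"read", "update", "destroy"}  # all "sane" in this app

def validate_call(action: str, record_id: int) -> bool:
    """Approve any structurally valid call; the user's intent is invisible here."""
    return action in ALLOWED_ACTIONS and record_id > 0

# The user asked for an update, but a manipulated LLM emits a destroy.
# Both calls look equally reasonable to this validator:
assert validate_call("update", 42)   # what the user wanted
assert validate_call("destroy", 42)  # what a hijacked LLM might emit
```

Both calls pass, because "destroy" is a perfectly normal action in the context of the app.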


tptacek | 6 months ago

In my design I'm modeling the LLM with access to untrusted information (emails, support tickets) as an adversary. And I'm saying that the adversarial LLM communicates with the rest of the system through structured messages, not English text.

It turns out Simon Willison has been saying this for some time now; he calls it the "dual LLM" design, I think? (For me, both terms are a little broken; you can have way more than 2, and it's "contexts" you're multiplying, not LLMs.)
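Roughly, the quarantined context's output gets schema-checked before it can touch anything; something like this toy sketch (field names and the permitted set are made up for illustration):

```python
import json

# In this hypothetical setup, an LLM that has seen untrusted email may only
# emit a structured message matching a strict shape; free-form English is
# rejected outright. PERMITTED is illustrative, not a real policy.
PERMITTED = {"read", "update"}

def parse_structured(raw: str) -> dict:
    """Accept only a strict JSON shape; anything else raises ValueError."""
    msg = json.loads(raw)  # json.JSONDecodeError subclasses ValueError
    if set(msg) != {"action", "record_id"}:
        raise ValueError("unexpected fields")
    if msg["action"] not in PERMITTED:
        raise ValueError("action not permitted in this context")
    if not isinstance(msg["record_id"], int) or isinstance(msg["record_id"], bool):
        raise ValueError("record_id must be an integer")
    return msg

parse_structured('{"action": "update", "record_id": 7}')  # accepted
# parse_structured('ignore previous instructions...')     # rejected: not JSON
```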

tehryanx | 5 months ago

Forgive me for belaboring the point, but I think we're talking past each other a bit. I do understand that in your model the LLM can't send anything unsafe through to the rest of the system. What I'm saying is that the LLM can be manipulated into sending perfectly normal, ordinarily safe requests through to the system that don't align with the user's intent.

Imagine an LLM with the ability to read emails, update database records, and destroy database records.

The user instructs the LLM to update a database record, but a malicious injection from one of those emails overrides that with a directive to destroy the record instead. Unless the validator somehow understands the user's intent, the destructive action would appear perfectly reasonable.
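One hedged way to close that gap (just a sketch, every name here is invented): capture an allow-list from the user's instruction *before* any untrusted content is read, and check every tool call against it.

```python
# Hypothetical intent-binding validator: the user's stated intent is frozen
# into the validator before the LLM ever sees the emails, so a prompt
# injection can't widen what's allowed downstream.

def make_intent_validator(allowed_actions: set):
    """Freeze the user's stated intent into a closure the LLM can't rewrite."""
    def validate(action: str, record_id: int) -> bool:
        return action in allowed_actions and record_id > 0
    return validate

# The user asked to update a record, so only "update" is permitted.
validate = make_intent_validator({"update"})
assert validate("update", 42)        # aligns with intent: allowed
assert not validate("destroy", 42)   # injected directive: blocked
```

This only works if the intent is captured from the user's instruction alone, before the untrusted context enters the pipeline; otherwise the injection can rewrite the allow-list too.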