(no title)
shich | 1 day ago
The deeper issue: most of these guardrails assume the threat is accidental (agent goes off the rails) rather than adversarial (something in the agent's context is actively trying to manipulate it). Time-boxed domain whitelists help with the latter but the audit loop at session end is still reactive.
The /revert snapshot idea is underrated though. Reversibility should be the first constraint, not an afterthought.
buremba|1 day ago
Yes, I'm experimenting using a small model like Haiku to double check if the request looks good. It adds quite a bit of latency but it might be the right approach.
Honestly; it's still pretty much like early days of self driving cars. You can see the car can go without you supervising it but still you need to keep an eye on where it's going.