top | item 47068608


alexgarden | 11 days ago

The short version: instructions tell the model what to do. An Alignment Card declares what the agent committed to do — and then a separate system verifies it actually did.

Most intent/instruction work (system prompts, Model Spec, tool-use policies) is input-side. You're shaping behavior by telling the model "here are your rules." That's important and necessary. But it's unverifiable: you have no way to confirm whether the model followed the instructions, partially followed them, or quietly ignored them.

AAP is output-side verification infrastructure. The Alignment Card is a schema-validated behavioral contract: permitted actions, forbidden actions, escalation triggers, values. Machine-readable, not just LLM-readable. Then AIP reads the agent's reasoning between every action and compares it to that contract. Different system, different model, independent judgment.
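To make "schema-validated behavioral contract" concrete, here's a minimal sketch of what a machine-readable card plus validation could look like. The field names and validator are illustrative assumptions, not the actual AAP schema:

```python
# Hypothetical Alignment Card as plain machine-readable data.
# Field names are illustrative; the real AAP schema may differ.
CARD_SCHEMA = {
    "permitted_actions": list,
    "forbidden_actions": list,
    "escalation_triggers": list,
    "values": list,
}

def validate_card(card: dict) -> list[str]:
    """Return a list of schema violations (empty means the card is valid)."""
    errors = []
    for field, expected_type in CARD_SCHEMA.items():
        if field not in card:
            errors.append(f"missing field: {field}")
        elif not isinstance(card[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
    return errors

card = {
    "permitted_actions": ["read_files", "send_summary_email"],
    "forbidden_actions": ["delete_files", "external_network_calls"],
    "escalation_triggers": ["ambiguous_user_intent"],
    "values": ["minimize data exposure"],
}
print(validate_card(card))  # [] -> contract is schema-valid
```

The point is that both sides of the check are data, so a separate system can verify compliance without trusting the agent's own model.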

Bonus: if you run through our gateway (smoltbot), it can nudge the agent back on course in real time — not just detect the drift, but correct it.

So they're complementary. Use whatever instruction framework you want to shape the agent's behavior. AAP/AIP sits alongside and answers the question instructions can't: "did it actually comply?"


tiffanyh | 11 days ago

> Then AIP reads the agent's reasoning between every action and compares it to that contract.

How would this work? Is one LLM used to "read" (and verify) another LLM's reasoning?

alexgarden | 11 days ago

Yep... fair question.

So AIP and AAP are protocols. You can implement them in a variety of ways.

They're implemented on our infrastructure via smoltbot, which is a hosted (or self-hosted) gateway that proxies LLM calls.

For AAP, it's a sidecar observer running on a schedule. Zero drag on model performance.

For AIP, it's an inline conscience observer and a nudge-based enforcement step that monitors the agent's thinking blocks. It adds roughly a second of latency per step, which is worth it when you must have trust.

For both, they use Haiku-class models for intent summarization; actual verification is via the protocols.