
1 point | draismaa | 7 months ago


We open-sourced Scenario, a tiny framework to simulate and test AI agents by using another AI agent — much like how self-driving cars are tested in controlled environments before real-world deployment.

The idea: if you wouldn’t deploy an autonomous vehicle without simulation, why would you deploy an AI agent without pressure-testing it first?

Scenario lets you:

- Describe multi-turn user flows (e.g. "book a flight and cancel it")
- Set success criteria ("user got confirmation + refund info")
- Have one agent simulate the user, testing how another agent performs the task
- Debug regressions, track failures, and iterate on behavior

It’s like writing unit tests, but for conversations.
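To make that concrete, here's a minimal toy sketch of the pattern in plain Python (this is not the Scenario API — all names are illustrative): a scripted stand-in for the user-simulator agent drives the agent under test through a book-then-cancel flow, and a simple judge checks the success criteria against the transcript.

```python
def agent_under_test(message, state):
    """Toy booking agent: confirms a booking, then a refund on cancel."""
    if "book" in message:
        state["booked"] = True
        return "Your flight is booked. Confirmation #A123."
    if "cancel" in message and state.get("booked"):
        state["booked"] = False
        return "Booking cancelled. A refund will arrive in 5 days."
    return "Sorry, I didn't understand."

def user_simulator(turn):
    """Scripted stand-in for an LLM user simulator."""
    script = ["Please book a flight to Berlin.", "Actually, cancel it."]
    return script[turn] if turn < len(script) else None

def run_scenario(agent, simulator, criteria):
    """Drive the agent with simulated user turns, then judge the transcript."""
    state, transcript = {}, []
    turn = 0
    while (msg := simulator(turn)) is not None:
        transcript.append((msg, agent(msg, state)))
        turn += 1
    # Judge: every success criterion must appear in the agent's replies.
    replies = " ".join(reply for _, reply in transcript).lower()
    return all(c in replies for c in criteria), transcript

ok, transcript = run_scenario(
    agent_under_test,
    user_simulator,
    criteria=["confirmation", "refund"],  # "user got confirmation + refund info"
)
print(ok)  # True: the agent produced both confirmation and refund info
```

In practice the scripted simulator and keyword judge would each be an LLM, which is the piece the framework provides; the test harness shape stays the same.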

Why? Traditional evals fall short when testing multi-step flows, memory handling, tool use, or goal alignment. Inspired by simulation in autonomous vehicle testing, where cars are exposed to rare, edge-case situations at scale, Scenario allows similar validation loops for AI agents.

Instead of hoping your support bot or agent workflow handles complexity… you simulate it:

It works with any agent framework you define.

Minimal setup required. There are a few starter examples in the repo: https://github.com/langwatch/scenario

We’d love feedback from the HN community:

What test patterns do you (want to) use for agent workflows?

Thanks!