Show HN: Open Operator Evals – real-world benchmarks for LLM web agents
3 points | monoid73 | 8 months ago | github.com
It evaluates real-world tasks, like logging in, scraping dashboards, and submitting forms, using structured criteria: success rate, latency, and task reliability.
Everything is fully reproducible, with all outputs, logs, and evaluation data available.
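For anyone wondering how the three criteria could be aggregated from per-run logs, here is a minimal sketch. It is not the repo's actual code: the `Run` record, the `summarize` helper, and the definition of reliability (share of tasks that succeed on every repeated run) are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    """One agent attempt at one task (hypothetical record shape)."""
    task: str
    success: bool
    latency_s: float


def summarize(runs):
    """Aggregate success rate, mean latency, and per-task reliability."""
    # Group attempts by task so repeated runs can be compared.
    by_task = defaultdict(list)
    for run in runs:
        by_task[run.task].append(run)

    return {
        # Fraction of all attempts that succeeded.
        "success_rate": mean(1.0 if r.success else 0.0 for r in runs),
        # Average wall-clock time per attempt, in seconds.
        "mean_latency_s": mean(r.latency_s for r in runs),
        # Assumed definition: fraction of tasks that succeeded on every attempt.
        "reliability": mean(
            1.0 if all(r.success for r in attempts) else 0.0
            for attempts in by_task.values()
        ),
    }
```

Keeping reliability separate from success rate is the point of the sketch: an agent that passes a task on one attempt and fails it on the next still counts toward the success rate, but not toward reliability under this definition.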
https://github.com/nottelabs/open-operator-evals
Feedback, critiques, or contributions welcome :)
pancsta|8 months ago