PA Bench: Evaluating web agents on real-world personal assistant workflows
38 points | shahules | 5 days ago | vibrantlabs.com
We built PA Bench (Personal Assistant Benchmark) to evaluate frontier computer/web use models on their ability to handle multi-step workflows across simulated clones of Gmail and Calendar.
*What’s next:*
We’re currently scaling the dataset to 3+ tabs and are building more high-fidelity simulations for common enterprise workflows. We’d love to hear feedback on the benchmark and notes about what was/wasn’t surprising about the results.
Blog post: https://vibrantlabs.com/blog/pa-bench
mrorigo|4 days ago
shenberg|4 days ago
TeMPOraL|3 days ago
abhijithneil|4 days ago
shahules|4 days ago
AIorNot|4 days ago
https://news.ycombinator.com/item?id=47125014
maybe this benchmark will be conquered far faster than expected
shahules|4 days ago
shahules|4 days ago
[deleted]
MidasTools|4 days ago
[deleted]