PA bench: Evaluating web agents on real world personal assistant workflows

38 points | shahules | 5 days ago | vibrantlabs.com

We’re the team at Vibrant Labs (W24). We’ve been building envs for browser agents and quickly realized that existing benchmarks in this space didn’t capture the primary failure modes we were seeing in production (which became more common as the number of applications and the horizon length increased).

We built PA Bench (Personal Assistant Benchmark) to evaluate frontier computer/web use models on their ability to handle multi-step workflows across simulated clones of Gmail and Calendar.

*What’s next:*

We’re currently scaling the dataset to 3+ tabs and are building more high-fidelity simulations for common enterprise workflows. We’d love to hear feedback on the benchmark and notes about what was/wasn’t surprising about the results.

Blog post: https://vibrantlabs.com/blog/pa-bench

9 comments

mrorigo|4 days ago

I just don't get why you would want an agent to use the browser to do these mundane things (check email, work with calendar, etc.) when you can simply give it a few tools and save maybe six gazillion tokens per task?

shenberg|4 days ago

Probably to use existing enterprise apps: this approach scales for the vendor, and it's easier to sell something that works with existing software as-is than to start out by writing new custom tools.

TeMPOraL|3 days ago

Adversarial interoperability.

abhijithneil|4 days ago

Is there a way to automate computer use with multiple computer-use agents from different providers, plus some sort of routing setup so the best course of action can be chosen without hitting failures (e.g., permission issues in OpenAI could be rerouted to Gemini)?

shahules|4 days ago

There are a few agents, like browser-use and Skyvern, that may provide this capability.
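
For what it's worth, the rerouting idea in the question can be sketched independently of any framework. The snippet below is only a rough illustration, not how browser-use or Skyvern work: `Provider`, `ProviderError`, `run_with_fallback`, and the two stand-in agents are hypothetical names, and a real setup would wrap each vendor's own client behind the `run` callable.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


class ProviderError(Exception):
    """Raised by a (hypothetical) provider adapter that cannot complete a task."""


@dataclass
class Provider:
    name: str
    run: Callable[[str], str]  # takes a task description, returns a result


def run_with_fallback(task: str, providers: Sequence[Provider]) -> tuple[str, str]:
    """Try providers in order; return (provider_name, result) for the first success."""
    errors: list[str] = []
    for provider in providers:
        try:
            return provider.name, provider.run(task)
        except ProviderError as exc:
            # Record the failure and reroute the task to the next provider.
            errors.append(f"{provider.name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stand-in agents (assumptions for illustration only).
def refusing_agent(task: str) -> str:
    raise ProviderError("permission denied")


def working_agent(task: str) -> str:
    return f"done: {task}"


providers = [Provider("agent_a", refusing_agent), Provider("agent_b", working_agent)]
print(run_with_fallback("archive newsletter emails", providers))
```

This only covers ordered fallback after a failure. Picking the "best" provider up front would need some scoring or routing policy on top, which is not shown here.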