top | item 42807083

timabdulla | 1 year ago

Those numbers are not the full story. Note that GP specifically says: "Big jumps in benchmarks from _Claude's Computer Use_ though." Claude Computer Use was not SOTA for browser tasks at the time of its release (and still is not).

In WebArena, Operator scores 58.1%; the previous SOTA for browser-use agents was 57.1%. In WebVoyager, Operator scores 87.0%, identical to the previous SOTA.

See here for details: https://openai.com/index/computer-using-agent/


cubefox | 1 year ago

Those two were two different models (Kura and jace.ai), and one model being SOTA on one benchmark doesn't make it SOTA overall. Moreover, both are specialized for browser use, so they don't operate on raw pixels alone but can read the HTML/DOM, unlike general computer-use models, which rely only on raw screenshots.

timabdulla | 1 year ago

I think I hit all of those points in my previous post, except the fact that they are two different models, as you've noted. That said, neither of them seems to report a score for the other benchmark.