This has absolutely nothing in common with a model for computer use... It uses pre-defined tools provided in Google's MCP server, nothing to do with a general model that is supposed to work with any software.
Impressively, it also quickly passed levels 1 (checkbox) and 2 (stop sign) on http://neal.fun/not-a-robot, and got most of the way through level 3 (wiggly text).
I wonder how it would behave in a scenario where it has to download some file from a shady website covered in those advertisements with fake "download" buttons.
I believe it will need very capable but small VLMs that understand common user interfaces very well -- small enough to run locally -- paired with higher-level models in the cloud, to achieve human-speed interactions and beyond, reliably.
Really feels like computer use models may be vertical agent killers once they get good enough. Many knowledge work domains boil down to: use a web app, send an email. (e.g. recruiting, sales outreach)
Why do you need an agent to use a web app through the UI? Can't the agent be integrated into the web app natively? IMO, for the verticals you mentioned, the missing piece is an agent that can make phone calls.
Many years ago I was sitting at a red light on a secondary road, where the primary cross road was idle. It seemed like you could solve this using a computer vision camera system that watched the primary road and when it was idle, would expedite the secondary road's green light.
This was long before computer vision was mature enough to do anything like that and I found out that instead, there are magnetic systems that can detect cars passing over - trivial hardware and software - and I concluded that my approach was just far too complicated and expensive.
Similarly, when I look at computers, I typically want the ML/AI system to operate on structured data that is codified for computer use. But I guess the world is complicated enough, and computers got fast enough, that having an AI look at a computer screen and move/click a mouse makes sense.
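The demand-actuated control a loop detector enables really is trivial software; a minimal sketch of the phase decision, with the detector interface and the idle threshold invented for illustration:

```python
# Sketch of demand-actuated signal logic: give the secondary road a green
# once the primary road has been empty for a while and a car is waiting.
# The detector inputs and the 5-second threshold are made up.

def next_phase(primary_idle_seconds: float,
               car_waiting_on_secondary: bool,
               idle_threshold: float = 5.0) -> str:
    """Decide which road gets the green on the next cycle."""
    if car_waiting_on_secondary and primary_idle_seconds >= idle_threshold:
        return "secondary_green"
    return "primary_green"

assert next_phase(10, True) == "secondary_green"
assert next_phase(2, True) == "primary_green"    # primary still busy
assert next_phase(10, False) == "primary_green"  # nobody waiting
```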
I just have to say that I consider this an absolutely hilarious outcome. For many years, I focused on tech solutions that eliminated the need for a human to be in front of a computer doing tedious manual operations. For a wide range of activities, I proposed we focus on "turning everything in the world into database objects" so that computers could operate on them with minimal human effort. I spent significant effort in machine learning to achieve this.
It didn't really occur to me that you could just train a computer to work directly on the semi-structured human world data (display screen buffer) through a human interface (mouse + keyboard).
However, I fully support it (like all the other crazy ideas on the web that beat out the "theoretically better" approaches). I do not think it is unrealistic to expect that within a decade, we could have computer systems that can open Chrome, start a video chat with somebody, go back and forth for a while to achieve a task, then hang up... without the person on the other end ever knowing they were dealing with a computer instead of a human.
AI is succeeding where "theoretically better" approaches failed, because it addresses the underlying social problem. The computing ecosystem is an adversarial place, not a cooperative one. The reason we can't automate most of the tedium is by design - it's critical to how almost all money is made on the Internet. Can't monetize users when they automate your upsell channels and ad exposure away.
I saw similar discussions around robotics, people saying "why are they making the robots humanoid? couldn't they be a more efficient shape?" and it comes back to the same thing: if you want the tool to be adopted, it has to fit into a human-centric world, no matter how inefficient that is.
High-performance applications are still always custom-designed and streamlined, but mass adoption requires the tool to fit us, not us to fit it.
I am on https://gemini.browserbase.com/ and just clicked the use case suggested on the site: "Go to Hacker News and find the most controversial post from today, then read the top 3 comments and summarize the debate."
It did not work; multiple times it just got stuck after going to Hacker News.
It's a bit funny that I give Google Gemini a task, it goes to the Google Search site, and it gets stuck in the captcha tarpit that's supposed to block unwanted bots. But I guess Google Gemini shouldn't be unwanted for Google. Can't you ask the search team to whitelist the Gemini bot?
The rendered visual layout is designed to be spatially organized so it makes sense perceptually. It's a bit like PDFs. I imagine that the underlying hierarchy tree can be quite messy and spaghetti-like, so your best bet is to use it in the form the devs intended and tested it for.
I think screenshots are a really good and robust idea. It bothers the more structured-minded people, but apps are often not built so well. They are built until the point that it looks fine and people are able to use it. I'm pretty sure people who rely on accessibility systems have lots of complaints about this.
Not great at Google Sheets. Repeatedly overwrites all previous columns while trying to populate new columns.
> I am back in the Google Sheet. I previously typed "Zip Code" in F1, but it looks like I selected cell A1 and typed "A". I need to correct that first. I'll re-type "Zip Code" in F1 and clear A1. It seems I clicked A1 (y=219, x=72) then F1 (y=219, x=469) and typed "Zip Code", but then maybe clicked A1 again.
My general experience has been that Gemini is pretty bad at tool calling. The recent Gemini 2.5 Flash release actually fixed some of those issues but this one is Gemini 2.5 Pro with no indication about tool calling improvements.
How likely is it that the end game becomes that we stop writing apps for actual human users, and sites instead become massive walls of minified text against a black screen?
> It is not yet optimized for desktop OS-level control
Alas, AGI is not yet here. But I feel like if this OS-level of control was good enough, and the cost of the LLM in the loop wasn't bad, maybe that would be enough to kick start something akin to AGI.
Interesting, seems to use 'pure' vision and x/y coords for clicking stuff. Most other browser automation with LLMs I've seen uses the DOM/accessibility tree, which absolutely churns through context but is much more 'accurate' at clicking stuff because it can use the exact text/elements in a selector.
Unfortunately it really struggled in the demos for me. It took nearly 18 attempts to click the comment link on the HN demo, each a few pixels off.
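The trade-off described here can be sketched with a toy page model (element names, roles, and coordinates are all invented): a semantic selector survives a layout shift, while remembered pixel coordinates go stale:

```python
# Toy contrast between the two targeting strategies the comment describes.

PAGE_V1 = [  # (role, name, x, y)
    ("link", "past", 100, 10),
    ("link", "comments", 160, 10),
]
PAGE_V2 = [  # the same page after a small layout shift
    ("link", "past", 100, 14),
    ("link", "comments", 172, 14),
]

def click_by_selector(page, role, name):
    # DOM/accessibility-tree style: locate the element by its semantics.
    for r, n, x, y in page:
        if r == role and n == name:
            return (x, y)
    return None

def click_by_coords(page, x, y, tolerance=2):
    # Vision style: reuse remembered pixel coordinates; misses if layout moved.
    for r, n, ex, ey in page:
        if abs(ex - x) <= tolerance and abs(ey - y) <= tolerance:
            return n
    return None

# The selector survives the layout shift...
assert click_by_selector(PAGE_V2, "link", "comments") == (172, 14)
# ...while coordinates remembered from V1 now hit nothing.
assert click_by_coords(PAGE_V2, 160, 10) is None
```

A vision model sidesteps the messy tree at the cost of exactly the few-pixels-off misses described above.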
mohsen1 | 5 months ago
It gets stuck with:
> ...the task is just to "solve today's Wordle", and as a web browsing robot, I cannot actually see the colors of the letters after a guess to make subsequent guesses. I can enter a word, but I cannot interpret the feedback (green, yellow, gray letters) to solve the puzzle.
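For reference, the feedback the model says it cannot perceive is mechanical once the answer is known; a minimal sketch of Wordle's scoring rule, with duplicate letters handled by counting:

```python
# Wordle-style feedback: "green" for a letter in place, "yellow" for a
# letter present elsewhere in the answer (respecting letter counts),
# "gray" otherwise.
from collections import Counter

def wordle_feedback(guess: str, answer: str) -> list[str]:
    feedback = ["gray"] * len(guess)
    # Count answer letters that are not exact matches; these are the
    # only letters available to turn a guess letter "yellow".
    remaining = Counter(a for g, a in zip(guess, answer) if g != a)
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            feedback[i] = "green"
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g != a and remaining[g] > 0:
            feedback[i] = "yellow"
            remaining[g] -= 1
    return feedback

print(wordle_feedback("eerie", "crate"))
# -> ['gray', 'gray', 'yellow', 'gray', 'green']
```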
jcims | 5 months ago
Knowing it's technically possible is one thing, but giving it a short command and seeing it go log in to a site, scroll around, reply to posts, etc. is eerie.
Also, it tied me at Wordle today, making the same mistake I did on the second-to-last guess. Too bad you can't talk to it while it's working.
deegles | 5 months ago
All the pieces are there, though I suspect the first to implement this will be scammers and spear phishers.
ramoz | 5 months ago
Obviously much harder with a UI than with agent events similar to the ones below.
https://docs.claude.com/en/docs/claude-code/hooks
https://google.github.io/adk-docs/callbacks/
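Neither of the linked APIs is reproduced here; this is just the common shape of such hook/callback mechanisms -- callbacks that can observe or veto an agent's tool calls -- in a toy Python form:

```python
# Generic pre/post tool-call hook pattern (NOT the actual Claude Code or
# ADK API): pre-hooks may veto a call, post-hooks observe results.

def make_agent(tools, pre_hooks=None, post_hooks=None):
    pre_hooks = pre_hooks or []
    post_hooks = post_hooks or []

    def call_tool(name, args):
        for hook in pre_hooks:
            if hook(name, args) == "deny":
                return {"error": f"blocked by hook: {name}"}
        result = tools[name](**args)
        for hook in post_hooks:
            hook(name, args, result)
        return result

    return call_tool

# Example: block shell commands, log everything else.
log = []
agent = make_agent(
    tools={"add": lambda a, b: a + b, "shell": lambda cmd: "..."},
    pre_hooks=[lambda name, args: "deny" if name == "shell" else "allow"],
    post_hooks=[lambda name, args, result: log.append((name, result))],
)

assert agent("add", {"a": 2, "b": 3}) == 5
assert agent("shell", {"cmd": "rm -rf /"}) == {"error": "blocked by hook: shell"}
assert log == [("add", 5)]
```

With a screen-and-mouse agent there is no equivalent interception point, which is the difficulty the comment is pointing at.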
realty_geek | 5 months ago
In the end I did manage to get it to play the housepriceguess game:
https://www.youtube.com/watch?v=nqYLhGyBOnM
I think I'll make that my equivalent of Simon Willison's "pelican riding a bicycle" test. It is fairly simple to explain but seems to trip up different LLMs in different ways.
fauigerzigerk | 5 months ago
If you're asking whether all human-computer interaction will end up taking place via lengthy natural-language conversations, then my guess is no.
Visualising information and pointing at things is just too useful to replace with what is essentially a smart command-line interface.
peytoncasper | 5 months ago
We’re partnering with them on Web Bot Auth.