top | item 45507936

Gemini 2.5 Computer Use model

636 points | mfiguiere | 5 months ago | blog.google

325 comments

[+] xnx|5 months ago|reply
I've had good success with the Chrome devtools MCP (https://github.com/ChromeDevTools/chrome-devtools-mcp) for browser automation with Gemini CLI, so I'm guessing this model will work even better.
[+] arkmm|5 months ago|reply
What sorts of automations were you able to get working with the Chrome dev tools MCP?
[+] informal007|5 months ago|reply
The computer-use model comes from the demand to interact with computers automatically; the Chrome DevTools MCP might be one of the core drivers.
[+] iLoveOncall|5 months ago|reply
This has absolutely nothing in common with a model for computer use... That uses pre-defined tools provided in Google's MCP server, nothing to do with a general model that's supposed to work with any software.
[+] phamilton|5 months ago|reply
It successfully got through the captcha at https://www.google.com/recaptcha/api2/demo
[+] jampa|5 months ago|reply
The automation is powered through Browserbase, which has a captcha solver. (Whether it is automated or human, I don't know.)
[+] subarctic|5 months ago|reply
Now we just need something to solve captchas for us when we're browsing normally
[+] siva7|5 months ago|reply
Probably because its IP is coming from Google's own subnet.
[+] mohsen1|5 months ago|reply
> Solve today's Wordle

It gets stuck with:

> ...the task is just to "solve today's Wordle", and as a web browsing robot, I cannot actually see the colors of the letters after a guess to make subsequent guesses. I can enter a word, but I cannot interpret the feedback (green, yellow, gray letters) to solve the puzzle.
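For reference, the feedback loop the model says it cannot close is trivial to compute once the colors are machine-readable. A toy sketch of standard Wordle scoring (nothing from the demo itself, just the well-known two-pass rule):

```python
# Toy sketch of the green/yellow/gray feedback the model says it cannot read.
def score_guess(guess: str, answer: str) -> str:
    """Return per-letter feedback: G=green, Y=yellow, -=gray."""
    feedback = ["-"] * 5
    remaining = []  # answer letters not matched exactly

    # First pass: exact-position (green) matches.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            feedback[i] = "G"
        else:
            remaining.append(a)

    # Second pass: misplaced (yellow) letters, consuming each answer letter once
    # so duplicate letters in the guess are not over-counted.
    for i, g in enumerate(guess):
        if feedback[i] == "-" and g in remaining:
            feedback[i] = "Y"
            remaining.remove(g)

    return "".join(feedback)

print(score_guess("crane", "cigar"))  # -> GYY--
```

A vision agent that could read this string back after each guess would have everything it needs to play on.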

[+] jcims|5 months ago|reply
(Just using the browserbase demo)

Knowing it's technically possible is one thing, but giving it a short command and seeing it go log in to a site, scroll around, reply to posts, etc. is eerie.

Also, it tied me at Wordle today, making the same mistake I did on the second-to-last guess. Too bad you can't talk to it while it's working.

[+] krawcu|5 months ago|reply
I wonder how it would behave in a scenario where it has to download a file from a shady website covered in those advertisements with fake "download" buttons.
[+] albert_e|5 months ago|reply
I believe it will need very capable but small VLMs that understand common User Interfaces very well -- small enough to run locally -- paired with any other higher level models on the cloud, to achieve human-speed interactions and beyond with reliability.
[+] derekcheng08|5 months ago|reply
Really feels like computer use models may be vertical agent killers once they get good enough. Many knowledge work domains boil down to: use a web app, send an email. (e.g. recruiting, sales outreach)
[+] loandbehold|5 months ago|reply
Why do you need an agent to use a web app through the UI? Can't the agent be integrated into the web app natively? IMO, for the verticals you mentioned, the missing piece is for an agent to be able to make phone calls.
[+] dekhn|5 months ago|reply
Many years ago I was sitting at a red light on a secondary road, where the primary cross road was idle. It seemed like you could solve this using a computer vision camera system that watched the primary road and when it was idle, would expedite the secondary road's green light.

This was long before computer vision was mature enough to do anything like that and I found out that instead, there are magnetic systems that can detect cars passing over - trivial hardware and software - and I concluded that my approach was just far too complicated and expensive.

Similarly, when I look at computers, I typically want the ML/AI system to operate on a structured data that is codified for computer use. But I guess the world is complicated enough and computers got fast enough that having an AI look at a computer screen and move/click a mouse makes sense.

[+] dekhn|5 months ago|reply
I just have to say that I consider this an absolutely hilarious outcome. For many years, I focused on tech solutions that eliminated the need for a human to be in front of a computer doing tedious manual operations. For a wide range of activities, I proposed we focus on "turning everything in the world into database objects" so that computers could operate on them with minimal human effort. I spent significant effort in machine learning to achieve this.

It didn't really occur to me that you could just train a computer to work directly on the semi-structured human world data (display screen buffer) through a human interface (mouse + keyboard).

However, I fully support it (like all the other crazy ideas on the web that beat out the "theoretically better" approaches). I do not think it is unrealistic to expect that within a decade, we could have computer systems that can open Chrome, start a video chat with somebody, go back and forth for a while to achieve a task, then hang up... without the person on the other end ever knowing they were dealing with a computer instead of a human.

[+] TeMPOraL|5 months ago|reply
AI is succeeding where "theoretically better" approaches failed, because it addresses the underlying social problem. The computing ecosystem is an adversarial place, not a cooperative one. The reason we can't automate most of the tedium is by design - it's critical to how almost all money is made on the Internet. Can't monetize users when they automate your upsell channels and ad exposure away.
[+] NothingAboutAny|5 months ago|reply
I saw similar discussions around robotics, people asking "why are they making the robots humanoid? couldn't they be a more efficient shape?" It comes back to the same thing: if you want the tool to be adopted, it has to fit into a human-centric world, no matter how inefficient that is. High-performance applications are still custom-designed and streamlined, but mass adoption requires it to fit us, not us to fit it.
[+] deegles|5 months ago|reply
> computer systems that can open chrome, start a video chat with somebody, go back and forth for a while to achieve a task, then hang up...

all the pieces are there, though I suspect the first to implement this will be scammers and spear phishers.

[+] omkar_savant|5 months ago|reply
Hey - I'm on the team that launched this. Please let me know if you have any questions!
[+] sumedh|5 months ago|reply
I am on https://gemini.browserbase.com/ and just clicked the use case mentioned on the site: "Go to Hacker News and find the most controversial post from today, then read the top 3 comments and summarize the debate."

It did not work; multiple times it just got stuck after going to Hacker News.

[+] bonoboTP|5 months ago|reply
It's a bit funny that I give Google Gemini a task and then it goes on the Google Search site and it gets stuck in the captcha tarpit that's supposed to block unwanted bots. But I guess Google Gemini shouldn't be unwanted for Google. Can't you ask the search team to whitelist the Gemini bot?
[+] SoKamil|5 months ago|reply
How are you going to deal with reCAPTCHA and ad impressions? Sounds like a conflict of interest.
[+] Awesomedonut|5 months ago|reply
Really cool stuff! Any interesting challenges the team ran into while developing it?
[+] realty_geek|5 months ago|reply
Absolutely hilarious how it gets stuck trying to solve captcha each time. I had to explicitly tell it not to go to google first.

In the end I did manage to get it to play the housepriceguess game:

https://www.youtube.com/watch?v=nqYLhGyBOnM

I think I'll make that my equivalent of Simon Willison's "pelican riding a bicycle" test. It is fairly simple to explain but seems to trip up different LLMs in different ways.

[+] CuriouslyC|5 months ago|reply
I feel like screenshots should be the last thing you reach for. There's a whole universe of data from accessibility subsystems.
[+] bonoboTP|5 months ago|reply
The rendered visual layout is designed in a way to be spatially organized perceptually to make sense. It's a bit like PDFs. I imagine that the underlying hierarchy tree can be quite messy and spaghetti, so your best bet is to use it in the form that the devs intended and tested it for.

I think screenshots are a really good and robust idea. It bothers the more structured-minded people, but apps are often not built so well. They are built until the point that it looks fine and people are able to use it. I'm pretty sure people who rely on accessibility systems have lots of complaints about this.
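In miniature, the structured route being debated here amounts to harvesting actionable elements from markup instead of from pixels. A toy stdlib sketch (a real agent stack would query the browser's accessibility tree, not hand-parse HTML):

```python
# Toy illustration of the structured alternative to screenshots:
# extract clickable elements directly from markup.
from html.parser import HTMLParser

class ActionableElements(HTMLParser):
    """Collect tags a UI agent could act on, with their attributes."""
    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        if tag in ("a", "button", "input"):
            self.elements.append((tag, dict(attrs)))

page = '<a href="/item?id=1">comments</a><button id="vote">upvote</button>'
parser = ActionableElements()
parser.feed(page)
print(parser.elements)
# -> [('a', {'href': '/item?id=1'}), ('button', {'id': 'vote'})]
```

The catch, as the parent notes, is that real-world markup is often far messier than this, which is exactly why pixels can end up being the more robust signal.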

[+] ekelsen|5 months ago|reply
And all sorts of situations where they don't work. When they do work it's great, but if they don't and you rely on them, you have nothing.
[+] iAMkenough|5 months ago|reply
Not great at Google Sheets. Repeatedly overwrites all previous columns while trying to populate new columns.

> I am back in the Google Sheet. I previously typed "Zip Code" in F1, but it looks like I selected cell A1 and typed "A". I need to correct that first. I'll re-type "Zip Code" in F1 and clear A1. It seems I clicked A1 (y=219, x=72) then F1 (y=219, x=469) and typed "Zip Code", but then maybe clicked A1 again.

[+] omkar_savant|5 months ago|reply
Could you share your prompt? We'll look into this one
[+] whinvik|5 months ago|reply
My general experience has been that Gemini is pretty bad at tool calling. The recent Gemini 2.5 Flash release actually fixed some of those issues but this one is Gemini 2.5 Pro with no indication about tool calling improvements.
[+] skc|5 months ago|reply
How likely is it that the endgame becomes that we stop writing apps for actual human users, and instead sites become massive walls of minified text against a black screen?
[+] fauigerzigerk|5 months ago|reply
If some functionality isn't used directly by humans why not expose it as an API?

If you're asking how likely it is that all human-computer interaction will take place via lengthy natural language conversations then my guess is no.

Visualising information and pointing at things is just too useful to replace it with what is essentially a smart command line interface.

[+] barrenko|5 months ago|reply
Hopefully we get entirely off the internet.
[+] password54321|5 months ago|reply
It is all just data. It doesn't need to be rendered to become input.
[+] peytoncasper|5 months ago|reply
Actually, a few startups are working on this! You should check out the Stytch isAgent SDK.

We’re partnering with them on Web Bot Auth

[+] AaronAPU|5 months ago|reply
I’m looking forward to a desktop OS optimized version so it can do the QA that I have no time for!
[+] enjoylife|5 months ago|reply
> It is not yet optimized for desktop OS-level control

Alas, AGI is not yet here. But I feel like if this OS-level control were good enough, and the cost of the LLM in the loop weren't bad, maybe that would be enough to kick-start something akin to AGI.

[+] pseidemann|5 months ago|reply
Funny thing is, most humans cannot properly control a computer. Intelligence seems to be impossible to define.
[+] alganet|5 months ago|reply
I am curious. Why do you think controlling an OS (and not just a browser) would be a move towards AGI?
[+] cryptoz|5 months ago|reply
Computer Use models are going to ruin simple honeypot form fields meant to detect bots :(
[+] jebronie|5 months ago|reply
I just tried to submit a contact form with it. It successfully solved the ReCaptcha but failed to fill in a required field and got stuck. We're safe.
[+] layman51|5 months ago|reply
You mean the ones where people add a question that is like "What is 10+3?"
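The hidden-field variety mentioned upthread is a one-line server-side check; a minimal sketch (the "website" field name is hypothetical, rendered with display:none so humans never fill it in):

```python
# Classic honeypot check: a CSS-hidden form field that humans never see.
# Naive form bots fill every field; a vision-driven agent never "sees" it
# either, which is why such agents defeat this kind of bot detection.

def looks_like_bot(form_data: dict) -> bool:
    """Flag the submission if the invisible honeypot field was filled in."""
    return bool(form_data.get("website", "").strip())

print(looks_like_bot({"name": "Ada", "website": ""}))                    # -> False
print(looks_like_bot({"name": "Ada", "website": "http://spam.example"}))  # -> True
```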
[+] martinald|5 months ago|reply
Interesting, seems to use 'pure' vision and x/y coords for clicking stuff. Most other browser automation with LLMs I've seen uses the dom/accessibility tree which absolutely churns through context, but is much more 'accurate' at clicking stuff because it can use the exact text/elements in a selector.

Unfortunately it really struggled in the demos for me. It took nearly 18 attempts to click the comment link on the HN demo, each a few pixels off.
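A hypothetical sketch of what the pure-vision click path might look like: the model emits coordinates on a normalized grid (the 0-999 grid is an assumption on my part, not documented behavior), and the harness scales them to the actual viewport before clicking, so any model error shows up as the few-pixels-off misses described above.

```python
# Hypothetical denormalization step for a vision-only clicking harness.
# Assumption: the model outputs coordinates on a 0-999 grid regardless of
# the real screen size; integer math keeps the scaling exact.

def to_pixels(norm_x: int, norm_y: int, width: int, height: int) -> tuple[int, int]:
    """Map model-space coordinates (0-999) to viewport pixel coordinates."""
    return (norm_x * width // 1000, norm_y * height // 1000)

# A "click the comments link" action on a 1280x800 viewport:
print(to_pixels(500, 120, 1280, 800))  # -> (640, 96)
```

By contrast, a DOM selector like `a.subtext` either matches or it doesn't; there is no "a few pixels off" failure mode, which is the accuracy trade-off described above.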

[+] pbhjpbhj|5 months ago|reply
18 attempts - emulating the human HN experience when using mobile. Well, assuming it hit other links it didn't intend to anyway. /jk