top | item 38200308

Using GPT-4 Vision with Vimium to browse the web

437 points | wvoch235 | 2 years ago | github.com

128 comments


e12e|2 years ago

It's insane that this is now possible:

https://github.com/ishan0102/vimGPT/blob/682b5e539541cd6d710...

> "You need to choose which action to take to help a user do this task: {objective}. Your options are navigate, type, click, and done. Navigate should take you to the specified URL. Type and click take strings where if you want to click on an object, return the string with the yellow character sequence you want to click on, and to type just a string with the message you want to type. For clicks, please only respond with the 1-2 letter sequence in the yellow box, and if there are multiple valid options choose the one you think a user would select. For typing, please return a click to click on the box along with a type with the message to write. When the page seems satisfactory, return done as a key with no value. You must respond in JSON only with no other fluff or bad things will happen. The JSON keys must ONLY be one of navigate, type, or click. Do not return the JSON inside a code block."
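The prompt above pins the model to a small JSON action vocabulary. A minimal sketch of how a caller might validate a reply under that contract (the function name and error handling are illustrative, not from the repo):

```python
import json

# The only action keys the prompt permits.
ALLOWED_KEYS = {"navigate", "type", "click", "done"}

def parse_action(reply: str) -> dict:
    """Parse the model's JSON reply and reject unexpected action keys.

    The prompt forbids code fences, but models sometimes add them anyway,
    so a leading/trailing ``` fence is stripped defensively.
    """
    text = reply.strip()
    if text.startswith("```"):
        # Drop the opening fence (possibly "```json") and the closing fence.
        text = text.split("\n", 1)[1].rsplit("```", 1)[0]
    action = json.loads(text)
    unknown = set(action) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unexpected action keys: {unknown}")
    return action

print(parse_action('{"click": "vh", "type": "rick roll"}'))
```

A dispatcher could then map each key to a Vimium keystroke or a URL load.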

Maxion|2 years ago

The speed at which this is moving is mind-boggling. This may become crazier than the dot-com boom.

transistorfan|2 years ago

At my work there is a large contingent of people who essentially do manual data copying between legacy programs (govt), because the tech debt is so large that we can't figure out a way to plug these things together. Excited for tools like this to eventually act as a layer that can run over these sorts of problems, as bizarre a solution as it is from a compute perspective.

yreg|2 years ago

A long, long time ago I worked on a small project for a major multinational grocery chain.

I made them a tool that parses an Excel file with a specific structure and calls some endpoints in their internal system to submit the data.

I was curious, so I asked how they were doing it currently. They led me to a computer at the back of their office. The wallpaper had two rectangles, one of them labeled MS EXCEL and the other INTERNET EXPLORER. Then the person opened these apps, carefully positioned both windows exactly into those rectangles, and ran some auto-clicker (the kind cheaters would use in RuneScape) which moved the cursor and copied and pasted the values from Excel into the various forms on the website.

Amazing.

bboygravity|2 years ago

Funny that you and others on here don't seem to realize that literally everybody who uses the internet has the exact same data entry problem all the time. Blame it on "old software", but how about the entire internet?

Copying (or, in most cases even worse, re-typing) form data from one location on the screen into yet another web form.

Username, password, email address, physical address, credit card info etc etc.

Some extensions try to help with data entry, but none of them work properly and consistently enough to really help. Even consistently filling just a username and password is too much to ask.

It's my number 1 frustration when using the internet (worse than ads) and I find it mind-blowing that this hasn't been solved yet with or without LLMs.

I would pay a monthly fee for any software that solves this once and for all, and it sounds like it's coming (and I'm already paying their monthly fee).

haswell|2 years ago

The industry buzzword is "Robotic Process Automation", which as a category of products has been focused on using various forms of ML/AI to glue these things together in a common/structured way (in addition to good old fashioned screen scraping).

Up to this point, these products have been quite brittle. The recent explosion of AI tech seems like quite a boon for this space.

Roark66|2 years ago

Whenever I hear about such a thing (people doing legacy-system data extraction manually), I wonder if in every case someone got the estimate for the "proper" solution and just decided a bunch of people typing is cheaper.

Integrating things like ChatGPT will still require people who know what they are doing to look at it, and I wouldn't be surprised if the first advice they give is "don't use ChatGPT for it".

aikinai|2 years ago

I remember years ago thinking it was weird in Ghost in the Shell when a robot had fingers on its fingers to type really fast. Maybe that really won't happen, since they can at least plug into USB, but they will probably still use the screen and keyboard input sometimes.

hubraumhugo|2 years ago

I believe that LLMs will automate most of our data entry/copy/transformation work. 80% of the world's data is unstructured and scattered across formats like HTML, PDFs, or images that are hard to access and analyze. Multimodal models can now tap into that data without having to rely on complex OCR technologies or expensive tooling.

If you go to platforms like Upwork, there are thousands of VAs in low-cost-labor countries who do nothing but manual data entry work. IMO that's a complete waste of human capital, and I've made it my personal mission to automate such tedious and uncreative data work with https://kadoa.com.

morkalork|2 years ago

Kinda sci-fi, we're so close to a future where when/if original source code is lost, a mainframe runs in an emulator and the human operating it is also emulated.

FooBarWidget|2 years ago

It's bizarre computationally, but at this point maybe we have to compare it to the alternative: hiring a person. At least the AI only consumes electricity (which is hopefully green), while a person consumes food (grown with mined fertilizers), or meat (which we know is really bad for the environment).

specialist|2 years ago

> a large contingent of people who essentially do manual data copying

Yup.

I was briefly part of a decades-long effort to migrate off a mainframe backend. It was basically a very expensive shared flat-file database (like FileMaker Pro), used by thousands of applications, neither inventoried nor managed. Surely a handful were critical for daily operations, but no one remembered which ones.

And the source data (quality) was filthy.

I suggested we pay some students to manually copy just the bits of data our spiffy "modern" apps needed.

No one was amused.

--

I also suggested we find a suitable COBOL runtime and just forklift the mainframe's "critical" infra into a virtual machine.

No one was amused.

Lastly, I suggested we throttle access to every unidentified mainframe client, progressively making it slower over time. Surely we'd hear about anything critical breaking.

That suggestion flew like a lead zeppelin.

alexirobbins|2 years ago

Working on this layer at https://autotab.com. This sounds like an amazing problem for browser automation to solve; would love to talk with you if you're interested!

Garlef|2 years ago

"Chinese Room Automation"

monkeydust|2 years ago

This has been fruitful ground for RPA offerings like UiPath and Automation Anywhere. Multimodal LLMs open up a chance to disrupt them.

gumballindie|2 years ago

Wow. Leaking confidential taxpayer data.

lachlan_gray|2 years ago

I think vim is unintentionally a great “embodiment” for chatgpt. There’s nothing that can’t be done with a stream of text, and the internet is full of vimscript already

I started a similar experiment if anyone else is thinking along the same lines :)

https://github.com/LachlanGray/vim-agent

gsuuon|2 years ago

This is a neat idea!

ishan0102|2 years ago

Hey! Creator here, thanks for sharing! Let me know if anyone has questions and feel free to contribute, I've left some potential next steps in the README.

jimmySixDOF|2 years ago

Nice. I know Open Interpreter is trying to get Selenium under natural-language control, and quite a few other projects have been popping up on HN lately. The Vimium approach is a lot lighter, so it looks promising. One way or another, the as-published world wide web is turning into its own dynamic API overlay server. Ingest all the sources!

jgalentine007|2 years ago

Very cool use for Vimium, I like the approach!

squeegmeister|2 years ago

How does this differ from how ChatGPT currently browses the web?

poulpy123|2 years ago

Could it be used to make a bot that visits and parses websites to extract relevant information without writing a parser for each website?

roland35|2 years ago

what terminal are you using???

maccam912|2 years ago

I've been playing with a similar idea of screenshots and actions from GPT-4 Vision for browsing, but after trying and failing to overlay info in the screenshot, I ended up just getting the accessibility tree from Playwright and sending that along as text so the model would know what options it had for interaction. In my case it seemed to work better. I see the creator is here and has a list of future ideas; maybe add this to the list if you think it's a good idea?
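This accessibility-tree approach can be sketched roughly as follows. The snapshot call is Playwright's (`page.accessibility.snapshot()` in the sync API); the flattening helper, its indented output format, and the stand-in snapshot are my own illustration, not this commenter's actual code:

```python
def flatten_ax_tree(node, depth=0, lines=None):
    """Turn a nested accessibility snapshot into indented text lines
    that can be sent to the model as a compact description of the page."""
    if lines is None:
        lines = []
    if node:
        lines.append("  " * depth + f"{node.get('role', '?')}: {node.get('name', '')}")
        for child in node.get("children", []):
            flatten_ax_tree(child, depth + 1, lines)
    return lines

# With real Playwright, the snapshot would come from something like:
#   snapshot = page.accessibility.snapshot()
# Here we use a hand-written stand-in with the same nested shape:
snapshot = {
    "role": "WebArea", "name": "Example",
    "children": [
        {"role": "link", "name": "Sign in"},
        {"role": "textbox", "name": "Search"},
    ],
}
print("\n".join(flatten_ax_tree(snapshot)))
```

The resulting text is far smaller than a screenshot, which may explain why it worked better for this commenter.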

ishan0102|2 years ago

Cool, that's a solid idea. I was trying to only use visual data, but this could make the agent a lot more powerful. I'll try this really soon.

manmal|2 years ago

Probably better to capture all the content, not just what fits on one screen. Most pages should fit as text (or HTML?) in the new extended token window.

mackross|2 years ago

Been playing with this through the ChatGPT interface for the past few weeks. A couple of tips: update the CSS to get rid of the gradients and rounded corners. I found red with bold white text to be most consistent. Increase the font size. If two labels overlap, push them apart and add an arrow to the element. Send both images to the API: a version with the annotations added and a version without.
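The "send both images" tip might look something like this as a request builder. The message shape follows the OpenAI vision chat format (a `content` list mixing `text` and `image_url` parts); the function name and prompt are illustrative, and the API format should be checked against current docs before relying on it:

```python
import base64

def vision_messages(prompt: str, annotated_png: bytes, clean_png: bytes) -> list:
    """Build one user message carrying the instruction plus two screenshots:
    the Vimium-annotated version and the clean version, as data URLs."""
    def data_url(png: bytes) -> str:
        return "data:image/png;base64," + base64.b64encode(png).decode()

    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": data_url(annotated_png)}},
            {"type": "image_url", "image_url": {"url": data_url(clean_png)}},
        ],
    }]
```

The clean screenshot gives the model unobstructed page content; the annotated one tells it which labels are clickable.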

karmasimida|2 years ago

We can create an autopilot for the browser.

It is going to be incredibly difficult moving forward to distinguish bot traffic if this is deployed at scale.

The problem I see is that this isn't going to be cheap, or even affordable, in the short term.

ishan0102|2 years ago

I think costs can come down if you fine-tune open-source models like LLaVA or CogVLM. This demo also cost about 6 cents, so it's not insanely expensive either, especially with clever prompting.

reqo|2 years ago

How will tools like this affect web tracking or, more generally, advertisements on the internet? Imagine you could have an agent browse the web for you and fetch exactly what you are searching for without you seeing any ads/pop-ups or being tracked along the way! It could be a great "ad blocker"! Could it perhaps also make SEO useless and thus improve the quality of the internet? But I wonder if it could also have negative effects, such as the ads being "interwoven" into the fetched content somehow.

famouswaffles|2 years ago

Since this is sending screenshots of pages to GPT, won't it see the ads as well?

FooBarWidget|2 years ago

Many Dutch companies pay salaries by

1. receiving payslips from the accountant, and then

2. manually initiating bank transfers to each employee for the amount in the corresponding payslip, and then

3. manually initiating a bank transfer to the tax authority to pay the withheld salary taxes.

This is completely useless manual labor. There should be no reason for this to be a manual procedure. And yet it's almost impossible to automate. The accountant portal either has no API, or it has an API that only lets you download the data as PDF, and/or the API costs good money. The bank either has no API, or it requires you to sign up for a developer account as if you're going to publish a public app, when you're just looking to automate some internal procedures.

So the easiest way to pay salaries and taxes is still to hire a person to do it manually. Hopefully one day that won't be necessary anymore. I wouldn't trust an AI to actually initiate the bank transfers, but maybe it could just prepare the transactions, and a person would then approve the submission.

martinald|2 years ago

I don't think this really has much to do with AI. In the UK there are solutions like Pento now which do all this, including automating payments via open banking to the user and the tax authority and automatically filing tax filings:

https://www.pento.io/la/payroll-software

nvm0n2|2 years ago

That's just a bank problem. Certainly this isn't how payroll works for large companies. Banks usually let you upload XML files that define a set of SWIFT payments, this is how I do payroll even for a small company. The accountants supply the XML file too, presumably they have an app that generates it.
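For context on the XML upload this commenter describes: European banks typically accept ISO 20022 pain.001 credit-transfer batches. A heavily simplified sketch of that shape follows; a real file must follow the full pain.001 schema (namespace, group header with message IDs and control sums, payment-info attributes, BICs), so treat the element nesting here as abridged illustration only:

```python
import xml.etree.ElementTree as ET

def sepa_batch(debtor_iban: str, payments: list) -> bytes:
    """Build a toy credit-transfer batch: one debtor account, many transfers.

    NOT schema-complete pain.001; it only illustrates the
    'upload one XML file with many payments' workflow."""
    root = ET.Element("CstmrCdtTrfInitn")
    pmt = ET.SubElement(root, "PmtInf")
    # Debtor account (the company paying salaries).
    debtor_id = ET.SubElement(ET.SubElement(pmt, "DbtrAcct"), "Id")
    ET.SubElement(debtor_id, "IBAN").text = debtor_iban
    # One transaction block per employee payment.
    for name, iban, amount in payments:
        tx = ET.SubElement(pmt, "CdtTrfTxInf")
        ET.SubElement(ET.SubElement(tx, "Amt"), "InstdAmt", Ccy="EUR").text = f"{amount:.2f}"
        ET.SubElement(ET.SubElement(tx, "Cdtr"), "Nm").text = name
        acct_id = ET.SubElement(ET.SubElement(tx, "CdtrAcct"), "Id")
        ET.SubElement(acct_id, "IBAN").text = iban
    return ET.tostring(root, encoding="utf-8")

xml_bytes = sepa_batch("NL91ABNA0417164300",
                       [("A. Employee", "NL20INGB0001234567", 2500.00)])
```

Once the accountant's software emits something like this, "payroll" reduces to one upload and one approval in the bank portal.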

is_true|2 years ago

In my country it's similar, but some data you have to upload to the government agency's site. I think it was earlier this year that they released a statement saying that people using software to perform actions on the website could get banned.

snake_doc|2 years ago

Ah, very similar to Adept's[1] concept? Though their product seems not yet ready.

[1] https://www.adept.ai/

jatins|2 years ago

It's also a little insane to me that what Adept has supposedly been building for years with $300M+ in funding can now be built in a day with OpenAI APIs?

I think Adept pivoted along the way but original concept was very similar to this.

ishan0102|2 years ago

Yep, took inspiration from them and a couple other startups

karmasimida|2 years ago

This is precisely the demo I was thinking of.

dangerwill|2 years ago

How is this making your browsing experience any better? You still have to know what you want to do, and it is just faster to type "Rick roll" into YouTube directly and click the links yourself instead of having to type k, or vh, or whatever. You are just adding a useless ChatGPT middleman between you and the browser, which you likely spend all day in anyway and should be adept at navigating.

circuit10|2 years ago

It's a proof of concept for how it could do more complicated tasks

bnchrch|2 years ago

Personally, this is what I'm really excited about ChatGPT for. Data has just become a lot more free to access.

burcs|2 years ago

This is amazing, I feel like these vision models are going to make everything so much more accessible. Between the Be My Eyes app integration and now this, I'm really excited for how this transforms the web.

ctoth|2 years ago

I agree, and I think we're a year or two away from a full end-to-end trained screen reader. The ground truth from existing systems would provide great training material.

As a technical blind person, my only concern is the inherent loss of privacy while sharing stuff with the big models.

ternaus|2 years ago

Love the idea.

It also shows that GPT-4V has created a new angle in web scraping.

I guess this or similar code will be leveraged in many projects like:

1. Scraping XXX websites. Say LinkedIn or Twitter use all types of methods in the DOM to prevent it, but fighting a well-working GPT-4V + OCR setup would be ultra hard.

2. "Give me an analysis of what these XXX companies are doing." This could be done for competitors, to understand the landscape of some industry, or even plainly to get news.

Large-scale scraping that doesn't depend on the source code of the pages is a powerful infrastructural change.

sebastiennight|2 years ago

It took me a while to get what you meant, because... I'm not sure "XXX websites" usually means what you intended to convey here :)

DalasNoin|2 years ago

I tried to use it, but unfortunately it often did not add the little annotations for the different options to the screen, and it got stuck in a loop. This bot works by adding a two-letter combination to each clickable option, but sometimes they don't show up. It managed to sign in to Twitter once, but I really quickly burned through the 100-image API limit.

Maybe a future version could only use vision for difficult situations in which it gets stuck, and otherwise use the text-based browser?

comment_ran|2 years ago

It's so cool. I was wondering if we could make crawler tools much easier and better. This is more similar to the "human" way of interacting with a website.

ranulo|2 years ago

This could enable natural-language test automation scripts, and could either improve my life as a QA engineer a lot or completely destroy it. Not sure yet.

sunshadow|2 years ago

You're good until this is cheaper than your salary.

jackconsidine|2 years ago

Looks extremely cool. Trying to run it, though, I get stuck at "Getting actions for the given objective..." (using the example in the repo).

ishan0102|2 years ago

Huh, weird, I'm getting that too. OpenAI has been having periodic outages today; I think that might be why, since it was working fine earlier.

silentguy|2 years ago

Usually there are a lot of comments about how text is the best interface and how it's making a comeback with LLMs, but in this case a picture is the better medium, since parsing the webpage's JS would prove too difficult. I think a screenshot of a webpage has a smaller footprint than the raw payloads (JS, assets, etc.).

snthpy|2 years ago

Looks cool. Unfortunately I expected this to enhance my Vimium experience, but it looks like this is using Vimium to enhance GPT-4, right?

silentguy|2 years ago

I think this could be extended to the desktop as well. There are programs that act like Vimium for your desktop (win-vind, etc.). I don't have an OpenAI API key to try it, but I wish someone would give it a try (in an isolated environment, obviously).

jonathanlb|2 years ago

Hmm interesting. I'm curious what this means for accessibility and screen readers.

imranq|2 years ago

Is the vision model directly reading the screen and therefore also reading the Vimium tags? It might be more effective to export the DOM tags and the associated elements as a JSON object that is fed into ChatGPT without using the vision component.
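This DOM-to-JSON idea can be sketched with the standard library alone. In a real extension you'd walk the live DOM (e.g. via a content script or Playwright); the HTML sample, class name, and tag list here are illustrative:

```python
import json
from html.parser import HTMLParser

class ClickableCollector(HTMLParser):
    """Collect interactive elements (links, buttons, form fields) together
    with the attributes a model would need to pick a click target."""
    INTERACTIVE = {"a", "button", "input", "select", "textarea"}

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; keep them all alongside the tag.
        if tag in self.INTERACTIVE:
            self.elements.append({"tag": tag, **dict(attrs)})

page_html = '<a href="/login">Sign in</a><button id="go">Search</button>'
collector = ClickableCollector()
collector.feed(page_html)
print(json.dumps(collector.elements, indent=2))
```

The resulting JSON list could be sent as plain text, trading the vision model's screenshot cost for a much smaller prompt.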

dymk|2 years ago

> Currently the Vision API doesn't support JSON mode or function calling, so we have to rely on more primitive prompting methods.

gvv|2 years ago

Nice job! The horrors GPT-4 must endure to watch ads, truly inhumane

doctorM|2 years ago

I think this is actively dangerous. Well, not yet, but getting there.

I know, AI isn't meant to be sentient. But if it looks like a duck and quacks like a duck...

How do I know that the comments here aren't written by dedicated Hacker News AI bots?

The potential danger could come from lack of supervision down the road.

I didn't get much sleep last night, so this is less coherent than it could be.

braindead_in|2 years ago

Why not build a new browser with GPT baked in?

reustle|2 years ago

Curious, how would that differ? Assuming it is just grabbing the rendered HTML DOM after each action, isn’t it nearly the same?

owenpalmer|2 years ago

This will be fantastic for accessibility

nostrowski|2 years ago

This will be in a future history book under a chapter titled "the beginning of the end"

startages|2 years ago

There is just so much you can do with GPT-4 Vision; I just hope it becomes more affordable.

bilekas|2 years ago

This is actually pretty interesting. I am thinking it might be faster than writing up the Selenium tests ourselves if we could just give a few instructions.

I'm still going through the source, but it's a really nice idea and a great example of enriching GPT with tools like Vimium.

rpigab|2 years ago

It's amazing that this is possible and works, but I wonder if the electricity cost is sustainable in the long run.

For handicapped people who depend on tools like this for accessibility, it's justified, but I wouldn't use it myself if it uses too much power.

I'm sure OpenAI and friends love operating at a loss until everyone uses their products, then enshittify or raise prices, like Netflix, Microsoft, Google, etc., but CO2 emissions can't be easily reversed.

I'd be glad to listen to other points of view though; maybe everything we do on computers is already bad for the environment anyway, and comparing which one pollutes more is futile, idk.