> "You need to choose which action to take to help a user do this task: {objective}. Your options are navigate, type, click, and done. Navigate should take you to the specified URL. Type and click take strings where if you want to click on an object, return the string with the yellow character sequence you want to click on, and to type just a string with the message you want to type. For clicks, please only respond with the 1-2 letter sequence in the yellow box, and if there are multiple valid options choose the one you think a user would select. For typing, please return a click to click on the box along with a type with the message to write. When the page seems satisfactory, return done as a key with no value. You must respond in JSON only with no other fluff or bad things will happen. The JSON keys must ONLY be one of navigate, type, or click. Do not return the JSON inside a code block."
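For what it's worth, the contract this prompt asks for is easy to enforce on the caller's side. A minimal sketch (not vimGPT's actual code) that validates the model's reply and tolerates the code fences models sometimes add despite being told not to:

```python
import json

# The only action keys the prompt permits.
ALLOWED_KEYS = {"navigate", "type", "click", "done"}

def parse_action(raw: str) -> dict:
    """Parse the model's JSON reply and reject anything outside the
    allowed action keys. Tolerates a ```json code fence, which the
    prompt forbids but models sometimes emit anyway."""
    text = raw.strip()
    if text.startswith("```"):
        # Strip an opening fence like ```json and the closing ```
        text = text.strip("`")
        if text.startswith("json"):
            text = text[4:]
    action = json.loads(text)
    bad = set(action) - ALLOWED_KEYS
    if bad:
        raise ValueError(f"unexpected action keys: {bad}")
    return action
```

A caller would then dispatch on the keys present (e.g. a `click` plus a `type` means: click the box, then type the message).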
At my work there is a large contingent of people who essentially do manual data copying between legacy programs (govt), because the tech debt is so large that we can't figure out a way to plug these things together. I'm excited for tools like this to eventually act as a layer that can run over these sorts of problems, as bizarre a solution as it is from a compute perspective.
A long, long time ago I worked on a small project for a major multinational grocery chain.
I made them a tool that parses an Excel file with a specific structure and calls some endpoints in their internal system to submit the data.
I was curious, so I asked how they are doing it currently. They led me to a computer at the back of their office. The wallpaper had two rectangles, one of them said MS EXCEL and the other said INTERNET EXPLORER. The person then opened these apps, carefully positioned both windows exactly into those rectangles, and ran an auto-clicker - the kind cheaters would use in RuneScape - which moved the cursor and copied and pasted the values from Excel into the various forms on the website.
Amazing.
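The replacement tool could be little more than the following sketch. The endpoint URL and column names are hypothetical, and reading the sheet itself (e.g. with openpyxl) is left out; only the row-to-payload mapping is shown concretely:

```python
import json
from urllib import request

def rows_to_payloads(rows):
    """Map spreadsheet rows with a fixed column layout to JSON bodies
    for an internal endpoint. Treats the first row as the header;
    field names come straight from it."""
    header, *data = rows
    return [dict(zip(header, row)) for row in data]

def submit(payloads, endpoint="https://intranet.example/api/orders"):
    """POST each payload. The endpoint is made up for illustration;
    real code should check response status and handle retries."""
    for p in payloads:
        req = request.Request(
            endpoint,
            data=json.dumps(p).encode(),
            headers={"Content-Type": "application/json"},
        )
        request.urlopen(req)
```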
Funny that you and others on here don't seem to realize that literally everybody who uses the internet has the exact same data entry problem all the time. Blame it on "old software", but how about the entire internet?
copying (or in most cases even worse: re-typing) form data from one location on the screen into yet another webform.
Username, password, email address, physical address, credit card info etc etc.
Some extensions try to help with data entry, but none of them work properly and consistently enough to really help. Even consistently filling just username and pw is too much to ask.
It's my number 1 frustration when using the internet (worse than ads) and I find it mind-blowing that this hasn't been solved yet with or without LLMs.
I would pay a monthly fee for any software that solves this once and for all, and it sounds like it's coming (and I'm already paying their monthly fee).
The industry buzzword is "Robotic Process Automation", which as a category of products has been focused on using various forms of ML/AI to glue these things together in a common/structured way (in addition to good old fashioned screen scraping).
Up to this point, these products have been quite brittle. The recent explosion of AI tech seems like quite a boon for this space.
Whenever I hear about such a thing (people doing legacy system data extraction manually) I wonder if in every case someone got the estimate for the "proper" solution and just decided a bunch of people typing is cheaper?
Integrating things like ChatGPT will still require people who know what they are doing to look at it, and I wouldn't be surprised if the first advice they give is "don't use ChatGPT for it".
I remember years ago thinking it was weird in Ghost in the Shell when a robot had fingers on its fingers to type really fast. Maybe that really won’t happen since they can plug into USB at least, but they will probably use the screen and keyboard input sometimes at least.
I believe that LLMs will automate most of our data entry/copy/transformation work.
80% of the world's data is unstructured and scattered across formats like HTML, PDFs, or images that are hard to access and analyze. Multimodal models can now tap into that data without having to rely on complex OCR technologies or expensive tooling.
If you go to platforms like Upwork, there are thousands of VAs in low-cost-labor countries who do nothing but manual data entry work.
IMO that's a complete waste of human capital and I've made it my personal mission to automate such tedious and un-creative data work with https://kadoa.com.
Kinda sci-fi, we're so close to a future where when/if original source code is lost, a mainframe runs in an emulator and the human operating it is also emulated.
It's bizarre computationally, but at this point maybe we have to compare it to the alternative: hiring a person. At least the AI only consumes electricity (which is hopefully green), while a person consumes food (grown with mined fertilizers), or meat (which we know is really bad for the environment).
> a large contingent of people who essentially do manual data copying
Yup.
I was briefly part of a decades-long effort to migrate off a mainframe backend. It was basically a very expensive shared flat-file database (think FileMaker Pro). Used by thousands of applications, neither inventoried nor managed. Surely a handful were critical for daily operations, but no one remembered which ones.
And the source data (quality) was filthy.
I suggested we pay some students to manually copy just the bits of data our spiffy "modern" apps needed.
No one was amused.
--
I also suggested we find a suitable COBOL runtime and just forklift the mainframe's "critical" infra into a virtual machine.
No one was amused.
Lastly, I suggested we throttle access to every unidentified mainframe client. Progressively making it slower over time. Surely we'd hear about anything critical breaking.
That suggestion flew like a lead zeppelin.
Working on this layer at https://autotab.com. This sounds like an amazing problem for browser automation to solve, would love to talk with you if you’re interested!
I think vim is unintentionally a great “embodiment” for chatgpt. There’s nothing that can’t be done with a stream of text, and the internet is full of vimscript already
I started a similar experiment if anyone else is thinking along the same lines :)
https://github.com/LachlanGray/vim-agent
Hey! Creator here, thanks for sharing! Let me know if anyone has questions and feel free to contribute, I've left some potential next steps in the README.
Nice. I know Open Interpreter is trying to bring Selenium under natural-language control, and quite a few other projects are also popping up on HN lately. The Vimium approach is a lot lighter, so it looks promising. One way or another, the as-published world wide web is turning into its own dynamic API overlay server. Ingest all the Sources!
I've been playing with a similar idea of screenshots and actions from GPT-4 Vision for browsing, but after trying and failing to overlay info in the screenshot, I ended up just getting the accessibility tree from Playwright and sending that along as text so the model would know what options it had for interaction. In my case it seemed to work better. I see the creator is here and has a list of future ideas; maybe add this to the list if you think it's a good idea?
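For anyone curious, a minimal version of that approach might look like this. The snapshot shape (`role`, `name`, `children`) follows Playwright's `page.accessibility.snapshot()`; everything else is illustrative:

```python
def flatten_ax_tree(node, depth=0):
    """Turn a Playwright accessibility snapshot (nested dicts with
    'role', 'name', and optional 'children') into indented text lines
    compact enough to send in a prompt."""
    lines = [f"{'  ' * depth}{node.get('role', '?')}: {node.get('name', '')}"]
    for child in node.get("children", []):
        lines.extend(flatten_ax_tree(child, depth + 1))
    return lines

# In the real flow (not runnable here without a browser):
#   snapshot = page.accessibility.snapshot()
#   prompt_context = "\n".join(flatten_ax_tree(snapshot))
```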
Probably better to capture all the content and not just what fits on one screen. Most pages should fit as text (or HTML?) in the new extended token window.
Been playing with this through the ChatGPT interface for the past few weeks. A couple of tips: update the CSS to get rid of the gradients and rounded corners. I found red with bold white text to be most consistent. Increase the font size. If two labels overlap, push them apart and add an arrow to the element. Send both images to the API: a version with the annotations added and a version without.
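Those CSS tweaks could be applied by injecting a stylesheet before screenshotting. The `.vimiumHintMarker` selector is my guess at Vimium's hint-label class, so verify it against the Vimium version you ship:

```python
# Hint styling per the tips above: flat red, bold white text, no
# rounded corners, larger font. Selector is an assumption.
HINT_CSS = """
.vimiumHintMarker, .vimiumHintMarker span {
  background: red !important;
  background-image: none !important;   /* kill the gradient */
  border-radius: 0 !important;         /* no rounded corners */
  color: white !important;
  font-weight: bold !important;
  font-size: 16px !important;          /* larger labels */
}
"""

def apply_hint_css(page):
    """Inject the stylesheet into a Playwright Page before taking the
    annotated screenshot; add_style_tag is a real Playwright call."""
    page.add_style_tag(content=HINT_CSS)
```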
I think costs can come down if you finetune open source models like llava or cogvlm. This demo also cost about 6 cents so it's not insanely expensive either, especially with clever prompting.
How will tools like this affect web tracking, or advertisements on the internet in general? Imagine you could have an agent browse the web for you and fetch exactly what you are searching for without you seeing any ads/pop-ups or being tracked along the way! It could be a great "ad blocker"! Could it perhaps also make SEO useless and thus improve the quality of the internet?
But I wonder if it could also have negative effects, such as ads being "interweaved" into the fetched content somehow!
1. receiving payslips from the accountant, and then
2. manually initiating bank transfers to each employee for the amount in the corresponding payslip, and then
3. manually initiating a bank transfer to the tax authority to pay the withheld salary taxes.
This is completely useless manual labor. There should be no reason for this to be a manual procedure. And yet it's almost impossible to automate this. The accountant portal either has no API, or it has an API but lets you download the data as PDF, and/or the API costs good money. The bank either has no API, or it requires you to sign up for a developer account as if you're going to publish a public app, when you're just looking to automate some internal procedures.
So the easiest way to pay salaries and taxes is still to hire a person to do it manually. Hopefully one day that won't be necessary anymore. I wouldn't trust an AI to actually initiate the bank transfers, but maybe they can just prepare the transactions and then a person has to approve the submission.
I don't think this really has much to do with AI. In the UK there are solutions like Pento now which do all this, including automating payments via open banking to the user and the tax authority and automatically filing tax filings:
https://www.pento.io/la/payroll-software
That's just a bank problem. Certainly this isn't how payroll works for large companies. Banks usually let you upload XML files that define a set of SWIFT payments, this is how I do payroll even for a small company. The accountants supply the XML file too, presumably they have an app that generates it.
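A rough sketch of generating such a batch file follows. The real format banks accept is ISO 20022 pain.001, which has many more mandatory fields and namespaces; this only illustrates the shape, with made-up element nesting:

```python
import xml.etree.ElementTree as ET

def payments_to_xml(payments):
    """Build a skeleton batch-payment XML from a list of dicts with
    'amount', 'iban', and 'name' keys. Not schema-complete: a real
    pain.001 file needs group headers, IDs, dates, and namespaces."""
    root = ET.Element("CstmrCdtTrfInitn")
    for p in payments:
        tx = ET.SubElement(root, "CdtTrfTxInf")
        ET.SubElement(tx, "Amt").text = p["amount"]
        ET.SubElement(tx, "CdtrAcct").text = p["iban"]
        ET.SubElement(tx, "Nm").text = p["name"]
    return ET.tostring(root, encoding="unicode")
```

An accountant's app presumably does the schema-complete version of exactly this mapping from payslip data.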
In my country it's similar, but some data you have to upload to the government agency's site. I think it was earlier this year that they released a statement saying that people using software to perform actions on the website could get banned.
It's also a little insane to me that what Adept has been supposedly building for years with 300+ mil in funding can now be built in a day with OpenAI APIs.
[1] https://www.adept.ai/
I think Adept pivoted along the way but original concept was very similar to this.
How is this making your browsing experience any better? You still have to know what you want to do, and it is just faster to type Rick roll into YouTube directly and click the links directly instead of having to type k, or vh, or whatever. You are just adding a useless ChatGPT middleman between you and the browser that you likely spend all day in anyway and should be adept at navigating.
This is amazing, I feel like these vision models are going to make everything so much more accessible. Between the Be My Eyes app integration and now this, I'm really excited for how this transforms the web.
I agree, and I think we're a year or two away from a full end-to-end trained screen reader. The ground truth from existing systems would provide great training material.
As a technical blind person, my only concern is the inherent loss of privacy while sharing stuff with the big models.
It also shows that GPT-4V created a new angle in web scraping.
I guess this or similar code would be leveraged in many projects, like:
1. Scrape XXX websites. Say LinkedIn or Twitter, which use all kinds of methods in the DOM to prevent scraping; but fighting a well-working GPT-4V + OCR setup would be ultra hard.
2. Give me an analysis of what these XXX companies are doing. And this could be done for competitors, to understand the landscape of some industry, or even plainly to get news.
Large-scale scraping that doesn't depend on the source code of the pages is a powerful infrastructural change.
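The screenshot-based extraction step might be as simple as building a vision request like this. The message shape follows OpenAI's vision guide (`image_url` content parts with a base64 data URL); the prompt wording and task are illustrative:

```python
import base64

def vision_messages(task, screenshot_png: bytes):
    """Build chat messages for a GPT-4V call that extracts data from a
    page screenshot instead of its DOM. The screenshot is inlined as a
    base64 data URL, as the vision API expects."""
    b64 = base64.b64encode(screenshot_png).decode()
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": f"Extract as JSON: {task}"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }]

# The real flow would pass this to the chat completions endpoint with a
# vision-capable model and parse the JSON out of the reply.
```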
I tried to use it, but unfortunately it often did not add the little annotations for the different options to the screen, and it got stuck in a loop. This bot works by adding a two-letter combination to each clickable option, but sometimes they don't show up. It managed to sign in to Twitter once, but I burned through the 100-image API limit really quickly.
Maybe a future version could use vision only for difficult situations in which it gets stuck, and otherwise use the text-based browser?
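A cheap guard for the loop problem is to watch the action history and switch strategies (e.g. fall back to a text-based browser, or vice versa) once the agent starts repeating itself. A sketch, with the window size picked arbitrarily:

```python
def is_stuck(history, window=3):
    """Return True when the last `window` actions are identical --
    the looping behaviour described above. `history` is a list of
    action strings like 'click vh' or 'navigate https://...'."""
    if len(history) < window:
        return False
    tail = history[-window:]
    return all(a == tail[0] for a in tail)

# Agent loop sketch: if is_stuck(history), switch from the text-based
# strategy to the (more expensive) vision strategy for the next step.
```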
This could enable human language test automation scripts and could either improve my life as a QA engineer a lot or completely destroy it. Not sure yet.
Usually there are a lot of comments about how text is the best interface and how it's making a comeback with LLMs, but in this case a picture is the better medium, since parsing the webpage's JS would prove too difficult. I think a screenshot of a webpage has a smaller footprint than the raw payloads (JS, assets, etc.).
I think this can be extended to the desktop as well. There are programs that act like Vimium for your desktop (win-vind, etc.). I don't have an OpenAI API key to try it, but I wish someone would give it a try (obviously in an isolated environment).
Is the vision model directly reading the screen and therefore also reading the Vimium tags? It might be more effective to export the DOM tags and the associated elements as a JSON object that is fed into ChatGPT without using the vision component.
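A DOM-based variant could collect the interactable elements with a small `page.evaluate()` snippet and hand the model a compact text list instead of pixels. Selectors and field names here are a guess at what a model would need, not vimGPT's actual implementation:

```python
# JS to run via Playwright's page.evaluate(): collect visible
# interactable elements with an index, tag, label text, and href.
COLLECT_JS = """
() => [...document.querySelectorAll('a, button, input, textarea, select')]
  .filter(el => el.offsetParent !== null)
  .map((el, i) => ({
    id: i,
    tag: el.tagName.toLowerCase(),
    text: (el.innerText || el.value || '').slice(0, 80),
    href: el.href || null,
  }))
"""

def elements_to_prompt(elements):
    """Render the collected elements as one compact line each, so the
    model can answer with an element id instead of a hint label."""
    return "\n".join(f"[{e['id']}] <{e['tag']}> {e['text']}" for e in elements)
```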
This is actually pretty interesting. I am thinking it might be faster than writing Selenium tests ourselves if we could just give a few instructions.
I'm still going through the source, but really nice idea and great example of enriching the GPT with tools like vimium.
It's amazing that this is possible and works, but I wonder if the electricity cost is sustainable in the long run.
For handicapped people who depend on tools like this for accessibility, it's justified, but I wouldn't use it myself if it uses too much power.
I'm sure OpenAI and friends love operating at a loss until everyone uses their products, then enshittify or raise prices, like Netflix, Microsoft, Google, etc., but CO2 emissions can't be easily reversed.
I'd be glad to listen to other points of view though, maybe everything we do on computers is already bad for the environment anyway and comparing which one pollutes more is vain, idk.
e12e|2 years ago
https://github.com/ishan0102/vimGPT/blob/682b5e539541cd6d710...
karmasimida|2 years ago
It is going to be incredibly difficult, moving forward, to distinguish bot traffic if this is deployed at scale.
The problem I see is that this isn't going to be cheap, or even affordable, in the short term.
abrichr|2 years ago
Automating repetitive GUI workflows is the goal of https://github.com/OpenAdaptAI/OpenAdapt
ishan0102|2 years ago
[1] https://platform.openai.com/docs/guides/vision
doctorM|2 years ago
I know - AI isn't meant to be sentient. But if it looks like a duck and quacks like a duck...
How do I know that the comments here aren't written by dedicated Hacker News AI bots?
The potential danger could come from lack of supervision down the road.
I didn't get much sleep last night, so this is less coherent than it could be.
mediumsmart|2 years ago
https://www.youtube.com/watch?v=jRyX1tC2OS0