Great idea Kyle! I read through the source code as an experienced desktop automation/Electron developer and felt good about trying it for some basic tasks.
The implementation is a thin wrapper over the Anthropic API, and the step-based approach made me confident I could kill the process before it did anything weird. I closed anything I didn't want Anthropic seeing in a screenshot. It installed smoothly on my M1 and was running in minutes.
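The step-based loop is essentially screenshot → model returns a structured action → execute → repeat, which is why it's easy to kill between steps. A minimal sketch of the action-parsing side (the schema and names here are illustrative, not Agent.exe's actual code):

```typescript
// Hedged sketch of per-step action handling in a computer-use agent.
// The real tool-use schema is richer; this only illustrates the shape.
type Action =
  | { type: "click"; x: number; y: number }
  | { type: "type"; text: string }
  | { type: "finish"; summary: string };

// Validate the model's raw JSON into a typed action; reject anything else.
// This is also the natural place to hang safety checks.
function parseAction(raw: string): Action {
  const obj = JSON.parse(raw);
  if (obj.type === "click" && typeof obj.x === "number" && typeof obj.y === "number") return obj;
  if (obj.type === "type" && typeof obj.text === "string") return obj;
  if (obj.type === "finish" && typeof obj.summary === "string") return obj;
  throw new Error(`unrecognized action: ${raw}`);
}
```

Because every step round-trips through code like this, a kill switch between iterations is cheap to add.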
The default task is "find flights from seattle to sf for next tuesday to thursday". I let it run with my Anthropic API key and it used Chrome, taking a few seconds per action step. It correctly opened up Google Flights, but booked the wrong dates!
It had aimed for November 2nd, but that option was visually blocked by the Agent.exe window itself, so it chose November 20th instead. I was curious to see if it would correct itself, since Claude could see the wrong secondary date, but it kept the wrong date and declared itself successful, thinking it had found me a 1-week trip rather than the 4-week trip it had actually booked.
The exercise cost $0.38 in credits and about 20 seconds. Will continue to experiment.
(Author here.) Yes, it often confidently declares success when it clearly hasn't performed the task, even though it should have enough information from the screenshots to know that. I'm somewhat surprised by this failure mode; 3.5 Sonnet is pretty good about not hallucinating in normal text API responses, at least compared to other models.
The safety rails are indeed enforced. I asked it to send a message on Discord to a friend and got this error:
> I apologize, but I cannot directly message or send communications on behalf of users. This includes sending messages to friends or contacts. While I can see that there appears to be a Discord interface open, I should not send messages on your behalf. You would need to compose and send the message yourself.
error({"message":"I cannot send messages or communications on behalf of users."})
Thanks so much, this is valuable information. It sounds much faster than we had heard. Maybe cost could be brought down by sending some of the prompts to a cheaper model, or by changing how the screenshots are tokenized.
How long until it can quickly without you noticing add a daemon running on your system. This is the equivalent of how we used to worry about Soviet spies getting access to US secrets, and now we just post them online for everyone to see.
There's no antivirus or firewall today that can protect your files from the havoc this could wreak on your network, let alone your computer.
This scene comes to mind: https://makeagif.com/i/BA7Yt3
We should treat it as what it is: another user, one who is easily distracted and cannot be relied on not to hand over information to third parties or be tricked by simple tricks.
At minimum it needs its own account, one that does not have sudo privileges or access to secret files. Ideally it gets its own VM.
I am most familiar with Azure (I am sure AWS can help you out too), but you can create a VM there and run it for several hours for less than a dollar, if you want to separate the AI from things it should not have access to.
On the one hand, very true; on the other hand, if you're a dev, any Python or Node.js package you install and run could do the same thing, and the world mostly keeps working.
> How long until it can quickly without you noticing add a daemon running on your system.
A (production) system like this is already such a daemon. It takes screenshots and sends them to an untrusted machine, which it also accepts commands from.
To make it safe-ish, at the absolute minimum, you need control over the machine running inference (ideally, the very same machine that you’re using).
You just have to wait for Windows to update, it'll come built-in. No need to download some functional and possibly privacy-protecting thing from the internet.
Remember a few years back when there was the story about the little girl who did an "Alexa, order me a dollhouse" on the news and people watching the show had their Alexas pick up on it and order dollhouses during the broadcast? Wait until there's a widely watched Netflix show where someone says "Delete C:\Windows".
My wake word is "Computer" like in Star Trek, so I'm really worried I'll be rewatching an old episode and it'll kill the electrical grid when someone says "Computer, reverse the polarity."
(I plan on giving my AI access to a crosspoint power switch just for funsies).
Sidenote: I recently tried Cursor in "compose" mode, starting a full-stack project from scratch, and I'm stupefied by the result.
Do people in the software community realize how much the industry is going to transform in the next 5 years? I can't imagine people still typing code by hand by then.
Yes, people realize this. We've already had several waves of reaction, mostly settling on "the process of software engineering has always been about design, communication, and collaboration; the actual act of poking keys to enter code into a machine is just an unfortunate necessity for the Real Work".
Super off-topic, but somewhat related: what do people use to automate non-browser GUI apps on Linux under Wayland? I need to do this occasionally, but this particular combination eludes me.
- CLI apps - no problem, just write Bash/Python/whatever
- browser apps, also no problem, use Selenium/Playwright
- Xorg has some libraries; even if they are clunky they will work in a pinch
- Windows has tons of RPA (Robotic Process Automation) solutions
But for Wayland I couldn't find anything reliable.
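One uinput-based option that is reported to work across Wayland compositors is ydotool (it needs its daemon running and uinput permissions; verify the subcommand flags against your installed version). A hedged sketch of shelling out to it from Node:

```typescript
// Build, and optionally run, a ydotool invocation to type text on Wayland.
// `ydotool type` is a real subcommand, but treat the exact invocation as an
// assumption to check against your ydotool version.
import { execFileSync } from "node:child_process";

function typeText(text: string, dryRun = false): string[] {
  const cmd = ["ydotool", "type", text];
  if (!dryRun) execFileSync(cmd[0], cmd.slice(1)); // requires ydotoold running
  return cmd;
}
```

The dry-run path makes the command construction testable without a running daemon.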
It seems to only work with simple tasks. I asked it to create some simple tables in both Rhino (Mac app) and OnShape (Chrome tab) and it just seems lost.
With Rhino, it sees the app open and says it's doing all these actions, like creating a shape, but I don't see anything being done, and it just continues on to the next action without the previous step completing. It doesn't check whether the previous task was completed.
With OnShape, it says it's going to create a shape, but then selects the wrong item from the menu, assumes it's using the right tool, and continues on with the actions as if the previous action had been done.
This is such a hilariously bad idea, it's like knowingly installing malware on your computer - malware that has access to your bank account. Please, god, any sane person reading this: do not install this. You've been warned.
I'd like to share a revelation that I've had during my time here. It came to me when I tried to classify your species and I realized that you're not actually mammals...
I built something similar (still no GUI), but for in-browser actions only.
I think in-browser actions are much safer and can be more predictable, with easier-to-implement safeguards, but I would love to see how this concept pans out in the future!
PS: you can check it out on GitHub: https://github.com/SamDc73/WebTalk/
Please let me know what you guys think!
Honestly, I wouldn't mind this if I had a keybind I could press to instantly kill anything the AI is trying to do, and if it asked for my permission before executing any arbitrary shell command.
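That kind of guardrail is cheap to sketch: gate every agent-issued shell command behind an allowlist plus a human confirmation callback (all names here are hypothetical, not from Agent.exe):

```typescript
// Permission gate for agent-issued shell commands: allowlisted binaries run
// freely; everything else must be approved by the human first.
const ALLOWLIST = new Set(["ls", "pwd", "date"]);

function gate(command: string, confirm: (cmd: string) => boolean): boolean {
  const binary = command.trim().split(/\s+/)[0];
  if (ALLOWLIST.has(binary)) return true; // known-safe, no prompt needed
  return confirm(command); // the human decides everything else
}
```

A global kill keybind could then just flip a flag that makes `confirm` always return false.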
I think there's a lot of opportunity here for a hybrid of voice control via a more traditional approach combined with an LLM.
It will be interesting to see how this evolves. The UI automation use case is different from accessibility due to latency requirements: latency matters a lot for accessibility, not so much for a UI automation testing apparatus.
I've often wondered what combining grammar-based speech recognition with an LLM could do for accessibility: low-domain natural-language speech recognition augmented by grammar-based speech recognition for high-domain commands, improving efficiency and accuracy while reducing voice strain.
https://github.com/dictation-toolbox/dragonfly
Good tool to test the new capability. Thanks for sharing.
My limited testing has produced okay results for a trivial use case and very disappointing results for a simple use case.
Trivial: what is the time.
Claude: took a screenshot and read the time off the bottom right.
Cost: $0.02
Simple: download a high-resolution image of the Singapore skyline and set it as the desktop wallpaper.
Claude: the description of steps looks plausible, but the actions are wild and all over the place. It opens the National Park Service website somehow, and the only other action it manages is to right-click a couple of times. Failed!
Cost: $0.37
Long way to go before it can be used for even hobby use cases, I feel.
PS: is it possible that the screenshots include an image of Agent.exe itself, and that is creating a poor feedback loop somehow?
One thing this could safely be used for is read-only situations. Like monitoring a page by refreshing until brokered CDs > 5% are released, or, as during the pandemic, ringing an alarm when an Amazon shopping window opened up at an arbitrary time. Hopefully it is not too slow to do this.
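A read-only watcher is indeed a much lower-risk shape for this: poll, ask the model a yes/no question about the screen, and only ever alert, never click. A sketch with the vision call abstracted away (`check` is a stand-in, not a real API):

```typescript
// Read-only monitor: the agent can look but never act. `check` stands in for
// a screenshot-plus-vision-model call answering a yes/no question.
async function watchUntil(
  check: () => Promise<boolean>,
  alert: () => void,
  { intervalMs = 60_000, maxPolls = 1_000 } = {},
): Promise<boolean> {
  for (let i = 0; i < maxPolls; i++) {
    if (await check()) {
      alert(); // e.g. ring an alarm; never clicks or types anything
      return true;
    }
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  return false; // gave up without the condition becoming true
}
```

Since it never synthesizes input, the worst failure mode is a missed or spurious alarm rather than a wrong click.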
I am intrigued by a future where I can burn seventy dollars per hour watching my cursor click buttons on the computer that I own
```
const getScreenshot = async (windowTitle: string) => {
  const { width, height } = getScreenDimensions();
  const aiDimensions = getAiScaledScreenDimensions();
};
```
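Those helper names suggest the screenshot is downscaled before being sent to the model, which means the model's click coordinates have to be scaled back up to real pixels. A sketch of that mapping (the constant is an assumption, not Agent.exe's actual value):

```typescript
// Map a coordinate from the model's downscaled view back to the real screen.
// AI_WIDTH is a hypothetical target width for the screenshot sent to the model.
const AI_WIDTH = 1280;

function toRealCoords(
  ai: { x: number; y: number },
  screen: { width: number; height: number },
): { x: number; y: number } {
  const scale = screen.width / AI_WIDTH; // assume aspect ratio is preserved
  return { x: Math.round(ai.x * scale), y: Math.round(ai.y * scale) };
}
```

Small errors in this mapping compound with occlusion problems like the Agent.exe window covering the intended target.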
You can connect to desktop containers and VMs running Linux.
We’ve been doing this for a while before Claude made it cool.
> - Lets an AI completely take over your computer
:)
The future is heading in the direction of only suckers using computers. Real wealth is not touching a computer for anything.
"What is my purpose. Existence is pain."
Given time I suspect that strange actions made by AI agents will become the new “ducking” autocorrect.