Show HN: Webctl – Browser automation for agents based on CLI instead of MCP

134 points| cosinusalpha | 1 month ago |github.com

Hi HN, I built webctl because I was frustrated by the gap between curl and full browser automation frameworks like Playwright.

I initially built this to solve a personal headache: I wanted an AI agent to handle project management tasks on my company’s intranet. I needed it to persist cookies across sessions (to handle SSO) and then scrape a Kanban board.

Existing AI browser tools (like current MCP implementations) often force unsolicited data into the context window—dumping the full accessibility tree, console logs, and network errors whether you asked for them or not.

webctl is an attempt to solve this with a Unix-style CLI:

- Filter before context: You pipe the output to standard tools. webctl snapshot --interactive-only | head -n 20 means the LLM only sees exactly what I want it to see.

- Daemon Architecture: It runs a persistent background process. The goal is to keep the browser state (cookies/session) alive while you run discrete, stateless CLI commands.

- Semantic targeting: It uses ARIA roles (e.g., role=button name~="Submit") rather than fragile CSS selectors.
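
The "filter before context" idea from the first bullet can be sketched in a few lines of Python. This is only an illustration: the snapshot line format (a role followed by a quoted name), the set of interactive roles, and the function name filter_snapshot are all assumptions, not webctl's real output or API.

```python
# Hypothetical set of roles treated as "interactive"; webctl's own list may differ.
INTERACTIVE_ROLES = {"button", "link", "textbox", "checkbox", "combobox"}


def filter_snapshot(snapshot: str, limit: int = 20) -> str:
    """Keep only interactive lines and stop after `limit` of them.

    Assumes each snapshot line starts with an ARIA role, e.g. 'button "Save"'.
    """
    kept = []
    for line in snapshot.splitlines():
        parts = line.split(maxsplit=1)
        if parts and parts[0] in INTERACTIVE_ROLES:
            kept.append(line.strip())
        if len(kept) == limit:
            break
    return "\n".join(kept)


snapshot = "\n".join([
    'heading "Board"',
    'button "Add card"',
    'paragraph "3 items"',
    'link "Settings"',
    'textbox "Search"',
])
filtered = filter_snapshot(snapshot, limit=2)
```

Only the filtered text would be handed to the model, which is the same effect as piping through head on the command line.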

Disclaimer: The daemon logic for state persistence is still a bit experimental, but the architecture feels like the right direction for building local, token-efficient agents.

It’s basically "Playwright for the terminal."

38 comments

binalpatel|1 month ago

Cool to see lots of people independently come to "CLIs are all you need". I'm still not sure if it's a short-term band-aid because agents are so good at terminal use or if it's part of a longer-term trend, but it has definitely felt much more seamless to me than MCPs.

(My contribution, one of many: https://github.com/caesarnine/binsmith)

cosinusalpha|1 month ago

I am also not sure if MCP will eventually be fixed to allow more control over context, or if the CLI approach really is the future for Agentic AI.

Nevertheless, I prefer the CLI for other reasons: it is built for humans and is much easier to debug.

fudged71|1 month ago

Thank you for posting binsmith, I've built something similar over the past few days and you've made some great decisions in here

0x696C6961|1 month ago

MCP lets you hide secrets from the LLM

desireco42|1 month ago

Hey this looks cool. So each agent or session is one thread. Nice. I like it.

the_mitsuhiko|1 month ago

At this point I'm fully down the path of the agent just maintaining his own tools. I have a browser skill that continues to evolve as I use it. Beats every alternative I have tried so far.

dtkav|1 month ago

Same. Claude Opus 4.5 one-shots the basics of chrome debug protocol, and then you can go from there.

Plus, now it is personal software... just keep asking it to improve the skill based on your usage. Bake in domain knowledge or business logic or whatever you want.

I'm using this for e2e testing and debugging Obsidian plugins and it is starting to understand Obsidian inside and out.

cosinusalpha|1 month ago

Do you experience any context pollution with that approach?

kinduff|1 month ago

What's the name of the skill?

gregpr07|1 month ago

Creator of Browser Use here, this is cool, really innovative approach with ARIA roles. One idea we have been playing around with a lot is just giving the LLM raw html and a really good way to traverse it - no heuristics, just BS4. Seems to work well, but much more expensive than the current prod ready [index]<div ... notation

cosinusalpha|1 month ago

Thanks!

I actually tried a raw HTML approach when I was exploring solutions. It worked for "one-off" tasks, but I ran into major issues with replayability on modern SPAs.

In React apps, the raw DOM structure and auto-generated IDs shift so frequently that a script generated from "Raw HTML" often breaks 10 minutes later. I found ARIA/semantics to be the only stable contract that persists across re-renders.
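
A toy illustration of that point, with made-up data: the two lists below stand in for the same page before and after a re-render, where only the generated IDs change. The node dictionaries and helper names are hypothetical and are not webctl's data model.

```python
# Same UI before and after a re-render; only the auto-generated ids differ.
before_rerender = [
    {"id": "btn-x7f2", "role": "button", "name": "Add card"},
    {"id": "inp-9c1d", "role": "textbox", "name": "Card title"},
]
after_rerender = [
    {"id": "btn-k3q9", "role": "button", "name": "Add card"},
    {"id": "inp-0a4e", "role": "textbox", "name": "Card title"},
]


def find_by_id(nodes, node_id):
    """Locate a node by its generated id (breaks when ids are regenerated)."""
    return next((n for n in nodes if n["id"] == node_id), None)


def find_by_role_name(nodes, role, name):
    """Locate a node by ARIA role and accessible name."""
    return next(
        (n for n in nodes if n["role"] == role and n["name"] == name), None
    )


# A script recorded against the first render stored the id "btn-x7f2".
stale = find_by_id(after_rerender, "btn-x7f2")
stable = find_by_role_name(after_rerender, "button", "Add card")
```

The id lookup recorded earlier no longer finds anything, while the role and name lookup still resolves to the same button.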

You mentioned the raw HTML approach is "expensive". Did you feed the full HTML into the context, or did you create a BS4 "tool" for the LLM to query the raw HTML dynamically?

TheTaytay|1 month ago

I really like this idea!

I’d like to see this other browser plugin’s API be exposed via your same CLI, so I don’t have to only control a separate browser instance. https://github.com/remorses/playwriter (I haven’t investigated enough to know how feasible it is, but as I was reading about your tool, I immediately wanted to control existing tabs from my main browser, rather than “just” a debug-driven separate browser instance.)

cosinusalpha|1 month ago

Thanks! To clarify: webctl allows you to manually interact with the browser window at any time. It even returns "manual interaction" breakpoints to stdout if it detects an SSO/login wall.

But I agree, attaching to the OS "daily driver" instance specifically would be a nice addition.

randito|1 month ago

If you look at the Elixir keynote for Phoenix.new -- a cool agentic coding tool -- you'll see some hints about browser control using an API tool call. It's called "web" in the video.

Video: https://youtu.be/ojL_VHc4gLk?t=2132

More discussion: https://simonwillison.net/2025/Jun/23/phoenix-new/

renegat0x0|1 month ago

A little bit different, but it also allows efficient scraping. It uses JSON over HTTP rather than a CLI.

https://github.com/rumca-js/crawler-buddy

More like a framework for other mechanisms

philipbjorge|1 month ago

This looks remarkably similar to https://github.com/vercel-labs/agent-browser

How is it different?

cosinusalpha|1 month ago

To be honest, I hadn't seen that one yet!

The main difference is likely the targeting philosophy. webctl relies heavily on ARIA roles/semantics (e.g. role=button name="Save") rather than injected IDs or CSS selectors. I find this makes the automation much more robust to UI changes.
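
A minimal sketch of how a query such as role=button name~="Submit" could be parsed and matched. The Node shape, the helper names, and the reading of ~= as a case-insensitive substring match are assumptions for illustration; this is not webctl's actual implementation.

```python
import re
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical accessibility node with just a role and a name."""
    role: str
    name: str


# One term: key, operator ("=" exact, "~=" contains), quoted or bare value.
_TERM = re.compile(r'(\w+)(~=|=)(?:"([^"]*)"|(\S+))')


def parse_query(query):
    """Split 'role=button name~="Submit"' into (key, op, value) tuples."""
    terms = []
    for m in _TERM.finditer(query):
        key, op, quoted, bare = m.groups()
        terms.append((key, op, quoted if quoted is not None else bare))
    return terms


def matches(node, terms):
    """Return True if the node satisfies every term of the query."""
    for key, op, value in terms:
        actual = getattr(node, key, None)
        if actual is None:
            return False
        if op == "=" and actual != value:
            return False
        if op == "~=" and value.lower() not in actual.lower():
            return False
    return True


def find(nodes, query):
    terms = parse_query(query)
    return [n for n in nodes if matches(n, terms)]


nodes = [
    Node("button", "Submit order"),
    Node("button", "Cancel"),
    Node("link", "Submit feedback"),
]
hits = find(nodes, 'role=button name~="Submit"')
```

Here hits contains only the "Submit order" button, since the link fails the role term and "Cancel" fails the name term.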

Also, I went with Python for V1 simply for iteration speed and ecosystem integration. I'd love to rewrite in Rust eventually, but Python was the most efficient way to get a stable tool working for my specific use case.

hugs|1 month ago

vibium clicker, too. https://github.com/VibiumDev/vibium/blob/main/CONTRIBUTING.m...

"browser automation for ai agents" is a popular idea these days.

desireco42|1 month ago

How are you holding the session if every command is issued through the CLI? I assume this is essential for automation.

cosinusalpha|1 month ago

A background daemon holds the session state between different CLI calls. This daemon is started automatically on the first webctl call and auto-closes after a timeout period of inactivity to save resources.
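
The idea can be sketched roughly as follows. This is a toy in-process model, not webctl's code: the SessionDaemon class, its "set" and "get" commands, and the 300-second default are all invented for illustration, and a real daemon would sit behind a socket and own a browser rather than a dictionary.

```python
import time


class SessionDaemon:
    """Toy stand-in for a daemon that keeps state between CLI calls."""

    def __init__(self, idle_timeout_s=300.0, clock=time.monotonic):
        self._state = {}          # stands in for cookies / browser session
        self._timeout = idle_timeout_s
        self._clock = clock       # injectable so the timeout can be tested
        self._last_used = clock()

    def handle(self, command, *args):
        """Serve one stateless CLI call and reset the idle timer."""
        self._last_used = self._clock()
        if command == "set":
            key, value = args
            self._state[key] = value
            return None
        if command == "get":
            (key,) = args
            return self._state.get(key)
        raise ValueError(f"unknown command: {command}")

    def should_exit(self):
        """True once no call has arrived for the whole timeout period."""
        return self._clock() - self._last_used >= self._timeout


# Simulated clock so the example runs instantly.
now = [0.0]
daemon = SessionDaemon(idle_timeout_s=10.0, clock=lambda: now[0])
daemon.handle("set", "session_cookie", "abc123")
now[0] = 5.0
cookie = daemon.handle("get", "session_cookie")  # state survived between calls
```

Each CLI invocation would be a short-lived client that sends one command and exits, while the long-lived process keeps the session and shuts itself down when should_exit() becomes true.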

grigio|1 month ago

Is there a benchmark? There are a lot of scraping agents nowadays.

cosinusalpha|1 month ago

I don't have an objective benchmark yet. I tried several existing solutions, especially the MCP servers for browser automation, and none of them were able to reproducibly solve my specific task.

An objective benchmark is a great idea, especially to compare webctl against other similar CLI-based tools. I'll definitely look into how to set that up.

Agent_Builder|1 month ago

[deleted]

cosinusalpha|1 month ago

I actually think the CLI approach helps with those boundaries. Because webctl commands are discrete and pipeable (e.g. webctl snapshot | llm | webctl click), the "authority" is reset at every step of the pipeline. It feels easier to audit a text stream of commands than a socket connection that might be accumulating invisible context.
