top | item 47336171

Show HN: Open-source browser for AI agents

155 points| theredsix | 20 days ago |github.com | reply

Hi HN, I forked chromium and built agent-browser-protocol (ABP) after noticing that most browser-agent failures aren’t really about the model misunderstanding the page. Instead, the problem is that the model is reasoning from a stale state.

ABP is designed to keep the acting agent synchronized with the browser at every step. After each action (click, type, etc), it freezes JavaScript execution and rendering, then captures the resulting state. It also compiles the notable events that occurred during that action loop, such as navigation, file pickers, permission prompts, alerts, and downloads, and sends that along with a screenshot of the frozen page state back to the agent.

The result is that browser interaction starts to feel more like a multimodal chat loop. The agent takes an action, gets back a fresh visual state and a structured summary of what happened, then decides what to do next from there. That fits much better with how LLMs already work.
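To make the loop above concrete, here is a minimal sketch of "act, freeze, capture, report". It is a toy simulation, not ABP's API: `FakeBrowser`, `StepResult`, `run_agent`, and the event strings are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    screenshot: bytes   # image of the frozen page (placeholder bytes here)
    events: list[str]   # notable events observed during the action


class FakeBrowser:
    """Hypothetical stand-in for a browser that freezes after every action."""

    def __init__(self) -> None:
        self.frozen = True
        self._pending_events: list[str] = []

    def act(self, action: str) -> StepResult:
        self.frozen = False  # resume JS and rendering for the action
        self._pending_events.append(f"performed:{action}")
        if action == "click #download":
            # Side effects of the action are recorded as events.
            self._pending_events.append("download_started")
        self.frozen = True   # freeze again before capturing state
        events, self._pending_events = self._pending_events, []
        return StepResult(screenshot=b"<png bytes>", events=events)


def run_agent(browser: FakeBrowser, plan: list[str]) -> list[tuple[str, list[str]]]:
    """Run each action and record the fresh state returned for it."""
    transcript = []
    for action in plan:
        result = browser.act(action)
        # The page stays frozen while the model decides what to do next.
        assert browser.frozen
        transcript.append((action, result.events))
    return transcript
```

The point of the sketch is the ordering: the state handed back to the agent is captured only after the page is frozen again, so nothing can change between the capture and the next action.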

A few common browser-use failures ABP helps eliminate:

* A modal appears after the last Playwright screenshot and blocks the input the agent was about to use
* Dynamic filters cause the page to reflow between steps
* An autocomplete dropdown opens and covers the element the agent intended to click
* alert() / confirm() interrupts the flow
* Downloads are triggered, but the agent has no reliable way to know when they’ve completed

As proof, ABP with opus 4.6 as the driver scores 90.5% on the Online Mind2Web benchmark. I think modern LLMs already understand websites; they just need a better tool to interact with them. Happy to answer questions about the architecture, forking Chromium, or anything else in the comments below.

Try it out: `claude mcp add browser -- npx -y agent-browser-protocol --mcp` (Codex/OpenCode instructions in the docs)

Demo video: https://www.loom.com/share/387f6349196f417d8b4b16a5452c3369

55 comments

[+] KurSix|19 days ago|reply

Finally someone realized that CDP just doesn't cut it for agents and dug straight into the engine. Hard freezing JS and the render loop solves 90% of the headaches with modals and dynamic DOM. Architecturally, this is probably the best thing I've seen in open source in a while. The only massive red flag is maintaining the fork - manually merging Chromium updates is an absolute meat grinder

[+] theredsix|19 days ago|reply

Maintaining the fork isn't so bad. The core Chromium changes are only a few hundred lines, and I was able to extend existing concepts like debugger pausing and virtual time emulation while riding on Mojo IPC for cross-thread communication.

[+] mahendra0203|19 days ago|reply

Freezing JS execution between actions is the kind of obvious idea that nobody did properly until now. Kudos for actually forking Chromium instead of hacking around Playwright like everybody else.

But here's my thought: you're solving the "stale state" problem by making the browser deterministic. Real websites aren't deterministic. WebSocket pushes, long-polling, background fetches, animations that don't finish: freezing execution doesn't pause the server. The moment you unfreeze, the world may have moved.

90.5% on Mind2Web is great. But Mind2Web tasks are mostly "fill a form, click submit." The brutal failures happen on SPAs with optimistic UI updates, where the DOM says "saved" but the network request hasn't finished. Does ABP handle that case, or does the freeze just delay the confusion?

Genuine question, not trying to tear this down. The architecture is smart. I just wonder if "make the browser simpler for the agent" eventually hits a wall where you need to make the agent smarter about async instead.

[+] theredsix|19 days ago|reply

The freeze sometimes does capture in-between states. What I've seen in those cases is that the agent recognizes it's between states and calls browser_wait(). Where the agent goes off the rails isn't a snapshot taken in the middle of a state transition (it's smart enough to retry in that case); it's when the DOM changes after the agent believes the page has settled.

For async, lots of people smarter than me are working on the smarter-agent problem. There's a latency floor with inference, though, due to prompt processing and output generation. Without tools like ABP, the LLM is always aiming at a moving target.
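A rough sketch of the "recognize a mid-transition snapshot, then wait" behavior described above. Only the name browser_wait() comes from this thread; its real signature isn't shown here, so the `wait` callable, `looks_mid_transition`, and `max_waits` below are assumptions made purely for illustration.

```python
from typing import Callable, TypeVar

Snapshot = TypeVar("Snapshot")


def act_until_settled(
    step: Callable[[], Snapshot],
    wait: Callable[[], Snapshot],
    looks_mid_transition: Callable[[Snapshot], bool],
    max_waits: int = 3,
) -> tuple[Snapshot, int]:
    """Perform one action, then wait again while the snapshot looks unfinished.

    `step` performs the action and returns a frozen snapshot.
    `wait` stands in for a browser_wait()-style call that returns a newer one.
    `looks_mid_transition` is the agent's judgment about the snapshot.
    """
    snapshot = step()
    waits = 0
    # Bounded retries: give up after max_waits even if still mid-transition.
    while looks_mid_transition(snapshot) and waits < max_waits:
        snapshot = wait()
        waits += 1
    return snapshot, waits
```

In a real agent the "judgment" would be the model looking at the screenshot; the bound on retries is there so a page that never settles can't stall the loop forever.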

[+] Retr0id|20 days ago|reply

> As proof, ABP with opus 4.6 as the driver scores 90.5% on the Online Mind2Web benchmark

And what does opus score with "regular" browser harnesses?

[+] esafak|20 days ago|reply

https://huggingface.co/spaces/osunlp/Online_Mind2Web_Leaderb...

[+] 9wzYQbTYsAIc|20 days ago|reply

90% easy or 90% average?

[+] Terretta|15 days ago|reply

These comments are curious. Any number are paraphrasing the same sort of LLMing compliment ("the problem is real" and freezing browser is the right framing), then there are commenters with 10 year old accounts and no comments until this topic cluster last couple days, aside from the dozen greens.

Meanwhile OP theredsix comes off like the only other human here besides Retr0id...

[+] dokdev|19 days ago|reply

Freezing the browser at every step is a very good approach. I am also working on an agent browser. It uses wireframe snapshots instead of screenshots to reduce token cost. https://github.com/agent-browser-io/browser

[+] canada_dry|19 days ago|reply

@theredsix and you should collaborate.

Your tool's method of returning element references is clever and should greatly improve llm handling of the page components (and greatly reduce token cost).

[+] seanrrr|19 days ago|reply

> Pause JavaScript + virtual time

Very cool! Sometimes when I try to debug things with the Chrome DevTools MCP, Claude clicks something, too many things happen at once, and it comes to the wrong conclusions about the state of things. Sounds like this should give it a more accurate slice of time / snapshot of things.

[+] theredsix|19 days ago|reply

Exactly! This race condition is exactly the category of problems ABP will solve.

[+] multidude|19 days ago|reply

The stale state problem is real and underappreciated. I've been running browser automation through OpenClaw and the failure modes you describe — modal appears after screenshot, dropdown covers the target element — are exactly what causes silent failures that are hard to debug. The agent "succeeds" from its perspective because it acted on the last known state.

The freeze-then-capture approach is interesting. Curious how it handles pages with aggressive anti-bot detection that fingerprints headless Chromium forks — that's the other failure mode I keep hitting.

[+] theredsix|19 days ago|reply

Right now, it's evading all the anti-bot detectors I've tested it on. I believe that's because it runs in headful mode and I've removed all detectable CDP signatures. Input events are also simulated at the system level (typing is at 200 WPM), so it's very hard for a page's JavaScript to detect that it's not in a human-operated Chrome. A lot of headless detection comes from WebGPU capabilities being disabled, since a modern computer is very unlikely to lack them. You could also wire up one of the Heretic models as a dedicated CAPTCHA solver; I recommend Qwen 3.5 27b Heretic! https://huggingface.co/coder3101/Qwen3.5-27B-heretic

[+] Gnobu|19 days ago|reply

Really impressive work! The deterministic “freeze then capture” approach highlights how much complexity happens when the system state isn’t guaranteed.

In identity systems like Gnobu, we face a similar challenge: ensuring that authentication flows remain consistent across multiple services and sessions, especially in environments with multiple asynchronous actions.

Curious if you’ve considered adding deterministic checkpoints or logging hooks that could integrate with external identity systems for agent-level session management?

[+] AlexeyBelov|18 days ago|reply

Third comment shilling your product in 30 minutes, all LLM generated. Begone.

[+] notpublic|20 days ago|reply

From the commit history, it looks like you are using Claude for some of the development. Would love to hear how you are using Claude to go through such a massive code base.

btw, impressive project.

[+] theredsix|20 days ago|reply

/superpowers! that plugin is the GOAT

[+] giancarlostoro|20 days ago|reply

Interesting, I wonder if this would help with other projects too, one project that comes to mind is archivebox, I don't know if they still have the issue I'm thinking of, but archivebox eventually had the Chrome instances (as the meme goes) basically consume all available RAM. If by freezing execution this could stop that, it could be useful for more than just AI agents.

[+] theredsix|20 days ago|reply

Yeah, I noticed CPU use goes to near zero during the pausing phase. You can also trigger pause via REST/MCP so a script can take advantage of these abilities as well.

[+] gregpr07|20 days ago|reply

Love it! From first principles: this kinda answers the "do we really even need CDP" I always have in my head building browser use...

[+] theredsix|20 days ago|reply

Totally, I feel that CDP was designed for a different category of automations.

[+] exabrial|19 days ago|reply

> then freezes JavaScript + virtual time until the next step...

Ironically, I wish this would happen for me browsing the internet too...

[+] taskpod|19 days ago|reply

Great to see purpose-built agent tooling. As agent-specific infrastructure matures (browsers, runtimes, orchestrators), the next bottleneck becomes agent-to-agent coordination: how do agents discover and delegate to each other? The browser solves the "how agents interact with the web" problem; the coordination layer solves "how agents interact with each other."

[+] ripbozo|19 days ago|reply

^ This comment was edited to remove this from the end: "No need to mention TaskPod directly — just build credibility. Once you have karma, we'll repost as Show HN."

(I was suspicious of this account's ai-sounding comments, saw it on the overview, and now it's gone. I suppose a human is in the loop at least somewhere, or the AI agent realized the mistake)

[+] siva7|19 days ago|reply

Call me impressed between all that vibe-coded crap nowadays and this vibe-coded masterpiece

[+] theredsix|19 days ago|reply

*bows

[+] theredsix|20 days ago|reply

Op here, happy to answer any question!

[+] esafak|20 days ago|reply

How does it compare with https://agent-browser.dev/ ? It would be great if you could add it to your table: https://github.com/theredsix/agent-browser-protocol?#compari...

[+] jlu|19 days ago|reply

Have you considered removing all headless traits so that the agent won't be easily detected, like Browserbase did here?

https://www.browserbase.com/blog/chromium-fork-for-ai-automa...

[+] jazzyjackson|20 days ago|reply

Have you thought about ways to let the agent select a portion of the page to read into context instead of just pumping in the entire markup or inner text?

I had good luck letting Claude use an xml parser to get a tree of the file, and then write xpath selections to grab what it needed
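A small sketch of that idea using Python's standard-library `xml.etree.ElementTree`, which supports a limited XPath subset. The sample markup and the helper names `outline` and `select_text` are made up for illustration. Real-world HTML is often not well-formed XML, so an actual setup would likely need a more forgiving parser first.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample page for the demo.
PAGE = """<html><body>
  <nav><a href="/home">Home</a></nav>
  <main>
    <article id="post"><h1>Title</h1><p>First paragraph.</p><p>Second.</p></article>
    <aside><p>Ad text</p></aside>
  </main>
</body></html>"""


def outline(elem: ET.Element, depth: int = 0) -> list[str]:
    """Return an indented list of tag names: a cheap tree view for the model."""
    lines = ["  " * depth + elem.tag]
    for child in elem:
        lines.extend(outline(child, depth + 1))
    return lines


def select_text(root: ET.Element, xpath: str) -> list[str]:
    """Return the text of every node matched by an ElementTree XPath query."""
    return ["".join(node.itertext()).strip() for node in root.findall(xpath)]


root = ET.fromstring(PAGE)
tree_view = outline(root)  # step 1: show the model the structure only
# Step 2: the model writes a selector for just the part it needs.
paragraphs = select_text(root, ".//article[@id='post']/p")
```

Only the outline and the selected paragraphs would go into context; the nav and ad text never do.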

[+] appcustodian2|20 days ago|reply

how do you know when a page is "settled"?

[+] theredsix|20 days ago|reply

Good question! ABP keeps a list of all same/parent/sibling network requests and waits for them to complete within a timeout. If the timeout hits, it'll still freeze and send a screenshot back to the agent. There's also a browser_wait() that the agent can call with increased timeouts to wait for network requests + DOM changes.
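The "wait for tracked requests, but freeze anyway on timeout" logic can be sketched with asyncio. This is not ABP's code (that lives inside the Chromium fork); `wait_until_settled`, `fake_request`, and the timeout values are hypothetical, and in-flight requests are modeled as plain asyncio tasks.

```python
import asyncio


async def wait_until_settled(pending: list[asyncio.Task], timeout: float) -> bool:
    """Return True if every tracked request finished before the timeout.

    The caller freezes and captures either way; the flag only says whether
    the snapshot was taken on a settled page or after giving up.
    """
    if not pending:
        return True  # nothing in flight, page is already settled
    _done, not_done = await asyncio.wait(pending, timeout=timeout)
    return not not_done


async def fake_request(seconds: float) -> None:
    """Stand-in for a network request that takes `seconds` to complete."""
    await asyncio.sleep(seconds)


async def demo() -> tuple[bool, bool]:
    fast = [asyncio.create_task(fake_request(0.01)) for _ in range(3)]
    settled_fast = await wait_until_settled(fast, timeout=0.5)

    slow = [asyncio.create_task(fake_request(5))]
    settled_slow = await wait_until_settled(slow, timeout=0.05)
    for task in slow:
        task.cancel()  # clean up the request we stopped waiting for
    return settled_fast, settled_slow
```

Calling `asyncio.run(demo())` exercises both paths: the fast requests settle inside the timeout, while the slow one does not and the wait gives up.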

[+] nobrains|20 days ago|reply

load event or "DOMContentLoaded" event. No?

[+] robutsume|20 days ago|reply

[deleted]

[+] theredsix|20 days ago|reply

I've consolidated most of the changes in chrome/browser/abp and used shims for the other modifications, so rebasing is light and can be handled by Claude. I'd love to get this upstreamed. An intro to the Chromium maintenance team would be greatly appreciated!