A stateful browser agent using self-healing DOM maps

tnolet|4 months ago

This is, as far as I understand, self healing ONLY if the name of a CSS class changes. Not for anything else. That seems like a very very very very narrow definition of "self healing": there are 9999 other subtle or not so subtle things that can change per session or per update version of a page.

If you run this against, say, a typical e-commerce page where the navigation and all screen elements are super dynamic (user-specific data, language, etc.), this problem becomes even harder.

ljm|4 months ago

My running hypothesis on this is that AI is a sentient screenreader and the last thing you should be worrying about is CSS class names, IDs, data-testid attributes, DOM traversal, and all of these things that are essentially querying the 'internal state' of a page. Classes, IDs, data attributes, etc. aren't a public API; semantic elements, ARIA attributes, etc. are.

So, focus on WCAG compliance, following the spec as faithfully as you can. The style or presentation of something may change as part of a simple A/B test but the underlying purpose or behaviour would remain the same.

pverheggen|4 months ago

I feel like this could work if the selectors are chosen carefully to capture semantic meaning, rather than basing off of something arbitrary like a class name. The agent must have some understanding of the document to be able to perform those actions in the first place.

If it can find an ellipse tool, it's likely based off some combination of accessible role, accessible name, and inner text (perhaps the icon if it's multi-modal.) So in theory, couldn't it capture that criteria in a JS snippet and replay it?
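A minimal sketch of that "capture the criteria and replay it" idea, assuming a simplified accessibility-tree node rather than a real browser API. The `A11yNode` and `SemanticLocator` shapes and the `findAll` helper are hypothetical names made up for illustration:

```typescript
// Simplified stand-in for an accessibility-tree node (hypothetical shape,
// not a real browser API).
interface A11yNode {
  role: string; // accessible role, e.g. "button"
  name: string; // accessible name, e.g. from aria-label
  text: string; // visible inner text
  children: A11yNode[];
}

// What the agent records once it has found an element.
interface SemanticLocator {
  role: string;
  name?: string;         // compared case-insensitively when present
  textIncludes?: string; // substring match on inner text when present
}

function matches(node: A11yNode, loc: SemanticLocator): boolean {
  if (node.role !== loc.role) return false;
  if (loc.name !== undefined && node.name.toLowerCase() !== loc.name.toLowerCase()) return false;
  if (loc.textIncludes !== undefined && !node.text.includes(loc.textIncludes)) return false;
  return true;
}

// Depth-first replay: collect every node that satisfies the locator.
function findAll(root: A11yNode, loc: SemanticLocator): A11yNode[] {
  const hits: A11yNode[] = matches(root, loc) ? [root] : [];
  for (const child of root.children) hits.push(...findAll(child, loc));
  return hits;
}

const toolbar: A11yNode = {
  role: "toolbar", name: "Drawing tools", text: "", children: [
    { role: "button", name: "Rectangle", text: "", children: [] },
    { role: "button", name: "Ellipse", text: "", children: [] },
  ],
};

// The recorded criteria contain no class names or IDs at all.
const ellipseTool: SemanticLocator = { role: "button", name: "ellipse" };
const found = findAll(toolbar, ellipseTool);
```

In a real browser the same two properties can be queried with something like Playwright's `getByRole("button", { name: "Ellipse" })`. A locator like this survives a class-name change but would still break if the accessible name is translated or reworded.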

artpar|4 months ago

Everyone thinks of typical e-commerce pages when it comes to "browser agent doing something", but our real use cases are far from shopping for the user. Your point still stands, though. There may be websites where generating stable selectors/hierarchy maps won't solve the problem, but 80% (as in 80-20) of websites are not like that, including a lot of internal dashboards/interfaces. (There will also be issues on websites with proper i18n implementations if the selectors are based on ARIA labels.)

Self-healing CSS selectors are also only one part of the story. The other part is the cohesive interface for the agent itself to use these selectors.

philo23|4 months ago

Maybe this is a lack of understanding on my part, but this bit of the explanation sets off alarm bells for me:

> Under the hood, we're building a client-sourced RAG for the DOM. An agent's first move on a page is to check a vector DB for a known "map." ... This creates a wild side-effect: the system is self-healing for everyone. One person's failed automation accidentally fixes it for the next hundred users.

I think I'd like to know exactly what kind of data is extracted from the DOM to build that shared map.

artpar|4 months ago

Agent4 is going to store "stable selectors" that worked (when it performs a task for the first time, most of the time is spent identifying these CSS/XPath selectors). Memories are pretty straightforward at this point: they are stored locally in your browser's IndexedDB (you can inspect them from the Chrome inspector).
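To make the question of "what exactly gets stored" concrete, here is a rough sketch of what a selector memory could look like. This is not Agent4's actual schema: the `SelectorMemory` fields and the `MemoryStore` class are assumptions, and a plain `Map` stands in for an IndexedDB object store so the example runs anywhere.

```typescript
// One remembered selector. All field names are hypothetical.
interface SelectorMemory {
  origin: string;   // e.g. "https://example.com"
  purpose: string;  // what the element is for, e.g. "search box"
  selector: string; // CSS or XPath selector that worked
  successes: number;
  failures: number;
}

class MemoryStore {
  // A Map stands in for an IndexedDB object store in this sketch.
  private entries = new Map<string, SelectorMemory>();

  private key(origin: string, purpose: string): string {
    return `${origin}::${purpose}`;
  }

  recall(origin: string, purpose: string): SelectorMemory | undefined {
    return this.entries.get(this.key(origin, purpose));
  }

  // Record the outcome of using a selector. A different selector for the
  // same purpose replaces the old entry and resets its counters.
  record(origin: string, purpose: string, selector: string, worked: boolean): void {
    const k = this.key(origin, purpose);
    const prev = this.entries.get(k);
    const base: SelectorMemory = prev && prev.selector === selector
      ? prev
      : { origin, purpose, selector, successes: 0, failures: 0 };
    this.entries.set(k, {
      ...base,
      successes: base.successes + (worked ? 1 : 0),
      failures: base.failures + (worked ? 0 : 1),
    });
  }
}
```

Note that an entry shaped like this holds only a selector and a label for its purpose, no page text or form values, which is the kind of detail the privacy question above is really asking about.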

simpaticoder|4 months ago

Couldn't you solve this by having the agent do a first pass through a page and generate a (java)script that interacts with the interesting parts of the page, and then prepend the script (if it's short enough) or a list of entry points (if it's not) to the prompt such that subsequent interactions invoke the script rather than interact directly with the page?
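The "prepend the script if it's short enough, otherwise a list of entry points" part can be sketched roughly as follows. The `PageScript` shape, the `buildPrompt` name, and the character budget are all assumptions for illustration:

```typescript
interface PageScript {
  source: string; // generated JavaScript for the page
  entryPoints: { name: string; description: string }[];
}

// Hypothetical budget; a real system would count tokens, not characters.
const MAX_INLINE_CHARS = 2000;

function buildPrompt(task: string, script: PageScript, maxInline = MAX_INLINE_CHARS): string {
  const header = script.source.length <= maxInline
    // Short script: inline it in full.
    ? `Page script (call these functions instead of touching the DOM):\n${script.source}`
    // Long script: only advertise the entry points.
    : "Available page functions:\n" +
      script.entryPoints.map(e => `- ${e.name}: ${e.description}`).join("\n");
  return `${header}\n\nTask: ${task}`;
}
```

Either way, the model is steered toward calling named functions, so later turns do not have to rediscover the page structure.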

artpar|4 months ago

If I am reading you correctly, you captured the whole essence of agent4.

So it does the first pass (based on your goals) and makes memories (and these are local).

Now you tell the agent you want to do this repeatedly, so it will make a workflow for you based on these memories and interactions. (These workflows are saved on the server and are all public for now, but we are working out permissions/group-based access.)

The problem is that, many times, what the agent thinks is stable isn't really, so there is a feedback loop for the agent to test out the workflows and improve them. (It's basically Claude Code/Codex sitting in the browser.)

Workflow details are appended to prompt based on user query match/opened tabs match.
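The test-and-improve loop described above could look roughly like this. It is only a sketch under stated assumptions: `tryStep` and `rediscover` are hypothetical callbacks (the first performs a step in the page, the second is the expensive "ask the agent for a new selector" path), and real versions would be asynchronous.

```typescript
interface Step { purpose: string; selector: string; }

interface Hooks {
  tryStep: (step: Step) => boolean;               // did the step work in the page?
  rediscover: (purpose: string) => string | null; // agent proposes a fresh selector
}

interface RunResult { ok: boolean; healed: string[]; steps: Step[]; }

function runWithHealing(steps: Step[], hooks: Hooks): RunResult {
  const healed: string[] = [];
  const updated: Step[] = [];
  for (const step of steps) {
    let current = step;
    if (!hooks.tryStep(current)) {
      const fresh = hooks.rediscover(step.purpose);
      // Give up if there is no new selector or the new one fails too;
      // the stored workflow is left as it was.
      if (fresh === null || !hooks.tryStep({ ...step, selector: fresh })) {
        return { ok: false, healed: [], steps };
      }
      current = { ...step, selector: fresh };
      healed.push(step.purpose);
    }
    updated.push(current);
  }
  // On success, `steps` carries the repaired selectors to save back.
  return { ok: true, healed, steps: updated };
}
```

A changed selector gets repaired here, but a changed flow (a new step, a removed page) still ends in `ok: false`, which matches the selector-vs-UX distinction raised elsewhere in the thread.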

bogdanoff_2|4 months ago

Asking here because it seems related: I'm trying to use Cursor to work on a webapp. It gets frustrating because vanilla Cursor is "coding blind" and can't actually see the result of what it is doing, and whether or not it works.

I ask it to fix something. It claims to know what the problem is, changes the code, and then claims it's fixed. I open the app, and it's still broken. I have to tell it what is broken repeatedly, and way too often.

Now, supposing I'm "vibe coding" and don't really care about the obvious fact that the AI doesn't actually know what it is doing, it's still frustrating that I have to be in the loop just to provide very basic information like that.

Are there any agentic coding setups that allow the agent to interact with the app it's working on to check if it actually works?

JimDabell|4 months ago

You can use things like Browser Use and Playwright to hook things like that up, but you’re right, this is a very underdeveloped area. Armin Ronacher has a talk that covers some of this, such as unifying console.log, server logs, SQL, etc. to feed back to the LLM.

https://www.youtube.com/watch?v=nfOVgz_omlU

xnx|4 months ago

Gemini CLI Chrome devtools MCP addresses this: https://developer.chrome.com/blog/chrome-devtools-mcp

tomashubelbauer|4 months ago

Look into the Playwright MCP server, it allows coding agents to scrutinize the results of their work in the web browser. There is also an MCP server for the Chrome DevTools protocol AFAIK but I haven't tried it.

shardullavekar|4 months ago

A built-in MCP server that takes a look at what's broken and communicates with Cursor is on our roadmap. Join Discord and we will keep you posted there.

artpar|4 months ago

So actually I have this setup (a bridge server) which I use for agent4 itself, so Claude Code can talk to agent4. It makes a lot of sense to publish that bridge in MCP form as well.

klntsky|4 months ago

I vibed something like this for markdown extraction just a week ago: https://github.com/promptware/readweb

Opensourced it just now.

More specifically, it works like this:

  suggestPreset: HTML -> Preset (via LLM)
  applyPreset: HTML + Preset -> Markdown (programmatically)

Where preset is:

  type Preset = {
    // anchors to make this preset more fragile on purpose.
    // Elements that identify website engine layout go here.
    preset_match_detectors: CSSSelector[];
    // main content extractors
    main_content_selectors: CSSSelector[];
    // filter selectors to trim the main content.
    // banners, subscription forms, sponsor content, etc.
    main_content_filters: CSSSelector[];
  };

suggestPreset uses a feedback loop that refines and applies the preset until the markdown is really clean.
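A rough sketch of how such a loop might be wired, not taken from the readweb source. The `Deps` interface, its `propose`/`apply`/`critique` callbacks, the `maxRounds` cap, and the `CSSSelector = string` alias are all assumptions; real LLM calls would be asynchronous.

```typescript
type CSSSelector = string; // assumption: selectors are plain strings

type Preset = {
  preset_match_detectors: CSSSelector[];
  main_content_selectors: CSSSelector[];
  main_content_filters: CSSSelector[];
};

// Hypothetical dependencies, injected so the loop itself stays testable.
interface Deps {
  propose: (html: string, feedback: string | null) => Preset; // LLM call
  apply: (html: string, preset: Preset) => string;            // applyPreset
  critique: (markdown: string) => string | null;              // null means "clean"
}

function suggestPreset(html: string, deps: Deps, maxRounds = 3): Preset {
  let feedback: string | null = null;
  let preset = deps.propose(html, feedback);
  // Up to maxRounds proposals in total: apply, critique, re-propose.
  for (let round = 1; round < maxRounds; round++) {
    feedback = deps.critique(deps.apply(html, preset));
    if (feedback === null) break; // markdown is clean enough
    preset = deps.propose(html, feedback);
  }
  return preset;
}
```

The LLM is only in the loop while the preset is being built; once it exists, applying it to later pages with the same layout is purely programmatic.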

rco8786|4 months ago

This tool seems relevant to my interests, but I gotta say I cannot figure out how to use the extension.

It seems like I'm only able to use the pre-existing/canned workflows that are provided under different "Persona"s? And there's no way for me to just create a new workflow from scratch for my specific use case.

Am I missing something obvious?

shardullavekar|4 months ago

We launched Agent4 recently. You can install it from here: https://chromewebstore.google.com/detail/agent4/kipkglfnhnpb...

The one you refer to will be taken down soon. Ping me on Discord if you need help trying it.

unknown|4 months ago

[deleted]

jadbox|4 months ago

Is Agent4 open-source? I'm only installing OSS browser extensions for some level of verification.

artpar|4 months ago

No, it is not open-source. But it is not obfuscated either, so you can always look into the code by downloading the plugin from the Chrome Web Store, if you are into that kind of verification (and these days LLMs can help with that a lot).

ripped_britches|4 months ago

"One person's map fixes everyone else's"

Hm somehow I feel like this is a giant step in the wrong direction.

artpar|4 months ago

Worst case scenario, we can just shut down sharing/public workflows altogether. Or do you have something else in mind?

arjunchint|4 months ago

So to make sure I am understanding this: even if a site updates its selectors weekly, like LinkedIn for example, your automation agent would still continue to work.

But if a website changes its UX and your recording no longer works then it will fail?

I work in the browser agent space myself. Although you save on cost with these repeatable recordings, the true disruption of browser agents is using one prompt on thousands of websites without having to worry about maintenance at all.

virajk_31|4 months ago

This may not be relevant, but it also can't bypass many CF-protected websites. Even completing the checkbox challenge doesn't resolve the issue; it results in infinite redirects to the same Turnstile page. I believe this isn't an LLM interpretation issue but rather a problem with how the browser requests are being treated by security endpoint mechanisms.

arkmm|4 months ago

Neat approach, but seems like the eventual goal of caching DOM maps for all users would be a privacy nightmare?

artpar|4 months ago

Yes, I can imagine PI somehow being stored in the workflow. I frequently see LLMs hardcoding tests just to make the user happy, and the same can happen in the browser: if something is too hard to scrape but the agent can infer it from a screenshot, it might end up making a workflow that seems correct but is just hardcoded with data. We are thinking of multiple guards/blocks to keep users from creating such a workflow, but the risks that come with an open-ended agent are still going to be present.

nateb2022|4 months ago

Why not consider the cosine similarity between elements and focus on developing better feature vectors? Possibly models trained with a focus on DOM semantics and graph structure?
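For reference, a minimal sketch of what matching elements by cosine similarity could look like. How the feature vectors are produced is the hard part and is left out; the `Candidate` shape, the `bestMatch` helper, and the 0.8 threshold are hypothetical:

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // a zero vector has no direction
  return dot / Math.sqrt(normA * normB);
}

interface Candidate { id: string; features: number[]; }

// Pick the candidate element most similar to the remembered one,
// or null if nothing clears the threshold.
function bestMatch(target: number[], candidates: Candidate[], minScore = 0.8): Candidate | null {
  let best: Candidate | null = null;
  let bestScore = minScore;
  for (const c of candidates) {
    const score = cosineSimilarity(target, c.features);
    if (score >= bestScore) {
      best = c;
      bestScore = score;
    }
  }
  return best;
}
```

With this approach a renamed class or a slightly moved element only nudges the vector instead of breaking an exact selector match, at the cost of needing a threshold and a way to handle near-ties.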

jjangkke|4 months ago

is there an open source version of this in github? i think i've seen something similar.

one off putting thing about installing the extension is all the reviewers seem to be Indian and I've seen similar patterns across Google Reviews where there is a flood of reviews from Indian users and they are almost always fraud or some weird scam

not saying this is the case here but whenever I see a bunch of reviews from Indian names, it automatically makes me trust whatever service or product less just fyi.

dgfitz|4 months ago

At the morning standup: "OK team, I need all of you to post positive reviews and use your network to accomplish that as you see fit."

brianjking|4 months ago

Is this able to load for anyone?

shardullavekar|4 months ago

It's a chrome extension. Works if you use chrome.

phgn|4 months ago

Nope. Their entire website shows up with a white screen for me in the latest Chrome.

There's this error in the console: Failed to load module script: Expected a JavaScript-or-Wasm module script but the server responded with a MIME type of "text/html". Strict MIME type checking is enforced for module scripts per HTML spec.

neuroelectron|4 months ago

DOM has clearly gotten to the point where it's no longer maintainable or a net benefit to the web.

artpar|4 months ago

Yeah, like so many legacy things, unfortunately it is not going away that fast. People are still clicking on these tedious interfaces day in, day out to get all the "smaller" stuff running. Even if everyone agreed on the "One Best UI", it would take decades to convert all the existing ones before breaking a lot of flows.

unknown|4 months ago

[deleted]

58 comments