top | item 40370929

shodai80 | 1 year ago

I understand that; I assume you are tagging the node and building a basic XPath to the node/attribute from your tag ID. Understood. But how relevant is tagging a node when I have no idea what the node is actually for?

EX: Given a simple login form, I may not know whether the label is above or below the username textbox. A password box would be below it. I have a hard time understanding the relevance of tagging without that context.

Tagging is basically irrelevant to any automated task if we do not know the context. I am not trying to diminish your great work, don't get me wrong, but without context I don't see much relevance. You're doing something that is easily scripted with XPath templates, which I've done for over a decade.

awtkns | 1 year ago

This is where an LLM comes in. A typical pipeline would tag a page, transform it into a textual representation, and then pass it to an LLM, which would be able to reason about which field(s) are the ones you're looking for, much like a human.
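The tag-then-textualize step described above can be sketched roughly like this, using only Python's stdlib HTML parser. All names here are illustrative assumptions, not the API of any actual project: the idea is simply to number interactive elements and emit a compact text listing an LLM could reason over.

```python
# Hypothetical sketch: tag interactive elements in a page and emit a
# compact textual representation suitable for prompting a text-only LLM.
from html.parser import HTMLParser

INTERACTIVE = {"input", "button", "select", "textarea", "a"}

class Tagger(HTMLParser):
    def __init__(self):
        super().__init__()
        self.counter = 0      # running tag ID assigned to each element
        self.lines = []       # textual representation, one line per element

    def handle_starttag(self, tag, attrs):
        if tag in INTERACTIVE:
            a = dict(attrs)
            self.counter += 1
            # Keep only attributes that carry semantic hints for the LLM.
            desc = " ".join(
                f'{k}="{a[k]}"'
                for k in ("type", "name", "placeholder", "aria-label")
                if k in a
            )
            self.lines.append(f"[{self.counter}] <{tag}> {desc}".rstrip())

def textualize(html: str) -> str:
    t = Tagger()
    t.feed(html)
    return "\n".join(t.lines)

page = '''
<form>
  <label for="user">Username</label>
  <input type="text" name="user">
  <label for="pw">Password</label>
  <input type="password" name="pw">
</form>
'''
print(textualize(page))
# [1] <input> type="text" name="user"
# [2] <input> type="password" name="pw"
```

Note that the label text is exactly what this naive serialization drops, which is the gap the parent comment is pressing on.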

shodai80 | 1 year ago

My point still stands. How do you augment data for an LLM when you don't know the context of a page? Do you go through every element and set up the data for an associated label? Do you use div scoping via offsetParent in a script to generate the associated div (a good approach, but bad under real-life conditions)? Do you convert the DOM to JSON or some other data structure? That means little because you still don't have context; you'd have to redo it by hand every time the layout changes, and you would have to be very specific, which is a separate modeling problem as layouts are modified. What if the UI can be dynamically configured with different layout types, such as label above, label to the side, or label below?
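To make the label-association problem above concrete, here is a hedged stdlib sketch of the usual heuristic: link a label to a field via the explicit `for="..."`/`id` pairing when present, otherwise fall back to the nearest preceding label. The class and variable names are invented for illustration; the fragility under layout changes (label below or beside the field) is exactly the commenter's objection.

```python
# Hypothetical sketch: associate <label> text with form fields, first via the
# explicit for/id link, then via the nearest preceding label as a fallback.
from html.parser import HTMLParser

class LabelLinker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_label = False
        self.label_for = None
        self.label_text = []
        self.last_label = None   # (for_id, text) of the most recent label
        self.pairs = []          # (field id/name, associated label text)

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "label":
            self.in_label = True
            self.label_for = a.get("for")
            self.label_text = []
        elif tag in ("input", "select", "textarea"):
            ident = a.get("id") or a.get("name", "?")
            # Accept the last label if it targets this id, or has no target.
            if self.last_label and self.last_label[0] in (None, a.get("id")):
                self.pairs.append((ident, self.last_label[1]))
            else:
                self.pairs.append((ident, None))

    def handle_endtag(self, tag):
        if tag == "label":
            self.in_label = False
            self.last_label = (self.label_for, "".join(self.label_text).strip())

    def handle_data(self, data):
        if self.in_label:
            self.label_text.append(data)

html_src = '''
<label for="user">Username</label><input id="user" type="text">
<label>Password</label><input name="pw" type="password">
'''
p = LabelLinker()
p.feed(html_src)
print(p.pairs)
# [('user', 'Username'), ('pw', 'Password')]
```

A layout where the label follows the field would already defeat the fallback branch here, with no code change visible to the automation.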

What I am pointing out here is that even data modeling is mostly irrelevant unless you want to go through every page/permutation of a page, all the while hoping the layout isn't modified, or go back to training all over again, which is downtime. At some point you'll realize it's just better to store user-created XPaths, as it's quicker to update those than to retrain.
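The "store user-created XPaths" approach the comment prefers amounts to a hand-maintained selector registry per site: when the layout changes, a human edits one string instead of retraining anything. A minimal sketch, assuming a hypothetical registry and using the limited XPath subset in the stdlib's `xml.etree` (real automation would more likely use lxml or a browser driver):

```python
# Sketch of a hand-maintained XPath registry, keyed by site and field.
# SELECTORS is hypothetical data a user updates when a layout changes.
import xml.etree.ElementTree as ET

SELECTORS = {
    "example-login": {
        "username": ".//input[@name='user']",
        "password": ".//input[@type='password']",
    },
}

def find_field(page_xml: str, site: str, field: str):
    """Resolve a stored XPath against a parsed page."""
    root = ET.fromstring(page_xml)
    return root.find(SELECTORS[site][field])

page = """<form>
  <label>Username</label><input name="user" type="text"/>
  <label>Password</label><input name="pw" type="password"/>
</form>"""

el = find_field(page, "example-login", "password")
print(el.get("name"))
# pw
```

Updating `SELECTORS` is a one-line change with no model downtime, which is the trade-off being argued for.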

How do you reason with an LLM without going through any of the above? Automation cannot consistently have downtime for retraining; that's the antithesis of its purpose.

Let's not even get into shadow dom issues.

I am keying on your third bullet point on Github:

"How can you inform a text-only LLM about the page's visual structure?"

My questions suggest a gap in your awesome accomplishment.