top | item 40371469


KhoomeiK | 1 year ago

We run OCR on the screenshot and convert the result to whitespace-structured text, which is then passed to the LLM. The images below might make it clearer for you:

[1] https://github.com/reworkd/tarsier/blob/main/.github/assets/...

[2] https://github.com/reworkd/tarsier/blob/main/.github/assets/...
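
To make the "whitespace-structured text" idea concrete, here is a minimal sketch of one way to lay OCR words out on a character grid so that spacing and blank lines roughly mirror the screenshot. It is not Tarsier's actual code: the Word type, the to_whitespace_text function, and the char_width and line_height values are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One OCR result (hypothetical shape): text plus top-left pixel position."""
    text: str
    x: float
    y: float


def to_whitespace_text(words, char_width=10.0, line_height=20.0):
    """Place each word on a character grid that approximates the page layout.

    char_width and line_height are assumed pixel sizes of one grid cell.
    """
    # Bucket words into grid rows by their vertical position.
    rows = {}
    for w in words:
        rows.setdefault(round(w.y / line_height), []).append(w)
    if not rows:
        return ""

    lines = []
    # Emit every row between the first and last, so vertical gaps stay visible.
    for row in range(min(rows), max(rows) + 1):
        line = ""
        for w in sorted(rows.get(row, []), key=lambda w: w.x):
            col = round(w.x / char_width)
            # Keep at least one space between words that would collide.
            if line and col <= len(line):
                col = len(line) + 1
            line = line.ljust(col) + w.text
        lines.append(line)
    return "\n".join(lines)


if __name__ == "__main__":
    demo = [
        Word("[#4]", 0, 0), Word("Search", 50, 0),
        Word("[$6]", 200, 0), Word("Go", 250, 0),
        Word("[@7]", 0, 40), Word("About", 50, 40),
    ]
    print(to_whitespace_text(demo))
```

The grid approach is deliberately crude: it trades pixel accuracy for a plain-text layout that an LLM can read directly, which is the property the comment above describes.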


shodai80|1 year ago

The screenshots provided do not show text boxes, selects, or other input nodes with labels. Show me text output in which the inputs are correctly associated with their labels and I will be shocked.

KhoomeiK|1 year ago

They do show textboxes with labels. From our readme:

"Keep in mind that Tarsier tags different types of elements differently to help your LLM identify what actions are performable on each element. Specifically:

[#ID]: text-insertable fields (e.g. textarea, input with textual type)

[@ID]: hyperlinks (<a> tags)

[$ID]: other interactable elements (e.g. button, select)

[ID]: plain text (if you pass tag_text_elements=True)"

Do you see the search boxes labeled [#4] and [#5] at the top? And before you say that the tag is on a different line from the placeholder text—yes, and our agent is smart enough to handle that minor idiosyncrasy. Are you shocked? :)
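
For readers who want to see how the tag prefixes from the quoted readme could be consumed downstream, here is a minimal sketch that scans tagged page text and classifies each tag by its prefix. It uses only the Python standard library and does not call any Tarsier API; the names TAG_RE, KINDS, and find_tags are hypothetical, and it assumes IDs are plain integers, as in [#4] and [#5].

```python
import re

# Optional prefix (#, @, or $) followed by a numeric ID, all inside brackets.
# Assumption: IDs are integers, as in the [#4] and [#5] examples above.
TAG_RE = re.compile(r"\[([#@$]?)(\d+)\]")

# Prefix meanings taken from the readme excerpt quoted above.
KINDS = {
    "#": "text_input",    # text-insertable fields
    "@": "link",          # hyperlinks
    "$": "interactable",  # other interactable elements
    "": "text",           # plain text (only with tag_text_elements=True)
}


def find_tags(page_text):
    """Return (id, kind) pairs for every tag, in order of appearance."""
    return [
        (int(m.group(2)), KINDS[m.group(1)])
        for m in TAG_RE.finditer(page_text)
    ]


if __name__ == "__main__":
    sample = "[#4] Search  [@7] About  [$9] Submit  [12] Hello"
    for tag_id, kind in find_tags(sample):
        print(tag_id, kind)
```

Knowing the kind of each tag is what lets an agent decide whether an ID can be typed into, followed, or clicked, even when the tag sits on a different line from the placeholder text.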

miki123211|1 year ago

This problem isn't that hard; screen readers have had to handle this exact issue for years. Inaccessible websites where the labels aren't properly associated with their respective form fields do exist, but they aren't that common.