Using Multimodal LLMs to Understand UI Elements on Websites (qa.tech) | 11 points | daniel_mp | 1 year ago | 5 comments

patricklef | 1 year ago
MLLMs are surprisingly bad at this out of the box, and to some extent even with fine-tuning. https://jina.ai/news/the-what-and-why-of-text-image-modality...

antonoo | 1 year ago
Love that you made an interactive app to visualize how the model performs, similar to how Meta AI usually releases their models (direct link: https://qa-tech-minicpm-demo.gptengineer.run).

antonoo | 1 year ago
Did you use AI to generate this app?

while1 | 1 year ago
Loving this! Very surprising that the LLMs of today are so bad at understanding interfaces, but it also makes it a very interesting case for fine-tuning!

albinekb | 1 year ago
[deleted]