Using Multimodal LLMs to Understand UI Elements on Websites (qa.tech) | 11 points | daniel_mp | 1 year ago | 5 comments

patricklef | 1 year ago
MLLMs are surprisingly bad at this out of the box, and to some extent even with fine-tuning. https://jina.ai/news/the-what-and-why-of-text-image-modality...

antonoo | 1 year ago
Love that you made an interactive app to visualize how the model performs, similar to how Meta AI usually releases their models (direct link: https://qa-tech-minicpm-demo.gptengineer.run).

antonoo | 1 year ago
Did you use AI to generate this app?

while1 | 1 year ago
Loving this! Very surprising that the LLMs of today are so bad at understanding interfaces, but it also makes it a very interesting case for fine-tuning!

albinekb | 1 year ago
[deleted]