item 42966640 (no title)

iiJDSii | 1 year ago
What does your perception look like? Are you using raw screenshots? GUI snapshots? In some earlier experiments, I found that vision is very difficult for these, and snapshots are incomplete.

mountainriver | 1 year ago
Perception is just 1-2 screenshots. A number of recent VLMs have a lot more pretraining data on GUI interactions, which helps.

iiJDSii | 1 year ago
Such as? Are they able to recognize arbitrary GUI elements from various desktop programs, web browsers, etc.?