(no title)
Flux159 | 1 month ago
But it also gets to one of Claude's (Opus 4.5) current weaknesses - image understanding. Claude really isn't able to understand details of images in the same way that people currently can - this is also explained well with an analysis of Claude Plays Pokemon https://www.lesswrong.com/posts/u6Lacc7wx4yYkBQ3r/insights-i.... I think over the next few years we'll probably see all major LLM companies work on resolving these weaknesses & then LLMs using UIs will work significantly better (and eventually get to proper video stream understanding as well - not 'take a screenshot every 500ms' and call that video understanding).
ElatedOwl|1 month ago
I was running some sentiment analysis experiments; describe the subject and the subjects emotional state kind of thing. It picked up on a lot of little detail; the brand name of my guitar amplifier in the background, what my t shirt said and that I must enjoy craft beer and or running (it was a craft beer 5k kind of thing), and picked up on my movement through multiple frames. This was a video slicing a frame every 500ms, it noticed me flexing, giving the finger, appearing happy, angry, etc. I was really surprised how much it picked up on, and how well it connected those dots together.
Wowfunhappy|1 month ago
I can describe what is wrong with the screenshot to make Claude fix the problem, but it's not entirely clear to what extent it's using the screenshot versus my description. Any human with two brain cells wouldn't need the problems pointed out.
EMM_386|1 month ago
Are you sure about that?
Try "claude --chrome" with the CLI tool and watch what it does in the web browser.
It takes screenshots all the time to feed back into the multimodal vision and help it navigate.
It can look at the HTML or the JavaScript but Claude seems to find it "easier" to take a screenshot to find out what exactly is on the screen. Not parse the DOM.
So I don't know how Cowork does this, but there is no reason it couldn't be doing the same thing.
dalenw|1 month ago
And I do know there are ways to hide data like watermarks in images but I do not know if that would be able to poison an AI.
oracleclyde|1 month ago
re5i5tor|1 month ago
minimaxir|1 month ago
The issue is that Claude Code won't automatically Read images by default as a part of its flow: you have to very explicitly prompt it to do so. I suspect a Skill may be more useful here.
spike021|1 month ago
Occasionally it needs some poking and prodding but not to a substantial degree.
I also was able to use it to generate SVG files based on in-app design using screenshots and code that handles rendering the UI and it was able to do a decent job. Granted not the most complex of SVG but the process worked.