item 41943583

drothlis | 1 year ago

> Claude's ability to count pixels and interact with a screen using precise coordinate

I guess you mean its "Computer use" API, which (if I understand correctly) can send mouse clicks at specific coordinates?
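For context, that API works as a tool-use loop: the model returns an action like a click at an `[x, y]` coordinate, and the developer's harness executes it. A minimal sketch of the dispatch step, assuming the `{"action": "left_click", "coordinate": [x, y]}` shape from Anthropic's computer-use beta docs (`do_click` here is a hypothetical stand-in for whatever actually moves the mouse, e.g. pyautogui in a real harness):

```python
def dispatch(tool_input, do_click):
    """Route one computer-use tool input dict to the matching desktop action.

    Only left_click is handled in this sketch; a real harness would also
    cover screenshot, typing, key presses, etc.
    """
    if tool_input.get("action") == "left_click":
        x, y = tool_input["coordinate"]
        do_click(x, y)  # perform the click at the model-chosen pixel
        return (x, y)
    raise ValueError(f"unsupported action: {tool_input.get('action')!r}")

# Example: record where the click would land instead of really clicking.
clicks = []
dispatch({"action": "left_click", "coordinate": [640, 480]},
         lambda x, y: clicks.append((x, y)))
```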

I got excited thinking Claude can finally do accurate object detection, but alas no. Here's its output:

> Looking at the image directly, the SPACE key appears near the bottom left of the keyboard interface, but I cannot determine its exact pixel coordinates just by looking at the image. I can see it's positioned below the letter grid and appears wider than the regular letter keys, but I apologize - I cannot reliably extract specific pixel coordinates from just viewing the screenshot.

This is 3.5 Sonnet (their most current model).

And they explicitly call out spatial reasoning as a limitation:

> Claude’s spatial reasoning abilities are limited. It may struggle with tasks requiring precise localization or layouts, like reading an analog clock face or describing exact positions of chess pieces.

--https://docs.anthropic.com/en/docs/build-with-claude/vision#...

Since 2022 I've occasionally dipped in to test this use case with the latest models, but I haven't seen much progress on spatial reasoning. The multi-modality has been a neat addition, though.

philipbjorge | 1 year ago

They report that they trained the model to count pixels, and judging by the accurate mouse clicks coming out of it, that seems to be the case for at least some code paths.

> When a developer tasks Claude with using a piece of computer software and gives it the necessary access, Claude looks at screenshots of what’s visible to the user, then counts how many pixels vertically or horizontally it needs to move a cursor in order to click in the correct place. Training Claude to count pixels accurately was critical.
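The "counts how many pixels" step described in that quote amounts to computing the horizontal and vertical deltas between the cursor's current position and the target found in the screenshot. A toy illustration (function name and coordinate convention are mine, not Anthropic's; screen coordinates grow rightward and downward):

```python
def pixel_delta(cursor, target):
    """Return (dx, dy): pixels to move right and down to reach the target."""
    (cx, cy), (tx, ty) = cursor, target
    return tx - cx, ty - cy

# Cursor at (100, 200), target button at (640, 480):
dx, dy = pixel_delta((100, 200), (640, 480))  # → (540, 280)
```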

wintonzheng | 1 year ago

Curious: what use cases do you use to test the spatial reasoning ability of these models?