Not at the moment, since you need a local model with strong segmentation capabilities (x, y) and none exist ATM. We hope to train one in the future, and one of Cerebellum's roadmap items is to add the ability to save your sessions as a training dataset.

Jayakumark|1 year ago

Any idea how Sonnet does this? Is the image annotated with bounding boxes on text boxes etc., along with their coordinates, before it is sent to Sonnet, which then responds with a box name or a coordinate? Or is SAM2 used to segment everything before sending to Sonnet?

theredsix|1 year ago

They don't discuss this at all on their blog other than "Training Claude to count pixels accurately was critical." My speculation on how they accomplished it is either explicit tokenizer support with spatial encoding, similar to how single-digit tokenization improves math abilities, or extensive pretraining like Molmo.

digdugdirk|1 year ago

Do you not think it could work with a shim layer that handles the browser interaction via code and Selenium?

theredsix|1 year ago

Selenium works on WebDriver v4, and the screenshot is transferred as an image by the WebDriver protocol. Perhaps modifying the DOM before triggering the screenshot and then reverting the changes could work. PRs are welcome!
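
To make the "modify the DOM, screenshot, revert" idea concrete, here is a rough Python sketch. It is not how Cerebellum works today and is untested against real pages. It assumes a Selenium-style driver exposing `execute_script` and `get_screenshot_as_png`; the `data-shim-*` attribute names, the element selector, and the red outline are all made up for illustration.

```python
# JavaScript injected before the screenshot: outline interactive elements and
# tag each with a numeric label. The data-shim-* attributes are hypothetical.
ANNOTATE_JS = """
const els = document.querySelectorAll('a, button, input, select, textarea');
els.forEach((el, i) => {
  el.setAttribute('data-shim-label', String(i));
  el.setAttribute('data-shim-prev-outline', el.style.outline);
  el.style.outline = '2px solid red';
});
return els.length;
"""

# JavaScript injected after the screenshot: restore the original outline and
# remove the temporary attributes.
REVERT_JS = """
document.querySelectorAll('[data-shim-label]').forEach((el) => {
  el.style.outline = el.getAttribute('data-shim-prev-outline') || '';
  el.removeAttribute('data-shim-prev-outline');
  el.removeAttribute('data-shim-label');
});
"""


def annotated_screenshot(driver):
    """Annotate the page, capture a PNG, and always revert the DOM changes.

    `driver` is any object with Selenium-style `execute_script` and
    `get_screenshot_as_png` methods.
    Returns (png_bytes, number_of_annotated_elements).
    """
    count = driver.execute_script(ANNOTATE_JS)
    try:
        png = driver.get_screenshot_as_png()
    finally:
        # Revert even if the screenshot fails, so the page is left untouched.
        driver.execute_script(REVERT_JS)
    return png, count
```

With a real browser you would pass in something like `selenium.webdriver.Chrome()`. The `try`/`finally` is the main design point: the page should end up unmodified whether or not the capture succeeds.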