ahzhou | 1 year ago
You may also be able to get the computer use API to draw bounding boxes if the costs make sense.
That said, I think the right solution is probably to use a non-VLM model to draw the bounding boxes, though it depends on the dataset and the problem.
[1] https://www.anthropic.com/news/developing-computer-use [2] https://huggingface.co/blog/paligemma
nostrebored | 1 year ago
PaliGemma seems to fit a completely different niche right now (VQA and segmentation), one that I don't really see having practical applications for computer use.
[1] https://huggingface.co/microsoft/OmniParser?language=python [2] https://github.com/browser-use/browser-use