lewisl9029 | 1 year ago
All the models I've tried (Sonnet 3.5, GPT 4o, Llama 3.2, Qwen2 VL) have been pretty good at extracting text, but they failed miserably at finding bounding boxes, usually just making up random coordinates. I thought this might have been due to internal resizing of images, so I tried to get them to use relative, percentage-based coordinates, but no luck there either.
Eventually gave up and went back to good old PP-OCR models (are these still state of the art? would love to try out some better ones). The actual extraction feels a bit less accurate than the best LLMs, but bounding box detection is pretty much spot on all the time, and it's literally several orders of magnitude more efficient in terms of memory and overall energy use.
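For reference, PP-OCR-style detectors typically return each detected line as a four-point quadrilateral plus a (text, confidence) pair. The sketch below uses a hypothetical sample in that general shape (the data and the `quad_to_xyxy` helper are illustrative, not PaddleOCR's actual API) to show how those quads collapse into the axis-aligned boxes most downstream code wants:

```python
# Hypothetical sample in the usual PP-OCR output shape:
# one entry per detected line: [quad, (text, confidence)],
# where quad is four (x, y) corner points in pixel space.
sample_result = [
    [[[12, 8], [204, 8], [204, 36], [12, 36]], ("Invoice #1042", 0.98)],
    [[[12, 50], [150, 52], [150, 78], [12, 76]], ("Total: $99.00", 0.95)],
]

def quad_to_xyxy(quad):
    """Collapse a 4-point quad to an axis-aligned (x0, y0, x1, y1) box."""
    xs = [p[0] for p in quad]
    ys = [p[1] for p in quad]
    return (min(xs), min(ys), max(xs), max(ys))

boxes = [(text, quad_to_xyxy(quad)) for quad, (text, _conf) in sample_result]
```

Note the second quad is slightly skewed; taking min/max over the corners still yields a usable axis-aligned box.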
My conclusion was that current gen models still just aren't capable enough yet, but I can't help but feel like I might be missing something. How the heck did Anthropic and OpenAI manage to build computer use if their models can't give them accurate coordinates of objects in screenshots?
ahzhou|1 year ago
You may also be able to get the computer use API to draw bounding boxes if the costs make sense.
That said, I think the correct solution is likely to use a non-VLM to draw bounding boxes. Depends on the dataset and problem.
[1] https://www.anthropic.com/news/developing-computer-use [2] https://huggingface.co/blog/paligemma
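One detail worth noting about computer-use setups: the model typically sees a downscaled screenshot, so any coordinates it emits have to be mapped back to the native display before clicking. A minimal sketch of that mapping (the resolutions below are illustrative assumptions, not values from Anthropic's docs):

```python
def scale_coords(x, y, src_size, dst_size):
    """Map a point from the screenshot's resolution to the display's.

    src_size and dst_size are (width, height) tuples.
    """
    sx, sy = src_size
    dx, dy = dst_size
    return (round(x * dx / sx), round(y * dy / sy))

# e.g. the model clicks at (512, 384) on a 1024x768 screenshot,
# while the native display is 2560x1440:
native = scale_coords(512, 384, (1024, 768), (2560, 1440))  # -> (1280, 720)
```

The same scaling applied in reverse lets you feed the model ground-truth boxes in screenshot space.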
nostrebored|1 year ago
PaliGemma seems to fit into a completely different niche right now (VQA and Segmentation) that I don't really see having practical applications for computer use.
[1] https://huggingface.co/microsoft/OmniParser?language=python [2] https://github.com/browser-use/browser-use
HanClinto|1 year ago
There are some VLMs that seem to be specifically trained to do bounding box detection (Moondream comes to mind as one that advertises this?), but in general I wouldn't be surprised if none of them work as well as traditional methods.
DougBTX|1 year ago
https://simonwillison.net/2024/Aug/26/gemini-bounding-box-vi...
whiplash451|1 year ago
2. BigTech and SmallTech train their fancy bounding box / detection models on large datasets that have been built using classical detectors and a ton of manual curation
bob1029|1 year ago
This makes sense to me. These LLMs likely have no statistics about the spatial relationships of tokens in a 2D raster space.
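The point about lost 2D structure can be made concrete: a ViT-style tokenizer slices the image into fixed-size patches and flattens the grid row-major into a 1D sequence, so a patch's horizontal neighbor is the adjacent token but its vertical neighbor is a full row away. A minimal sketch (patch and image sizes are illustrative assumptions):

```python
def patch_index(row, col, cols_per_row):
    """Row-major position of patch (row, col) in the flattened sequence."""
    return row * cols_per_row + col

# A 224x224 image with 16x16 patches gives a 14x14 grid of patches:
cols = 224 // 16  # 14 patches per row
a = patch_index(3, 5, cols)
right = patch_index(3, 6, cols)  # horizontal neighbor: adjacent token
below = patch_index(4, 5, cols)  # vertical neighbor: a whole row (14 tokens) away
```

Whatever spatial awareness survives has to come from position embeddings rather than token adjacency, which may help explain why pixel-accurate localization is hard for these models.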
nostrebored|1 year ago
[1] https://huggingface.co/osunlp/UGround-V1-7B?language=python
nyrikki|1 year ago
That is one of the implications of transformers being DLOGTIME-uniform TC0: they don't have access to counter analogs.
You would need to move to log-depth circuits, add mod-p_n gates, etc., unless someone finds some new mathematics.
Proposition 6.14 in Immerman is where this is lost if you want a cite.
It will be counterintuitive that division is in TC0, but (general) counting is not.