top | item 42813043

lewisl9029 | 1 year ago

I had a somewhat similar experience trying to use LLMs to do OCR.

All the models I've tried (Sonnet 3.5, GPT 4o, Llama 3.2, Qwen2 VL) have been pretty good at extracting text, but they failed miserably at finding bounding boxes, usually just making up random coordinates. I thought this might have been due to internal resizing of images, so I tried to get them to use relative %-based coordinates, but no luck there either.

Eventually I gave up and went back to good old PP-OCR models (are these still state of the art? would love to try out some better ones). The actual extraction feels a bit less accurate than the best LLMs, but bounding box detection is pretty much spot on all the time, and it's literally several orders of magnitude more efficient in terms of memory and overall energy use.

My conclusion was that current gen models still just aren't capable enough yet, but I can't help but feel like I might be missing something. How the heck did Anthropic and OpenAI manage to build computer use if their models can't give them accurate coordinates of objects in screenshots?
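For what it's worth, the relative-coordinate experiment is easy to reproduce mechanically; here's a minimal sketch of the conversion I was asking the models to do (function name is just illustrative):

```python
def rel_to_pixel(box, width, height):
    """Convert a (x0, y0, x1, y1) box in 0-100 percent units
    to integer pixel coordinates for an image of the given size."""
    x0, y0, x1, y1 = box
    return (
        round(x0 / 100 * width),
        round(y0 / 100 * height),
        round(x1 / 100 * width),
        round(y1 / 100 * height),
    )

# A model-reported box covering the left half of a 1920x1080 screenshot:
print(rel_to_pixel((0, 0, 50, 100), 1920, 1080))  # (0, 0, 960, 1080)
```

The math is trivial on our end; the failure is that the percentages the models report don't correspond to anything in the image.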

ahzhou|1 year ago

LLMs are inherently bad at this due to tokenization, scaling, and lack of training on the task. Anthropic's computer use feature has a specialized model for pixel-counting:

> Training Claude to count pixels accurately was critical. Without this skill, the model finds it difficult to give mouse commands. [1]

For a VLM trained on identifying bounding boxes, check out PaliGemma. [2]

You may also be able to get the computer use API to draw bounding boxes if the costs make sense.

That said, I think the correct solution is likely to use a non-VLM to draw bounding boxes. Depends on the dataset and problem.

1. https://www.anthropic.com/news/developing-computer-use 2. https://huggingface.co/blog/paligemma

nostrebored|1 year ago

PaliGemma on computer use data is absolutely not good. The difference between a FT YOLO model and a FT PaliGemma model is huge if generic bboxes are what you need. Microsoft's OmniParser also winds up using a YOLO backbone [1]. All of the browser use tools (like our friends at browser-use [2]) wind up trying to get a generic set of bboxes using the DOM and then applying generative models.

PaliGemma seems to fit into a completely different niche right now (VQA and Segmentation) that I don't really see having practical applications for computer use.

[1] https://huggingface.co/microsoft/OmniParser?language=python [2] https://github.com/browser-use/browser-use

HanClinto|1 year ago

Maybe still worth it to separate the tasks, and use a traditional text detection model to find bounding boxes, then crop the images. In a second stage, send those cropped samples to the higher-power LLMs to do the actual text extraction, and don't worry about them for bounding boxes at all.

There are some VLMs that seem to be specifically trained to do bounding box detection (Moondream comes to mind as one that advertises this?), but in general I wouldn't be surprised if none of them work as well as traditional methods.
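A rough sketch of that two-stage idea, where `detect` and `read_text` are placeholders for whatever traditional detector and LLM recognizer you pick (the padding helper is the only concrete part):

```python
def pad_box(box, pad, width, height):
    """Expand a detector box (x0, y0, x1, y1) by `pad` pixels on each
    side, clamped to the image bounds, before cropping for recognition."""
    x0, y0, x1, y1 = box
    return (max(0, x0 - pad), max(0, y0 - pad),
            min(width, x1 + pad), min(height, y1 + pad))

def ocr_two_stage(image, detect, read_text, pad=4):
    """Stage 1: traditional detector finds boxes.
    Stage 2: a higher-power model reads each padded crop.
    `image` is assumed PIL-like (has .size and .crop)."""
    width, height = image.size
    results = []
    for box in detect(image):
        crop = image.crop(pad_box(box, pad, width, height))
        results.append((box, read_text(crop)))
    return results
```

A little padding around each detected box tends to help recognizers, since tight detector boxes often clip ascenders and descenders.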

parsakhaz|1 year ago

We've run a couple of experiments and found that our open vision language model, Moondream, works better than YOLOv11 in general cases. If accuracy matters most, it's worth trying our vision language model. If you need real-time results, you can train YOLO models using data from ours. We have a video redaction space on our Hugging Face that is just object detection, and an online playground to try it out.
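If you go the distill-to-YOLO route, the label conversion is just normalization; a sketch, assuming your detector emits pixel-space (x0, y0, x1, y1) boxes:

```python
def to_yolo_label(box, class_id, width, height):
    """Convert a pixel-space (x0, y0, x1, y1) detection into a YOLO
    training label line: class id, then normalized center-x, center-y,
    width, and height."""
    x0, y0, x1, y1 = box
    cx = (x0 + x1) / 2 / width
    cy = (y0 + y1) / 2 / height
    w = (x1 - x0) / width
    h = (y1 - y0) / height
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

print(to_yolo_label((100, 200, 300, 400), 0, 1000, 1000))
# 0 0.200000 0.300000 0.200000 0.200000
```

One line per box, one `.txt` file per image, is the standard YOLO dataset layout.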

jonnycoder|1 year ago

I am doing OCR on hundreds of PDFs using AWS Textract. It requires me to convert each page of the PDF to an image and then analyze the image, and it works well for converting to markdown format (which requires custom code). I want to try some vision models and compare how they do, for example Phi-3.5-vision-instruct.
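The custom markdown step can start very simple; a rough sketch that only keeps Textract's LINE blocks in top-to-bottom order (real layout reconstruction for columns, tables, and headings needs much more logic):

```python
def lines_to_markdown(response):
    """Flatten a Textract response to markdown-ish text by taking
    LINE blocks sorted by their bounding-box Top coordinate."""
    lines = [b for b in response["Blocks"] if b["BlockType"] == "LINE"]
    lines.sort(key=lambda b: b["Geometry"]["BoundingBox"]["Top"])
    return "\n\n".join(b["Text"] for b in lines)
```

Textract's `Blocks` list really does carry `BlockType`, `Text`, and `Geometry.BoundingBox` fields, but the sorting heuristic here is a simplification that breaks on multi-column pages.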

whiplash451|1 year ago

1. You need to look into the OCR-specific DL literature (e.g. UDOP) or segmentation models (e.g. Segment Anything)

2. BigTech and SmallTech train their fancy bounding box / detection models on large datasets that have been built using classical detectors and a ton of manual curation

bob1029|1 year ago

> they failed miserably at finding bounding boxes, usually just making up random coordinates.

This makes sense to me. These LLMs likely have no statistics about the spatial relationships of tokens in a 2D raster space.

KTibow|1 year ago

Gemini 2 can purportedly do this, you can test it with the Spatial Understanding Starter App inside AI Studio. Only caveat is that it's not production ready yet.

owkman|1 year ago

I think people have had success using PaliGemma for this. The computer-use products probably rely on versions of LLMs fine-tuned for their use cases rather than the base models.

aaronharnly|1 year ago

Relatedly, we find LLM vision models absolutely atrocious at counting things. We build school curricula, and one basic task for our activities is counting – blocks, pictures of ducks, segments in a chart, whatever. Current models can't reliably count four or five squares in an image.
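For rigid shapes on clean backgrounds, classical connected-component counting is still the reliable baseline; a sketch on an already-binarized grid (a real image would need thresholding first):

```python
def count_regions(grid):
    """Count connected regions of 1s in a binary grid (4-connectivity)
    using an iterative flood fill."""
    seen = set()
    count = 0
    for r, row in enumerate(grid):
        for c, val in enumerate(row):
            if val and (r, c) not in seen:
                count += 1  # found a new, unvisited region
                stack = [(r, c)]
                while stack:
                    y, x = stack.pop()
                    if (y, x) in seen:
                        continue
                    seen.add((y, x))
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        if (0 <= ny < len(grid) and 0 <= nx < len(grid[0])
                                and grid[ny][nx] and (ny, nx) not in seen):
                            stack.append((ny, nx))
    return count

# Four separate 2x2 squares:
grid = [
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
]
print(count_regions(grid))  # 4
```

Deterministic, exact, and essentially free compared to a VLM call – which makes the LLM failures at this task all the more striking.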

nyrikki|1 year ago

IMHO, that is expected, at least for the general case.

That is one of the implications of transformers being DLOGTIME-uniform TC0, they don't have access to counter analogs.

You would need to move to log depth circuits, add mod-p_n gates etc... unless someone finds some new mathematics.

Proposition 6.14 in Immerman is where this is lost if you want a cite.

It will be counterintuitive that division is in TC0, but (general) counting is not.

prettyblocks|1 year ago

Have you played with Moondream? Pretty cool small vision model that did a good job with bounding boxes when I tried it.

parsakhaz|1 year ago

Thanks for the shout out :)

vonneumannstan|1 year ago

Yeah I really struggle when I use my hammer to screw pieces of wood together too.