top | item 43187919


gfiorav | 1 year ago

I wonder how the speed of this approach compares with traditional OCR techniques. Also, I'm curious whether this could be used for text detection (finding a bounding box that contains text within an image).


vunderba|1 year ago

Was just coming here to say this: as far as I know, there does not yet exist a multimodal vision LLM approach that is capable of identifying the bounding boxes where the text occurs. I suppose you could manually cut the image up and send each part separately to the LLM, but that feels like a kludge and it's still inexact.
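For what it's worth, the "cut the image up" workaround could look roughly like the sketch below. It only computes the tile boxes; `tile_boxes` is a made-up helper name, and the grid size is an arbitrary assumption. Each box is in the (left, top, right, bottom) form that Pillow's `Image.crop` accepts, so each crop could then be sent to the model on its own.

```python
def tile_boxes(width, height, rows, cols):
    """Split a width x height image into a rows x cols grid.

    Hypothetical helper for illustration. Returns pixel boxes as
    (left, top, right, bottom) tuples in row-major order.
    """
    boxes = []
    for r in range(rows):
        # Integer arithmetic keeps tiles contiguous and covers the
        # full image even when the size is not evenly divisible.
        top = r * height // rows
        bottom = (r + 1) * height // rows
        for c in range(cols):
            left = c * width // cols
            right = (c + 1) * width // cols
            boxes.append((left, top, right, bottom))
    return boxes


if __name__ == "__main__":
    # A 100x50 image cut into a 2x2 grid.
    for box in tile_boxes(100, 50, 2, 2):
        print(box)
```

If the model reports text in a given tile, that tile's box is the best location you get, which is why this stays inexact: the precision is capped by the grid size, and text that straddles a tile boundary gets split across requests.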

EarlyOom|1 year ago

We can do bounding boxes too :) We just call it visual grounding: https://github.com/vlm-run/vlmrun-cookbook/blob/main/noteboo...

chpatrick|1 year ago

Qwen 2.5 VL was specifically trained to produce bounding boxes, I believe.