top | item 43187919

gfiorav | 1 year ago

I wonder how the speed of this approach compares with traditional OCR techniques. Also curious whether this could be used for text detection (finding a bounding box that contains text within an image).

vunderba|1 year ago

Was just coming here to say this. There does not yet exist a multimodal vision LLM approach that is capable of identifying the bounding boxes where text occurs. I suppose you could manually cut the image up and send each part separately to the LLM, but that feels like a kludge and it's still inexact.
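
A rough sketch of that cut-it-up workaround, to show why it stays inexact. It only covers the grid bookkeeping: `tile_boxes` and `coarse_text_boxes` are hypothetical helper names, and the per-tile "does this contain text?" answers are assumed to come from whatever model you send the crops to (no real model API is called here).

```python
from typing import List, Sequence, Tuple

# (left, top, right, bottom) in pixels, with right/bottom exclusive
Box = Tuple[int, int, int, int]


def tile_boxes(width: int, height: int, rows: int, cols: int) -> List[Box]:
    """Split a width x height image into a rows x cols grid of boxes."""
    boxes: List[Box] = []
    for r in range(rows):
        top = r * height // rows
        bottom = (r + 1) * height // rows
        for c in range(cols):
            left = c * width // cols
            right = (c + 1) * width // cols
            boxes.append((left, top, right, bottom))
    return boxes


def coarse_text_boxes(boxes: Sequence[Box], has_text: Sequence[bool]) -> List[Box]:
    """Keep the tiles that were flagged as containing text.

    has_text is assumed to hold one yes/no answer per tile, in the
    same order as boxes (for example, one model call per cropped tile).
    """
    return [box for box, flag in zip(boxes, has_text) if flag]


boxes = tile_boxes(100, 50, rows=2, cols=2)
# Pretend the model said only the top-right tile contains text.
print(coarse_text_boxes(boxes, [False, True, False, False]))
```

The boxes you get back are never tighter than one grid cell, and text that straddles a tile boundary gets split across calls, so a finer grid trades more requests for only slightly better localization.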

chpatrick|1 year ago

Qwen 2.5 VL was specifically trained to produce bounding boxes, I believe.