top | item 44641092

Adityav369 | 7 months ago

You can ask the model to describe the image, but that is inherently lossy. What if it's a chart and the model captures most (x, y) pairs, but the user asks about an "x" or "y" value the description missed? Presenting the image at inference time is effective because you're guaranteeing the LLM can answer exactly the user's question. The only blocker then is retrieval quality, and that's a smaller problem to solve. This approach lets us solve only for passing in relevant context; the LLM takes care of the rest. Otherwise the problem space expands to correct OCR, parsing, and coaxing every possible description of an image out of the model.
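The lossiness argument can be sketched with toy data (everything here is hypothetical: the chart values, the pre-computed description, and the two lookup helpers stand in for "ahead-of-time description" vs. "image at inference"; no real LLM or retrieval calls are made):

```python
import re

# Hypothetical chart: year -> revenue, as (x, y) pairs.
chart_data = {2019: 110, 2020: 95, 2021: 130, 2022: 150}

# Lossy path: a description generated ahead of time happens to omit 2020.
description = "Revenue was 110 in 2019, 130 in 2021, and 150 in 2022."

def answer_from_description(desc: str, year: int):
    # Can only answer if the value survived the summarization step.
    m = re.search(rf"(\d+) in {year}", desc)
    return int(m.group(1)) if m else None

def answer_from_source(data: dict, year: int):
    # Presenting the source at inference: every pair is still available.
    return data.get(year)

print(answer_from_description(description, 2020))  # None: lost in the description
print(answer_from_source(chart_data, 2020))        # 95: recoverable from the source
```

The same gap applies to any pre-computed representation (OCR output, captions): whatever the offline step drops is unanswerable later, whereas retrieval quality is the only failure mode when the image itself reaches the model.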
