Unless the only thing you want is a description of the image, then the real answer is NO. You can get an LLM to do something like "If you encounter an image that is not easily convertable to standard markdown, insert a [[DESCRIPTION OF IMAGE]] here." placeholder, but at that point you've lost information that may be salient to the original PDF.
The reason is because these multimodal LLMs can give you descriptions/OCR/etc., but they cannot give you quantifiable information related to placement.
So if there was a picture of a tiger in the middle of the page converted to a bitmap, you couldn't get the LLM to give you something like this: "Image detected at pixel position (120, 200) - (240, 500)." - because that's really want you want.
You almost need segmentation system middleware that the LLM can forward to which can cut out these images to use in markdown syntax:
vunderba|1 year ago
The reason is because these multimodal LLMs can give you descriptions/OCR/etc., but they cannot give you quantifiable information related to placement.
So if there was a picture of a tiger in the middle of the page converted to a bitmap, you couldn't get the LLM to give you something like this: "Image detected at pixel position (120, 200) - (240, 500)." - because that's really want you want.
You almost need segmentation system middleware that the LLM can forward to which can cut out these images to use in markdown syntax:
unknown|1 year ago
[deleted]
yigitkonur35|1 year ago