pierre | 1 year ago
The hard part is preventing the model from ignoring parts of the page and from hallucinating (see some of the GPT-4o samples here, like the Xanax notice: https://www.llamaindex.ai/blog/introducing-llamaparse-premiu...)
However, these models will keep improving, and we may soon have a good PDF-to-Markdown model.
fzysingularity|1 year ago
- VLMs are way better at handling layout and context where OCR systems fail miserably
- VLMs read documents like humans do, which makes dealing with special layouts like bullets, tables, charts, footnotes much more tractable with a singular approach rather than have to special case a whole bunch of OCR + post-processing
- VLMs are definitely more expensive, but can be specialized and distilled for accurate and cost effective inference
In general, I think vision + LLMs can be trained explicitly to "extract" information and avoid reasoning/hallucinating about the text. The reasoning can be another module altogether.
authorfly|1 year ago
If your old-school OCR output contains text that is absent from the VLM output but is coherent (e.g. English sentences), you could recover it and slot it into the missing spot in the VLM output.
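A minimal sketch of that recovery idea, using Python's difflib to align the two outputs word-by-word and keep OCR segments the VLM dropped. The coherence heuristic (minimum word count plus mostly-alphabetic tokens) is my own assumption, not something from the thread:

```python
import difflib

def recover_missing_text(ocr_text: str, vlm_text: str, min_words: int = 3) -> list[str]:
    """Return coherent-looking OCR segments that are absent from the VLM output.

    Coherence heuristic (an assumption for illustration): a segment counts as
    coherent if it has at least `min_words` words and is mostly alphabetic,
    which filters out typical OCR noise like 'l1|!.'.
    """
    ocr_words = ocr_text.split()
    vlm_words = vlm_text.split()
    matcher = difflib.SequenceMatcher(a=ocr_words, b=vlm_words, autojunk=False)
    missing = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):  # words present in OCR but not in VLM output
            segment = ocr_words[i1:i2]
            alpha = sum(w.isalpha() for w in segment)
            if len(segment) >= min_words and alpha / len(segment) > 0.7:
                missing.append(" ".join(segment))
    return missing

# Toy example: the VLM skipped the dosage warning that OCR caught.
ocr = "Take one tablet daily with food Do not exceed the stated dose"
vlm = "Take one tablet daily with food"
print(recover_missing_text(ocr, vlm))
```

Deciding *where* to slot each recovered segment back in is the harder part; the opcode indices give an approximate anchor position in the VLM text.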