reerdna | 1 year ago

For use in retrieval/RAG, an emerging paradigm is not to parse the PDF at all.

Using a multi-modal foundation model, you convert visual representations ("screenshots") of the PDF directly into searchable vector representations.

Paper: Efficient Document Retrieval with Vision Language Models - https://arxiv.org/abs/2407.01449

Vespa.ai blog post https://blog.vespa.ai/retrieval-with-vision-language-models-... (my day job)
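The core of the linked paper (ColPali) is late-interaction scoring: a query's token embeddings are matched against a page screenshot's patch embeddings, taking the best-matching patch per query token and summing. A minimal sketch with NumPy, using random vectors as stand-ins for the model's actual output:

```python
import numpy as np

def maxsim_score(query_emb, page_emb):
    """ColBERT/ColPali-style MaxSim relevance score.

    query_emb: (num_query_tokens, dim) query token embeddings
    page_emb:  (num_page_patches, dim) patch embeddings of one page screenshot
    """
    sims = query_emb @ page_emb.T   # (tokens, patches) dot-product similarities
    return sims.max(axis=1).sum()   # best patch per token, summed over tokens

# Toy ranking example (random vectors standing in for real embeddings):
rng = np.random.default_rng(0)
query = rng.normal(size=(8, 128))                       # 8 query tokens
pages = [rng.normal(size=(1024, 128)) for _ in range(3)]  # 3 page screenshots
best_page = max(range(3), key=lambda i: maxsim_score(query, pages[i]))
```

Retrieval then reduces to ranking pages by this score; the expensive part is producing the patch embeddings with the vision-language model, which is done once per page at indexing time.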

attilakun | 1 year ago

I do something similar in my file-renamer app (sort.photos if you want to check it out):

1. Render first 2 pages of PDF into a JPEG offline in the Mac app.

2. Upload JPEG to ChatGPT Vision and ask what would be a good file name for this.

It works surprisingly well.
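A sketch of what step 2 might look like with the OpenAI Python SDK, passing the rendered JPEG as a base64 data URL (the model name and prompt here are illustrative, not the app's actual ones):

```python
import base64

def build_vision_request(jpeg_bytes, model="gpt-4o"):
    """Build a chat.completions payload asking a vision model to name a file."""
    # Vision inputs can be sent inline as a base64 data URL.
    data_url = "data:image/jpeg;base64," + base64.b64encode(jpeg_bytes).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Suggest a short, descriptive file name for this "
                         "document. Reply with the file name only."},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
    }

# The actual call needs an API key; sketched here rather than executed:
# from openai import OpenAI
# reply = OpenAI().chat.completions.create(**build_vision_request(jpeg_bytes))
# suggested_name = reply.choices[0].message.content.strip()
```

Rendering the first pages to JPEG can be done offline with any PDF rasterizer (e.g. pdf2image or pypdfium2) before building the request.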

qeternity | 1 year ago

I'm sure this will change over time, but I have yet to see an LMM that performs (on average) as well as decent text extraction pipelines.

Text embeddings for text also have much better recall in my tests.

infecto | 1 year ago

No multi-modal model is ready for that in reality. Dedicated tools extract tables and text with far superior accuracy.

authorfly | 1 year ago

You have detractors, but this is the future.

cpursley | 1 year ago

Is anyone actually having success with this approach? If so, how and with what models (and prompts)?

distracted_boy | 1 year ago

Claude.ai handles tables very well, at least in my tests. It could easily convert a table from a financial document into a markdown table, among other things.
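A sketch of that table-to-Markdown workflow via the Anthropic Messages API, which accepts images as base64 content blocks (the model name and prompt are illustrative assumptions):

```python
import base64

def build_table_request(png_bytes, model="claude-3-5-sonnet-latest"):
    """Build a Messages API payload asking Claude to transcribe a table image."""
    return {
        "model": model,
        "max_tokens": 1024,
        "messages": [{
            "role": "user",
            "content": [
                # Images are sent as base64 source blocks with a media type.
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": "image/png",
                            "data": base64.b64encode(png_bytes).decode()}},
                {"type": "text",
                 "text": "Convert the table in this image to a Markdown table."},
            ],
        }],
    }

# The actual call needs an API key; sketched here rather than executed:
# import anthropic
# msg = anthropic.Anthropic().messages.create(**build_table_request(png_bytes))
# markdown_table = msg.content[0].text
```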