(no title)
reerdna | 1 year ago
By using a multi-modal foundation model, you convert visual representations ("screenshots") of the pdf directly into searchable vector representations.
Paper: Efficient Document Retrieval with Vision Language Models - https://arxiv.org/abs/2407.01449
Vespa.ai blog post https://blog.vespa.ai/retrieval-with-vision-language-models-... (my day job)
attilakun|1 year ago
1. Render first 2 pages of PDF into a JPEG offline in the Mac app.
2. Upload JPEG to ChatGPT Vision and ask what would be a good file name for this.
It works surprisingly well.
qeternity|1 year ago
Text embeddings for text also have much better recall in my tests.
infecto|1 year ago
authorfly|1 year ago
cpursley|1 year ago
distracted_boy|1 year ago