freezed8 | 7 months ago
1. ColPali is great for retrieval, but you can't use ColPali (at least natively) for pure document parsing tasks. There are a lot of separate benchmarks for evaluating doc parsing alone, while the author mostly talks about visual retrieval benchmarks.
2. This whole idea of "You can DIY document parsing by screenshotting a page" is not new at all; lots of people have been talking about it! It's certainly fine as a baseline and does work better than standard OCR in many cases.
   a. But from our experience there's still a long tail of accuracy issues.
   b. It's missing metadata like confidence scores, bounding boxes, etc. out of the box.
   c. Honestly this is underrated, but creating a good screenshotting pipeline itself is non-trivial.
3. In general for retrieval, it's helpful to have both text and image representations. Image tokens are obviously much more powerful. Text tokens are way cheaper to store and let you do things like retrieve entire documents (instead of just chunks) and feed them to the LLM.
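
To make the "retrieve entire documents instead of just chunks" point concrete, here is a rough sketch, not how any particular product does it. Everything in it is hypothetical: the `CHUNKS` corpus, the `score` function (a word-overlap stand-in for whatever text or image retriever you actually use), and `retrieve_documents`. The idea is just that if you keep cheap text per chunk plus a parent document id, a chunk-level hit can be expanded to the whole document's text before it goes to the LLM.

```python
from collections import defaultdict

# Hypothetical toy corpus: each chunk keeps its parent document id and its text.
CHUNKS = [
    {"doc_id": "invoice", "text": "Total amount due: 420 USD"},
    {"doc_id": "invoice", "text": "Payment terms: net 30 days"},
    {"doc_id": "manual", "text": "Press the reset button for five seconds"},
]


def score(query, text):
    """Toy relevance score: count of query words present in the chunk text.

    This is only a placeholder for a real retriever (text or image embeddings).
    """
    words = set(text.lower().split())
    return sum(1 for w in query.lower().split() if w in words)


def retrieve_documents(query, chunks, top_k=1):
    """Rank chunks, then expand each top hit to its parent document's full text."""
    # Group the stored chunk text by parent document, preserving chunk order.
    full_text = defaultdict(list)
    for chunk in chunks:
        full_text[chunk["doc_id"]].append(chunk["text"])

    # Chunk-level retrieval step.
    ranked = sorted(chunks, key=lambda c: score(query, c["text"]), reverse=True)

    # Deduplicate parent documents while keeping rank order.
    doc_ids = []
    for chunk in ranked[:top_k]:
        if chunk["doc_id"] not in doc_ids:
            doc_ids.append(chunk["doc_id"])

    # Whole-document text, ready to be placed in the LLM prompt.
    return {doc_id: "\n".join(full_text[doc_id]) for doc_id in doc_ids}


result = retrieve_documents("payment terms", CHUNKS)
```

Here the query only matches one chunk of the invoice, but `result` carries both invoice chunks, which is the part that gets expensive if all you store is image tokens.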
(disclaimer: I am CEO of LlamaIndex, and we have worked on both document parsing and retrieval with LlamaCloud, but I hope my point stands in a general sense)