Show HN: Vision-Based, Vectorless RAG for Long Documents
6 points | page_index | 4 months ago | github.com
Traditional OCR systems typically use a two-stage process that first detects the layout of a PDF — dividing it into text, tables, and images — and then recognizes and converts these elements into plain text. With the rise of vision-language models (VLMs) (such as Qwen-VL and GPT-4.1), new end-to-end OCR models like DeepSeek-OCR have emerged. These models jointly understand visual and textual information, enabling direct interpretation of PDFs without an explicit layout detection step.
However, this paradigm shift raises an important question:
> If a VLM can already process both the document images and the query to produce an answer directly, do we still need the intermediate OCR step?
We built a practical implementation of a vision-based question-answering system for long documents that does not rely on OCR. Specifically, we adopt a reasoning-based retrieval layer (no vector index) and the multimodal GPT-4.1 as the VLM for visual reasoning and answer generation.
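The two-stage idea can be sketched as follows. This is a minimal illustration, not the repo's actual code: the `llm` callable, the page-summary input, and all function names are assumptions. Stage one asks the model to reason over the query plus per-page summaries and name the pages it needs (replacing a vector index); stage two sends only those page images to the VLM to generate the answer.

```python
import re
from typing import Callable, List


def build_retrieval_prompt(query: str, page_summaries: List[str]) -> str:
    """Prompt asking the model to pick relevant pages by reasoning, not embeddings."""
    lines = [f"Page {i + 1}: {s}" for i, s in enumerate(page_summaries)]
    return (
        "Given the query and one-line page summaries below, reply with the "
        "numbers of the pages needed to answer the query.\n"
        f"Query: {query}\n" + "\n".join(lines)
    )


def parse_page_numbers(reply: str, n_pages: int) -> List[int]:
    """Extract in-range page numbers from the model's free-text reply."""
    nums = {int(m) for m in re.findall(r"\d+", reply)}
    return sorted(n for n in nums if 1 <= n <= n_pages)


def answer_query(
    query: str,
    page_images: List[bytes],
    page_summaries: List[str],
    llm: Callable[[str, List[bytes]], str],
) -> str:
    """Vectorless, vision-based QA: retrieve pages by reasoning, then answer from images.

    `llm(prompt, images)` is a hypothetical adapter around a multimodal model
    (e.g., GPT-4.1 via an API that accepts image attachments).
    """
    # Stage 1: reasoning-based retrieval -- no OCR text, no vector store.
    reply = llm(build_retrieval_prompt(query, page_summaries), [])
    pages = parse_page_numbers(reply, len(page_images)) or [1]
    # Stage 2: pass only the selected page images to the VLM for the answer.
    selected = [page_images[p - 1] for p in pages]
    return llm(f"Answer the query using the attached pages.\nQuery: {query}", selected)
```

Keeping the model call behind a plain callable makes the retrieval logic easy to test with a stub and easy to swap between VLM providers.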