Traditional OCR systems typically use a two-stage process that first detects the layout of a PDF — dividing it into text, tables, and images — and then recognizes and converts these elements into plain text. With the rise of vision-language models (VLMs) (such as Qwen-VL and GPT-4.1), new end-to-end OCR models like DeepSeek-OCR have emerged. These models jointly understand visual and textual information, enabling direct interpretation of PDFs without an explicit layout detection step.
However, this paradigm shift raises an important question:
> If a VLM can already process both the document images and the query to produce an answer directly, do we still need the intermediate OCR step?
We build a practical, vectorless, vision-based question-answering system for long documents, without relying on OCR. Specifically, we adopt a vectorless, reasoning-based retrieval layer and the multimodal GPT-4.1 as the VLM for visual reasoning and answer generation.
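As a rough illustration of this two-call pattern, the sketch below shows one way a vectorless, reasoning-based retrieval layer could sit in front of a VLM: the model is first asked which page images are relevant to the question, and only those pages are passed to a second call that produces the answer. This is an assumption-laden sketch, not the post's actual implementation; `vlm` stands in for any callable (e.g. a GPT-4.1 client wrapper) that takes a text prompt plus a list of page images and returns text.

```python
# Sketch only: `vlm` is a hypothetical callable (prompt, page_images) -> str,
# e.g. a thin wrapper around a multimodal model API. Page images are opaque
# objects here (bytes, PIL images, etc.).

def retrieve_pages(vlm, page_images, question, top_k=3):
    """Reasoning-based retrieval: ask the VLM which pages can answer the question."""
    prompt = (
        f"Question: {question}\n"
        f"Reply only with the 0-based indices (comma-separated) of the "
        f"{top_k} attached pages most likely to contain the answer."
    )
    reply = vlm(prompt, page_images)
    # Parse indices out of the model's free-text reply.
    indices = [int(tok) for tok in reply.replace(",", " ").split() if tok.isdigit()]
    return indices[:top_k]

def answer(vlm, page_images, question, top_k=3):
    """Two VLM calls: select relevant pages, then answer from only those pages."""
    selected = retrieve_pages(vlm, page_images, question, top_k)
    prompt = f"Using only the attached pages, answer: {question}"
    return vlm(prompt, [page_images[i] for i in selected])
```

A stub VLM makes the control flow easy to check without any API access:

```python
def fake_vlm(prompt, images):
    # Pretend the model picks pages 2 and 0, then answers from them.
    if "indices" in prompt:
        return "2, 0"
    return f"answer from {len(images)} pages"

pages = ["page0.png", "page1.png", "page2.png"]
print(answer(fake_vlm, pages, "What is the total revenue?", top_k=2))
# → answer from 2 pages
```

The key design point is that no embedding index is built: "retrieval" is itself a reasoning step performed by the VLM, which is what makes the pipeline vectorless.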