Show HN: Vision-Based, Vectorless RAG for Long Documents
6 points | page_index | 4 months ago | github.com
Traditional OCR systems typically use a two-stage process that first detects the layout of a PDF — dividing it into text, tables, and images — and then recognizes and converts these elements into plain text. With the rise of vision-language models (VLMs) (such as Qwen-VL and GPT-4.1), new end-to-end OCR models like DeepSeek-OCR have emerged. These models jointly understand visual and textual information, enabling direct interpretation of PDFs without an explicit layout detection step.
However, this paradigm shift raises an important question:
> If a VLM can already process both the document images and the query to produce an answer directly, do we still need the intermediate OCR step?
We built a practical implementation of a vision-based question-answering system for long documents that does not rely on OCR. Specifically, we adopt a reasoning-based retrieval layer (no vector index) and the multimodal GPT-4.1 as the VLM for visual reasoning and answer generation.
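The two-stage idea can be sketched as follows. This is a minimal illustration, not the repo's actual code: the `llm` callable, the page-summary input, and all function names are assumptions. Stage one asks the model to reason over the query plus per-page summaries and name the pages it needs (replacing a vector index); stage two sends only those page images to the VLM to generate the answer.

```python
import re
from typing import Callable, List


def build_retrieval_prompt(query: str, page_summaries: List[str]) -> str:
    """Prompt asking the model to pick relevant pages by reasoning, not embeddings."""
    lines = [f"Page {i + 1}: {s}" for i, s in enumerate(page_summaries)]
    return (
        "Given the query and one-line page summaries below, reply with the "
        "numbers of the pages needed to answer the query.\n"
        f"Query: {query}\n" + "\n".join(lines)
    )


def parse_page_numbers(reply: str, n_pages: int) -> List[int]:
    """Extract in-range page numbers from the model's free-text reply."""
    nums = {int(m) for m in re.findall(r"\d+", reply)}
    return sorted(n for n in nums if 1 <= n <= n_pages)


def answer_query(
    query: str,
    page_images: List[bytes],
    page_summaries: List[str],
    llm: Callable[[str, List[bytes]], str],
) -> str:
    """Vectorless, vision-based QA: retrieve pages by reasoning, then answer from images.

    `llm(prompt, images)` is a hypothetical adapter around a multimodal model
    (e.g., GPT-4.1 via an API that accepts image attachments).
    """
    # Stage 1: reasoning-based retrieval -- no OCR text, no vector store.
    reply = llm(build_retrieval_prompt(query, page_summaries), [])
    pages = parse_page_numbers(reply, len(page_images)) or [1]
    # Stage 2: pass only the selected page images to the VLM for the answer.
    selected = [page_images[p - 1] for p in pages]
    return llm(f"Answer the query using the attached pages.\nQuery: {query}", selected)
```

Keeping the model call behind a plain callable makes the retrieval logic easy to test with a stub and easy to swap between VLM providers.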