
Show HN: onprem unstructured data extraction with 4 lines of code

8 points| souvik3333 | 10 months ago |github.com

The traditional pipeline for unstructured data extraction typically follows these steps:

1. Image → OCR model (e.g., Google Vision) → Layout model (e.g., Surya) → LLM → Final Answer

However, this can be streamlined using a vision-language model (VLM):

2. Image → VLM → Final Answer

Recently, VLMs have improved a lot at OCR and document-understanding tasks, especially the Qwen-2.5-VL series. We can run the Qwen-2.5-VL-7B-AWQ model locally with just 16 GB of VRAM and perform end-to-end information extraction (field and table extraction) without any external models.
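As a rough sketch of what the VLM step involves (the helper names below are illustrative, not the docext API — the actual 4-line interface is in the repo): you send the image plus a prompt listing the fields you want, and parse the model's JSON answer back into a dict.

```python
import json

def build_extraction_messages(image_path, fields):
    """Build a chat-format request asking a VLM to return the given
    fields as JSON. (Illustrative helper, not part of docext.)"""
    field_list = ", ".join(fields)
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text",
             "text": "Extract the following fields from the document "
                     f"and answer with a JSON object only: {field_list}"},
        ],
    }]

def parse_model_json(raw_output):
    """Strip an optional markdown fence from the model's reply and
    parse the JSON answer."""
    text = raw_output.strip()
    if text.startswith("```"):
        text = text.split("```")[1]
        if text.startswith("json"):
            text = text[4:]
    return json.loads(text)

messages = build_extraction_messages("invoice.png", ["invoice_number", "total"])
# These messages would be fed to a locally hosted Qwen-2.5-VL-7B-AWQ
# (e.g., via Hugging Face transformers or vLLM); the model's text reply
# is then parsed back into a dict:
print(parse_model_json('```json\n{"invoice_number": "INV-42", "total": "99.00"}\n```'))
# → {'invoice_number': 'INV-42', 'total': '99.00'}
```

The only model-specific part is hosting the weights; the prompt/parse plumbing above is the same for any chat-style VLM.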

Hallucination with VLMs

One question I am often asked about is hallucination with VLMs compared to an OCR model. This is a valid concern. But even with correct OCR and layout formatting, the LLM can still hallucinate and give incorrect final answers. Layout models often struggle with complex documents (e.g., tables and sparse layouts), and if the formatted text from the layout model is incorrect, the LLM will confidently produce incorrect extractions.
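One generic mitigation (not specific to docext) is a grounding check: verify that each extracted value literally appears in the raw document text, and flag any value that doesn't for review. A minimal sketch, assuming you have the page text from any source:

```python
import re

def grounding_check(extracted, source_text):
    """Flag extracted values that do not literally appear in the source
    text (after normalizing whitespace and case). A simple hallucination
    filter; illustrative only, not the docext implementation."""
    def norm(s):
        return re.sub(r"\s+", " ", str(s)).strip().lower()
    haystack = norm(source_text)
    return {field: norm(value) in haystack for field, value in extracted.items()}

fields = {"invoice_number": "INV-42", "total": "120.00"}
doc_text = "Invoice  INV-42 ... Amount due: 99.00"
print(grounding_check(fields, doc_text))
# invoice_number is grounded; total is not, so it should be re-checked
```

Exact substring matching is crude (it misses reformatted dates or numbers), but it catches the worst case of a confidently invented value.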

Check out our GitHub repo for implementation details: https://github.com/NanoNets/docext

Would love to hear suggestions for improvement!

