Show HN: onprem unstructured data extraction with 4 lines of code
8 points | souvik3333 | 10 months ago | github.com
Traditionally, extracting information from documents requires a multi-step pipeline: 1. Image → OCR → Layout model → LLM → Final Answer. However, this can be streamlined using a Vision-Language Model (VLM): 2. Image → VLM → Final Answer
Recently, VLMs have improved a lot at OCR and document-understanding tasks, particularly the Qwen-2.5-VL series. We can run the Qwen-2.5-VL-7B-AWQ model locally with just 16 GB of VRAM and perform end-to-end information extraction (field and table extraction) without any external models.
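To make the end-to-end flow concrete, here is a minimal sketch of sending a document image to a locally hosted VLM. It assumes the model is served behind an OpenAI-compatible endpoint (as vLLM provides); the model name, field names, and prompt wording are illustrative, not docext's actual API.

```python
import base64


def build_extraction_request(image_path, fields):
    """Build an OpenAI-style chat request asking a VLM to extract fields.

    This follows the OpenAI vision message format, which OpenAI-compatible
    local servers (e.g. vLLM) also accept for Qwen-2.5-VL models.
    """
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    # Ask for structured JSON output so the answer is machine-parseable.
    prompt = (
        "Extract the following fields from the document and reply "
        "with a JSON object: " + ", ".join(fields)
    )
    return {
        # Assumed model identifier for the AWQ-quantized 7B checkpoint.
        "model": "Qwen/Qwen2.5-VL-7B-Instruct-AWQ",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{image_b64}"
                    },
                },
            ],
        }],
        "temperature": 0.0,  # deterministic output for extraction tasks
    }
```

Pointed at a local server, this could be used roughly like so (hypothetical endpoint and file names):

```python
# from openai import OpenAI
# client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
# req = build_extraction_request("invoice.png", ["invoice_number", "total"])
# resp = client.chat.completions.create(**req)
```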
Hallucination with VLMs

One question I am often asked is about hallucination with VLMs compared to OCR models. This is a valid concern. But even with correct OCR and layout formatting, an LLM can still hallucinate and give incorrect final answers. Layout models also often struggle with complex documents (e.g., tables, complex sparse documents), and if the formatted text from the layout model is incorrect, the LLM will confidently produce incorrect extractions.
Check out our GitHub repo for implementation details: https://github.com/NanoNets/docext
Would love to hear suggestions for improvement!