gillesjacobs | 6 months ago
If you have scanned documents, last I checked Gemini Flash was very good in terms of cost/performance for document extraction. Mistral OCR claims better performance in its own benchmarks, but people I know who have used it, and other benchmarks, beg to differ. Personally I use Azure Document Intelligence a lot for its bounding box feature, but Gemini Flash apparently has this covered too.
https://getomni.ai/blog/ocr-benchmark
Sidenote: What you want for RAG is not OCR, as in just extracting text. The task for RAG preprocessing is typically called Document Layout Analysis or End-to-End Document Parsing/Extraction.
Good RAG is multimodal and aware of both semantic document structure and layout, so your pipeline needs to extract and recognize text sections, headers/footers, images, and tables. When working with PDFs, you want accurate bounding boxes in your metadata so you can point your users to the exact location of retrieved sources.
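To make the bounding box point concrete, here is a minimal sketch of what layout-aware chunk metadata could look like. Everything in it is an assumption for illustration: `LayoutElement`, `to_chunks`, `cite`, and the sample data are hypothetical names, not the output format of Gemini, Mistral OCR, or Azure Document Intelligence. It assumes your parser already gives you typed elements with a page number and an `(x0, y0, x1, y1)` box.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical element type; real parsers each have their own output schema.
@dataclass(frozen=True)
class LayoutElement:
    kind: str                                # "text", "table", "image", "header", "footer"
    text: str                                # extracted text (or a caption for images)
    page: int                                # 1-based page number
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates

def to_chunks(elements: List[LayoutElement], source: str) -> List[Dict]:
    """Drop running headers/footers and keep layout metadata on each chunk."""
    skip = {"header", "footer"}
    chunks = []
    for el in elements:
        if el.kind in skip:
            continue  # repeated page furniture only adds noise to retrieval
        chunks.append({
            "text": el.text,
            "metadata": {
                "source": source,
                "kind": el.kind,
                "page": el.page,
                "bbox": el.bbox,
            },
        })
    return chunks

def cite(chunk: Dict) -> str:
    """Format a pointer back to the exact region a retrieved chunk came from."""
    m = chunk["metadata"]
    x0, y0, x1, y1 = m["bbox"]
    return f"{m['source']} p.{m['page']} [{x0:g}, {y0:g}, {x1:g}, {y1:g}]"

# Made-up sample data standing in for a parser's output.
elements = [
    LayoutElement("header", "ACME Report", 1, (50, 20, 550, 40)),
    LayoutElement("text", "Revenue grew 12%.", 1, (50, 100, 550, 160)),
    LayoutElement("table", "Q1 | 10\nQ2 | 12", 2, (50, 200, 550, 320)),
    LayoutElement("footer", "Page 1", 1, (50, 780, 550, 800)),
]
chunks = to_chunks(elements, "report.pdf")
```

Calling `cite(chunks[0])` gives a string you can show next to an answer, and the same `bbox` can drive a highlight overlay in a PDF viewer. Keeping `kind` in the metadata also lets you treat tables and images differently at retrieval time.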
malshe | 6 months ago
Got it. Indeed, I need to do End-to-End Document Parsing/Extraction.