bigzyg33k | 4 months ago
The issue with the system described in this post is OCR inaccuracy. It's convenient to use LLMs for OCR of PDFs because PDFs do not have standard layouts: just using the text strings extracted from the PDF's code results in incorrect paragraph/sentence sequencing.
The way they *should* have used RAG is to check that the sub-sentence strings extracted via the LLM actually appear in the PDF, but it appears they were just trusting the output without any automated validation of the OCR.
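To make that concrete, here's a rough sketch of the kind of check I mean, not what the post's authors did. It assumes you already have the PDF's raw text layer as a string (from whatever PDF library you use) and the LLM's transcription as another string. The function names and the 4-word window are my own hypothetical choices. Since the raw text layer is often in the wrong order, it only checks short word windows rather than whole sentences.

```python
def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so line breaks in the PDF text layer don't matter.
    return " ".join(text.lower().split())


def unsupported_fragments(llm_text: str, pdf_text: str, window: int = 4) -> list[str]:
    """Return the word windows from the LLM output that never appear in the PDF text.

    Hypothetical helper: `window` is an arbitrary choice, small enough to survive
    scrambled paragraph order but large enough to catch altered numbers or names.
    """
    # Pad with spaces so a fragment only matches on whole-word boundaries.
    haystack = f" {normalize(pdf_text)} "
    words = normalize(llm_text).split()
    if not words:
        return []
    if len(words) <= window:
        fragments = [" ".join(words)]
    else:
        fragments = [
            " ".join(words[i : i + window])
            for i in range(len(words) - window + 1)
        ]
    return [frag for frag in fragments if f" {frag} " not in haystack]


# Text layer as it might come out of a PDF: right strings, wrong reading order.
pdf_text = "Total due:\n1,250 USD\nInvoice number 4471 issued to Acme Corp"

faithful = "Invoice number 4471 issued to Acme Corp"
altered = "Invoice number 4417 issued to Acme Corp"  # digits swapped by the model

print(unsupported_fragments(faithful, pdf_text))
print(unsupported_fragments(altered, pdf_text))
```

An empty list means every window was found somewhere in the PDF text. Anything returned is a fragment to flag for human review. Exact matching like this is strict about punctuation and hyphenation, so a real pipeline would probably want a fuzzier comparison.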
eoinbmorg | 4 months ago
I could be totally wrong here.
simonw | 4 months ago
jaccola | 4 months ago
Though I imagine there are scenarios where the PDF is just an image (e.g. a scan of a form), in which case there is no embedded text to compare against and the validation would not work.