Show HN: Open-source Rule-based PDF parser for RAG
293 points| jnathsf | 2 years ago |github.com
The PDF Parser offers the following features:
* Sections and subsections along with their levels.
* Paragraphs - combines lines.
* Links between sections and paragraphs.
* Tables along with the section the tables are found in.
* Lists and nested lists.
* Join content spread across pages.
* Removal of repeating headers and footers.
* Watermark removal.
* OCR with bounding boxes.
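To make the "removal of repeating headers and footers" item concrete, here is a minimal sketch of one common rule: a line that shows up at the top or bottom edge of most pages is treated as boilerplate. This is not taken from the project's code. The input shape (a list of pages, each a list of text lines) and the names `normalize`, `strip_repeated_edges`, `edge_lines` and `min_share` are assumptions made for illustration.

```python
from collections import Counter
import re


def normalize(line: str) -> str:
    """Collapse whitespace and mask digits so 'Page 3' and 'Page 12' compare equal."""
    return re.sub(r"\d+", "#", " ".join(line.split())).lower()


def strip_repeated_edges(pages, edge_lines=2, min_share=0.6):
    """Drop lines that recur at the top or bottom of most pages.

    pages: list of pages, each a list of text lines (hypothetical input shape).
    edge_lines: how many lines at each page edge to inspect.
    min_share: fraction of pages a line must appear on to count as boilerplate.
    """
    counts = Counter()
    for lines in pages:
        edge = lines[:edge_lines] + lines[-edge_lines:]
        # Count each normalized edge line at most once per page.
        counts.update({normalize(l) for l in edge if l.strip()})

    # Require at least two pages so a single page never defines boilerplate.
    threshold = max(2, int(min_share * len(pages)))
    boilerplate = {key for key, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in pages:
        kept = []
        for i, line in enumerate(lines):
            at_edge = i < edge_lines or i >= len(lines) - edge_lines
            if at_edge and normalize(line) in boilerplate:
                continue  # repeated header or footer
            kept.append(line)
        cleaned.append(kept)
    return cleaned


pages = [
    ["ACME Annual Report", "Intro text.", "More text.", "Page 1"],
    ["ACME Annual Report", "Second page text.", "Even more.", "Page 2"],
    ["ACME Annual Report", "Third page text.", "The end.", "Page 3"],
]
cleaned = strip_repeated_edges(pages)
```

Masking digits is what lets page-number footers match across pages. Only edge lines are ever dropped, so a sentence in the body that happens to repeat is left alone.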
dmezzetti|2 years ago
dmezzetti|2 years ago
Here are a couple of examples:
- https://neuml.hashnode.dev/build-rag-pipelines-with-txtai
- https://neuml.hashnode.dev/extract-text-from-documents
Disclaimer: I'm the primary author of txtai (https://github.com/neuml/txtai).
mpeg|2 years ago
Currently I'm using a mix of MuPDF + AWS Textract (for tables, mostly), but I'd love to understand what other people are doing.
epaga|2 years ago
Parsing PDFs can be quite the headache because the format is so complex. We support most of these features already but there are always so many edge cases that additional angles can be very helpful.
muzamil-ali|2 years ago
lmeyerov|2 years ago
There are now a lot of file loaders for RAG (langchain, LlamaIndex, unstructured, ...). Are there any reasons, like a leading benchmark score, to prefer this one?
mpeg|2 years ago
However, I have a PDF parsing use case that I tried those RAG tools for, and the output they gave me was pretty low quality. It kinda works for RAG, as the LLM can work around the issues, but if you want higher-quality responses with proper references and such, I think the best way is to write your own rule-based parser, which is what I ended up doing (based on MuPDF though, not Tika).
Maybe that's what the authors of this tool were thinking too.
rmsaksida|2 years ago
mistrial9|2 years ago
Overall, I believe there has to be some middle ground for identification and trust building over time, between "hidden group with no names on $CORP secure site" and other traditional means of introduction and trust building.
Thanks for posting this interesting and relevant work.
asukla|2 years ago
firtoz|2 years ago
asukla|2 years ago
Some examples, with a notebook, are here: https://github.com/nlmatics/llmsherpa
Here's another notebook from the repo with examples: https://github.com/nlmatics/nlm-ingestor/blob/main/notebooks...
huqedato|2 years ago
guidedlight|2 years ago
asukla|2 years ago
You can use the llmsherpa library (https://github.com/nlmatics/llmsherpa) with this server to get nice, layout-friendly chunks for your LLM/RAG project.
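For anyone wondering what "layout-friendly chunks" could mean in practice, here is a rough sketch of the idea: each paragraph becomes one chunk, prefixed with the titles of the sections it sits under, so the retriever sees the context. This is not the llmsherpa API. `Block`, its `kind` labels and `context_chunks` are hypothetical names, and the flat list of blocks is an assumed parser output.

```python
from dataclasses import dataclass


@dataclass
class Block:
    kind: str   # "header" or "para" (hypothetical labels)
    level: int  # nesting depth for headers; ignored for paragraphs
    text: str


def context_chunks(blocks):
    """Return one chunk per paragraph, prefixed with its section path."""
    path = []  # stack of (level, title) for the currently open sections
    chunks = []
    for block in blocks:
        if block.kind == "header":
            # A new header closes any open section at the same or deeper level.
            while path and path[-1][0] >= block.level:
                path.pop()
            path.append((block.level, block.text))
        else:
            titles = " > ".join(title for _, title in path)
            chunks.append(f"{titles}\n{block.text}" if titles else block.text)
    return chunks


blocks = [
    Block("header", 1, "Methods"),
    Block("para", 0, "We parse PDFs."),
    Block("header", 2, "Tables"),
    Block("para", 0, "Tables keep their section."),
    Block("header", 1, "Results"),
    Block("para", 0, "It works."),
]
chunks = context_chunks(blocks)
```

The stack of open sections is what ties a chunk to its section: a paragraph under "Tables" carries "Methods > Tables" with it, even if the two headers were pages apart in the PDF.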
ramoz|2 years ago
What this library, and something like fitz/PyMuPDF, allows you to do is extract the text straight from the PDF, using rules about how to parse and structure it. (With most modern PDFs you can extract text without OCR.)
- Much cheaper, obviously, but it doesn’t scale well across dynamic layouts, so you are likely using this when you can configure around a standard structure. That said, I have found rule-based text extraction to work fairly dynamically for things like scientific PDFs.
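As an illustration of the kind of rule that tends to hold up on scientific PDFs, here is a small sketch that splits already-extracted text lines into sections using numbered headings ("1 Introduction", "2.1 Related work"). It assumes the text has been pulled out of the PDF beforehand by whatever extractor you use; the `HEADING` pattern, the `split_sections` name and the numbering convention are assumptions, and real documents will need more rules than this.

```python
import re

# Matches "1 Introduction", "2.3 Results", "4.1.2. Ablations" (assumed convention):
# a dotted section number, optional trailing dot, then a capitalized title.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+([A-Z].{0,80})$")


def split_sections(lines):
    """Group plain-text lines into sections with a level, number, title and body."""
    sections = []
    current = None
    for raw in lines:
        line = raw.strip()
        match = HEADING.match(line)
        if match:
            number, title = match.groups()
            current = {
                "level": number.count(".") + 1,  # "2.1" -> level 2
                "number": number,
                "title": title,
                "body": [],
            }
            sections.append(current)
        elif current is not None and line:
            # Non-empty lines after a heading belong to that section.
            current["body"].append(line)
    return sections


lines = [
    "Title of paper",
    "1 Introduction",
    "PDFs are hard.",
    "",
    "1.1 Background",
    "Layout matters.",
    "2 Method",
    "We use rules.",
]
sections = split_sections(lines)
```

A rule like this is cheap and predictable, and it is also where the scaling problem shows: a body line such as "12 Angry reviewers objected" would be misread as a heading, so each new layout usually means another exception.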
StrauXX|2 years ago
cdolan|2 years ago
I fear Tesseract OCR is a potential limitation though. I’ve seen it make so many mistakes.
jvdvegt|2 years ago
asukla|2 years ago
xfalcox|2 years ago
ilaksh|2 years ago
Looks like Apache 2 license which is nice.
genewitch|2 years ago