Show HN: Open-source Rule-based PDF parser for RAG

293 points| jnathsf | 2 years ago |github.com

The PDF parser is a rule-based parser that uses text coordinates (bounding boxes), graphics, and font data. It works off the text layer and also offers an OCR option that automatically falls back to OCR when there are scanned pages in your PDFs. The OCR feature is based on a modified version of Tika that uses Tesseract underneath.

The PDF Parser offers the following features:

* Sections and subsections along with their levels.
* Paragraphs - combines lines.
* Links between sections and paragraphs.
* Tables along with the section the tables are found in.
* Lists and nested lists.
* Joining content spread across pages.
* Removal of repeating headers and footers.
* Watermark removal.
* OCR with bounding boxes.
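The section/heading detection above can be sketched with a simple font-size heuristic, which is one common rule in this style of parser (the span data and threshold here are hypothetical, not the project's actual implementation):

```python
from collections import Counter

# Hypothetical spans a rule-based parser might read from a PDF's text
# layer: (text, font_size, bbox) tuples.
spans = [
    ("1. Introduction", 16.0, (72, 90, 250, 108)),
    ("PDF parsing is hard.", 10.0, (72, 120, 400, 132)),
    ("1.1 Motivation", 13.0, (72, 150, 220, 166)),
    ("Layouts vary widely.", 10.0, (72, 180, 400, 192)),
]

def body_size(spans):
    """Assume the most common font size is body text."""
    return Counter(size for _, size, _ in spans).most_common(1)[0][0]

def classify(spans):
    """Tag spans noticeably larger than body text as headings."""
    base = body_size(spans)
    return [
        ("heading" if size > base * 1.2 else "paragraph", text)
        for text, size, _ in spans
    ]
```

A real parser layers many more signals on top (boldness, numbering patterns, indentation), but relative font size alone already recovers a surprising amount of structure.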

32 comments

dmezzetti|2 years ago

Nice project! I've long used Tika for document parsing given its maturity and the wide number of formats it supports. The XHTML output helps with chunking documents for RAG.

Here are a couple of examples:

- https://neuml.hashnode.dev/build-rag-pipelines-with-txtai

- https://neuml.hashnode.dev/extract-text-from-documents

Disclaimer: I'm the primary author of txtai (https://github.com/neuml/txtai).

mpeg|2 years ago

Off-topic, but do you know how Tika compares to other PDF parsing libraries? I was very unimpressed by pdfminer.six (what unstructured uses): the layout detection seems pretty basic, and it fails to parse multi-column text, whereas MuPDF handles it perfectly.

Currently I'm using a mix of MuPDF + AWS Textract (mostly for tables), but I'd love to understand what other people are doing.

epaga|2 years ago

This looks like it could be very helpful. The company I work for has a PDF comparison tool called "PDFC" which can read PDFs and runs comparisons of semantic differences. https://www.inetsoftware.de/products/pdf-content-comparer

Parsing PDFs can be quite the headache because the format is so complex. We support most of these features already but there are always so many edge cases that additional angles can be very helpful.

muzamil-ali|2 years ago

You're absolutely right; parsing PDFs can be a real headache due to their inherent complexity. The format itself can vary in structure, layout, and embedded components, making it difficult to extract and compare information consistently. Even with robust tools like PDFC, edge cases can always emerge, requiring further refinements.

lmeyerov|2 years ago

Tesseract OCR fallback sounds great!

There are now a lot of file loaders for RAG (LangChain, LlamaIndex, unstructured, ...). Any reason, like a leading benchmark score, to prefer this one?

mpeg|2 years ago

I couldn't try this tool as it doesn't build on Apple silicon (and there's no ARM Docker image).

However, I have a PDF parsing use case that I tried those RAG tools on, but the output they give me is pretty low quality. It kind of works for RAG since the LLM can work around the issues, but if you want higher-quality responses with proper references and such, I think the best way is to write your own rule-based parser, which is what I ended up doing (based on MuPDF, though, not Tika).
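One concrete rule such a hand-rolled parser needs is recovering reading order from multi-column layouts (the pdfminer.six weakness mentioned upthread). A minimal sketch, assuming a two-column page and hypothetical (text, x0, y0) span tuples rather than real MuPDF output:

```python
# Split spans on the page midline, then read each column top-to-bottom.
# A real parser would take coordinates from something like MuPDF's
# page.get_text("dict") and detect the column count instead of assuming two.
def reading_order(spans, page_width=612):
    mid = page_width / 2
    left = [s for s in spans if s[1] < mid]
    right = [s for s in spans if s[1] >= mid]
    ordered = sorted(left, key=lambda s: s[2]) + sorted(right, key=lambda s: s[2])
    return [text for text, _, _ in ordered]

spans = [
    ("col2 line1", 330, 100),
    ("col1 line2", 72, 140),
    ("col1 line1", 72, 100),
    ("col2 line2", 330, 140),
]
```

Naive top-to-bottom sorting would interleave the columns; partitioning by x-coordinate first is what keeps each column's text contiguous.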

Maybe that's what the authors of this tool were thinking too.

rmsaksida|2 years ago

Last time I tried Langchain (admittedly, that was ~6 months ago) the implementations for content extraction from PDFs and HTML files were very basic. Enough to get a prototype RAG solution going, but not enough to build anything reliable. This looks like a much more battle-tested implementation.

mistrial9|2 years ago

Great effort and very interesting. However, when I go to GitHub I see "This organization has no public members". I do not know who you are at all, or what else might be part of this without disclosure.

Overall, I believe there has to be some middle ground for identification and trust building over time, between "hidden group with no names on $CORP secure site" and other traditional means of introduction and trust building.

thanks for posting this interesting and relevant work

asukla|2 years ago

Thanks for the post. Please use this server with the llmsherpa LayoutPDFReader to get optimal chunks for your LLM/RAG project: https://github.com/nlmatics/llmsherpa. See examples and notebook in the repo.

huqedato|2 years ago

I tried to parse a few hundred PDFs with it. The results are pretty decent. If this were developed in Julia, it would be ten times faster (at least).

guidedlight|2 years ago

How does this differ from Azure Document Intelligence, or are they effectively the same thing?

asukla|2 years ago

No, we are not doing the same thing. Most cloud parsers use a vision model; they are a lot slower, more expensive, and you need to write code on top of them to extract good chunks.

You can use llmsherpa library - https://github.com/nlmatics/llmsherpa with this server to get nice layout friendly chunks for your LLM/RAG project.

ramoz|2 years ago

There's no OCR or AI involved here (other than the standard fallback).

What this library, and something like fitz/pymupdf, allow you to do is extract the text straight from the PDF, using rules about how to parse and structure it. (With most modern PDFs you can extract text without OCR.)

- much cheaper, obviously, but it doesn't scale well across dynamic layouts, so you'll likely use this when you can configure around a standard structure. I have found rule-based text extraction to work fairly dynamically, though, for things like scientific PDFs.
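The "removal of repeating headers and footers" feature from the original post is a good example of such a rule: a line that shows up in the same slot on nearly every page is boilerplate. A minimal sketch, with pages as hypothetical lists of line strings:

```python
from collections import Counter

def strip_repeats(pages, threshold=0.8):
    """Drop first/last lines that repeat verbatim on most pages."""
    counts = Counter()
    for page in pages:
        # Only a page's first and last lines are header/footer candidates.
        counts.update({page[0], page[-1]})
    n = len(pages)
    repeats = {line for line, c in counts.items() if c / n >= threshold}
    return [[ln for ln in page if ln not in repeats] for page in pages]

pages = [
    ["ACME Corp", "content a", "Page 1"],
    ["ACME Corp", "content b", "Page 2"],
    ["ACME Corp", "content c", "Page 3"],
]
```

Note that exact matching misses footers like "Page 1" / "Page 2"; a real implementation would normalize digits or compare bounding-box positions before counting repeats.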

StrauXX|2 years ago

Last I used it, Azure Document Intelligence wasn't all that smart about choosing split points. This seems to implement better heuristics.

cdolan|2 years ago

I am also curious about this. ADI is reliable but does have edge case issues on malformed PDF

I fear Tesseract OCR is a potential limitation though. I've seen it make so many mistakes.

jvdvegt|2 years ago

Do you have any examples? There doesn't seem to be a single PDF file in the repo.

asukla|2 years ago

You can see examples in llmsherpa project - https://github.com/nlmatics/llmsherpa. This project nlm-ingestor provides you the backend to work with llmsherpa. The llmsherpa library is very convenient to use for extracting nice chunks for your LLM/RAG project.

xfalcox|2 years ago

We've been looking for something exactly like this, thanks for sharing!

ilaksh|2 years ago

How does this compare to PaddleOCR?

Looks like Apache 2 license which is nice.

genewitch|2 years ago

"Retrieval Augmented Generation"