Hmm, I have to say I'm pretty unimpressed with my initial experience here.

1. Signing up with email just redirected endlessly: click the link in the email, get asked to sign up with email, enter the email, click the link in the email, and so on.

2. Fine, I'll sign in with Google.

3. A PDF parser? Seriously, that's what all this fuss is about? There are so many options already out there: PDFBox, iText, Unstructured, PyPDF, PDF.js, and PDFMiner, not to mention the extraction services available from the hyperscalers. I'm super confused about why anyone needs this.

verdverm|2 years ago

LlamaIndex is way more than a PDF parser. It's the most widely used RAG toolchain, and their cloud looks to be a managed version of that.

Specific to the parser, they do show where tools like the ones you mentioned fail, and how their LLM-based parser captures the full data that those tools miss.

kurts_mustache|2 years ago

Yeah, but their platform is basically a janky PDF parser, which is why I don't understand what the hype is about.

It's easy to cherry-pick a PDF for marketing purposes and claim you're better. I didn't miss it; I just don't take marketing announcements at face value. I tried their parser on a PDF with some complex formatting (multiple columns, tables, and a couple of images) and it choked, spitting out one big markdown header with jumbled text. Not impressed.

cheesyFishes|2 years ago

There is a PDF parser, LlamaParse, which is open to everyone, and a managed ingestion/retrieval service that is currently invite-only.

Planning broader releases in the future for sure.

asukla|2 years ago

To get good RAG performance, you need a good chunking strategy. Simply getting all the text is not good enough; knowing the boundaries of tables, lists, paragraphs, sections, etc. is helpful.

Great work by the LlamaIndex team. Also feel free to try https://github.com/nlmatics/llmsherpa, which takes into account some of the things I mentioned.
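
To make the chunking point concrete, here is a rough sketch of boundary-aware chunking in plain Python. It assumes the parser has already produced markdown-like text with blank lines between blocks; the names `Block`, `classify`, `split_blocks`, and `chunk` are made up for illustration and are not the llmsherpa or LlamaIndex API.

```python
import re
from dataclasses import dataclass


@dataclass
class Block:
    kind: str  # "heading", "table", "list", or "paragraph"
    text: str


# Matches "- item", "* item", "+ item", "1. item", or "1) item".
LIST_ITEM = re.compile(r"^\s*(?:[-*+]|\d+[.)])\s+")


def classify(block_text: str) -> str:
    """Guess the structural type of one blank-line-delimited block."""
    lines = block_text.splitlines()
    if lines[0].startswith("#"):
        return "heading"
    if all(line.lstrip().startswith("|") for line in lines):
        return "table"
    if LIST_ITEM.match(lines[0]):
        return "list"
    return "paragraph"


def split_blocks(markdown: str) -> list[Block]:
    """Split text on blank lines and tag each block with its type."""
    blocks = []
    for raw in re.split(r"\n\s*\n", markdown.strip()):
        text = raw.strip()
        if text:
            blocks.append(Block(classify(text), text))
    return blocks


def chunk(markdown: str, max_chars: int = 500) -> list[str]:
    """Pack whole blocks into chunks; never cut inside a table or list.

    A new chunk starts at every heading, or when adding the next block
    would exceed max_chars. A single oversized block is kept whole.
    """
    chunks, current, size = [], [], 0
    for block in split_blocks(markdown):
        starts_section = block.kind == "heading"
        too_big = size + len(block.text) > max_chars
        if current and (starts_section or too_big):
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(block.text)
        size += len(block.text)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Compared with cutting every N characters, this keeps a table or list in one piece and keeps a heading attached to the text that follows it, at the cost of uneven chunk sizes. It only works as well as the parser's output: if the parser emits one big header with jumbled text, there are no boundaries left to respect.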