diptanu | 7 months ago

Disclaimer - Founder of Tensorlake, we built a Document Parsing API for developers.

This is exactly why Computer Vision approaches to parsing PDFs work so well in the real world. Relying on metadata in files just doesn't scale across different sources of PDFs.

We convert PDFs to images, run a layout understanding model on them first, then apply specialized models like text recognition and table recognition, and stitch the results back together to get acceptable output for domains where accuracy is table stakes.
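The "stitch them back together" step is roughly a reading-order problem over the layout model's output. A minimal sketch, assuming a detector has already produced labeled bounding boxes (the region labels and coordinates here are invented for illustration):

```python
# Toy sketch: stitch layout-model regions back into reading order.
# Assumes a layout model already returned labeled boxes (x0, y0, x1, y1);
# real pipelines handle rotation, overlaps, and multi-page flow too.

def reading_order(regions, column_gap=50):
    """Group regions into columns by x-position, then sort each column
    top-to-bottom, reading columns left-to-right."""
    columns = []
    for region in sorted(regions, key=lambda r: r["box"][0]):
        x0 = region["box"][0]
        for col in columns:
            if abs(col["x0"] - x0) < column_gap:
                col["items"].append(region)
                break
        else:
            columns.append({"x0": x0, "items": [region]})
    ordered = []
    for col in sorted(columns, key=lambda c: c["x0"]):
        ordered.extend(sorted(col["items"], key=lambda r: r["box"][1]))
    return ordered

regions = [
    {"label": "paragraph", "box": (310, 80, 590, 400), "text": "right column"},
    {"label": "heading",   "box": (20, 20, 590, 60),   "text": "title"},
    {"label": "table",     "box": (20, 300, 290, 500), "text": "table"},
    {"label": "paragraph", "box": (20, 80, 290, 280),  "text": "left column"},
]
print([r["text"] for r in reading_order(regions)])
```

Each ordered region then gets routed to the appropriate specialized model (OCR for text blocks, a table model for tables) before the outputs are concatenated.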

vander_elst|6 months ago

It might sound absurd, but on paper this should be the best way to approach the problem.

My understanding is that PDFs are intended to produce output consumed by humans, not computers; the format seems focused on how to display data so that a human can (hopefully) read it easily. The technique here mimics the human approach, which would seem to make sense.

It is sad, though, that in 30+ years we didn't manage to add a consistent way to make a PDF readable by a machine. I wonder what incentives were missing that didn't make this possible. Does anyone have some insight here?

layer8|6 months ago

> It might sound absurd, but on paper this should be the best way to approach the problem.

On paper yes, but for electronic documents? ;)

More seriously: PDF supports all the necessary features, like structure tags. You can create a PDF with basically the same structural information as an HTML document. The problem is that most PDF-generating workflows don't bother with it, because it requires care and is more work.

And yes, PDF was originally created as an input format for printing. The “portable” in “PDF” refers to the fact that, unlike PostScript files of the time (1980s), they are not tied to a specific printer make or model.

apt-apt-apt-apt|6 months ago

Probably for the same reason images were not readable by machines.

Except PDFs dangle hope of maybe being machine-readable because they can contain unicode text, while images don't offer this hope.

actionfromafar|6 months ago

1. It's extra work to add an annotation or "internal data format" inside the PDF.

2. By the time the PDF is generated in a real system, the original data source and its meaning may be far upstream in the data pipeline. Recovering them may require incredible cross-team and/or cross-vendor cooperation.

3. Chicken and egg. There are very few (if any) machine-parseable PDFs out there, so there is little demand for them.

I'm actually much more optimistic about embedding metadata "in-band" with the human-readable data, such as in a dense QR code or similar.
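A rough sketch of what such an in-band payload might look like before it's handed to a QR encoder (the field names and versioning scheme are invented; actual QR rendering would come from a library and is not shown):

```python
# Sketch of a compact machine-readable payload that could be printed in-band
# (e.g. as a dense QR code on the page). Serialization format is made up.
import base64, json, zlib

def make_payload(record: dict) -> str:
    """Serialize, compress, and checksum a record into a compact ASCII string."""
    raw = json.dumps(record, separators=(",", ":"), sort_keys=True).encode()
    body = base64.b85encode(zlib.compress(raw)).decode()
    crc = zlib.crc32(raw)
    return f"v1.{crc:08x}.{body}"

def read_payload(payload: str) -> dict:
    """Inverse of make_payload; raises if the checksum doesn't match."""
    version, crc, body = payload.split(".", 2)
    raw = zlib.decompress(base64.b85decode(body))
    assert zlib.crc32(raw) == int(crc, 16), "corrupted payload"
    return json.loads(raw)

record = {"invoice": "2024-0042", "total": "199.00", "currency": "EUR"}
payload = make_payload(record)
assert read_payload(payload) == record
```

The appeal is that the payload survives print-and-rescan, since it travels with the pixels rather than with file metadata that gets stripped along the pipeline.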

formerly_proven|6 months ago

Yes, PDFs are primarily a way to describe print data, so to a certain extent the essence of PDF is a hybrid vector-raster image format. Sure, these days text is almost always encoded as, or overlaid with, actual machine-readable text, but this isn't strictly necessary and wasn't always done, especially in older PDFs. 15 years ago you couldn't copy (legible) text out of most PDFs made with LaTeX.

lou1306|6 months ago

> the format seems to be focused on how to display some data so that a human can (hopefully) easily read them

It may seem so, but what it really focuses on is how to arrange stuff on a page that has to be printed. Literally everything else, from forms to hyperlinks, was a later addition (and it shows, given the crater-sized security holes they punched into the format).

immibis|6 months ago

It's Portable Document Format, and the Document refers to paper documents, not computer files.

In other words, this is a way to get a paper document into a computer.

That's why half of them are just images: they were scanned by scanners. Sometimes the images have OCR metadata so you can select text and when you copy and paste it it's wrong.

BobbyTables2|7 months ago

Kinda funny.

Printing a PDF and scanning it for an email would normally be worthy of major ridicule.

But you’re basically doing that to parse it.

I get it, have heard of others doing the same. Just seems damn frustrating that such is necessary. The world sure doesn’t parse HTML that way!

sbrother|7 months ago

I've built document parsing pipelines for a few clients recently, and yeah this approach yields way superior results using what's currently available. Which is completely absurd, but here we are.

wrs|7 months ago

Maybe not literally that, but the eldritch horrors of parsing real-world HTML are not to be taken lightly!

Muromec|6 months ago

If the HTML in question included JavaScript that renders everything, including text, into a canvas -- yes, this is how you would parse it. And PDF is basically that.

raincole|6 months ago

The analogy doesn't work, though. If you print a PNG and scan it for an email, you will be ridiculed. But OCRing a PNG is perfectly valid.

sidebute|7 months ago

While we have a PDF internals expert here, I'm itching to ask: why is mupdf-gl so much faster than everything else? (On vanilla desktop Linux.)

Its search speed on big PDFs is dramatically faster than anything else I've tried, and I've often wondered why the others can't be as fast as mupdf-gl.

Thanks for any insights!

DannyBee|6 months ago

It's funny you ask this - I have spent time building PDF indexing/search apps on the side over the past few weeks.

I'll give you the rundown. The answer to your specific question is basically: some of them process letter by letter to put text back in order, and some don't. Some build fast trie-based indexes for searching, some don't.

All of my machine manuals etc. are in PDF, and too many search apps/OS search indexers don't make it simple to find things in them. I have a really good app on the Mac, but basically nothing on Windows. All I want is a dumb single-window app that can manage PDF collections, search them for words, and display the results for me. Nothing more, nothing less.

So I built one for my non-Mac platforms over the past few weeks. One version in C++ (using Qt), one version in .NET (using MAUI), for fun.

All told, I'm indexing (for this particular example) 2500 PDFs with about 150k pages in them.

On the indexing side, Lucene and SQLite FTS do a fine job, and no issues - both are fast, and indexing/search is not limited by their speed or capability.
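For anyone who hasn't used it, the SQLite FTS side really is a few lines. A minimal sketch (the PDF text extraction is stubbed out with fake rows, since that's the part where libraries differ):

```python
# Minimal sketch of full-text indexing with SQLite FTS5. The extraction
# step is faked; in a real app each row comes from a PDF library.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE pages USING fts5(file, page, body)")

# Pretend these came out of a PDF text-extraction library, one row per page.
extracted = [
    ("manual_a.pdf", "1", "spindle speed chart and feed rates"),
    ("manual_a.pdf", "2", "lubrication schedule for the gearbox"),
    ("manual_b.pdf", "1", "wiring diagram for spindle motor"),
]
con.executemany("INSERT INTO pages VALUES (?, ?, ?)", extracted)

# Full-text query, ranked by relevance (FTS5's built-in bm25 rank).
hits = con.execute(
    "SELECT file, page FROM pages WHERE pages MATCH ? ORDER BY rank",
    ("spindle",),
).fetchall()
print(hits)
```

As the comment says, this layer is rarely the bottleneck; the extraction feeding it is.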

On the PDF parsing/text extraction side, I have tried literally every library I could find for my ecosystem (about 25), both commercial and not. I did not try libraries that I know share underlying text-extraction engines (i.e., there are a million pdfium wrappers).

I parse in parallel (i.e., files are processed in parallel), extract pages in parallel (i.e., every page is processed in parallel), and index the extracted text either in parallel or in batches (Lucene is happy with multiple threads indexing; SQLite would rather have me do it sequentially in batches).

The slowest libraries are 100x slower than the fastest at extracting text. They cluster, too, so I assume some of them share underlying strategies or code despite my attempt to identify these ahead of time. The current Foxit SDK can extract about 1000-2000 pages per second, sometimes faster, while things like PdfPig can only do about 10 pages per second.

pdfium would be as fast as the current Foxit SDK, but it is not thread safe (I assume because it's based on a source drop of Foxit from before they added thread safety), so all calls are serialized. Even so, it can do about 100-200 pages/second.

Memory usage also varies wildly and is uncorrelated with speed (i.e., there are fast ones that take tons of memory and slow ones that take tons of memory). For the native ones, memory usage seems more related to fragmentation than to anything dumb. There are, of course, some dumb things (one library creates a new C++ class instance for every letter).

From what I can tell digging into the code that's available, it's all about how much work they do up front when loading the file, and then how much time they take to put the text back into content order before handing it to me.

The slowest are doing letter by letter. The fastest are not.
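To make the "letter by letter" point concrete: extraction can hand you individual glyphs with positions, in content-stream order rather than reading order, and the library has to cluster them back into lines. A toy sketch with made-up coordinates:

```python
# Rough illustration of the per-glyph work: cluster glyphs into lines by
# their y coordinate, then sort each line left-to-right by x.

def glyphs_to_text(glyphs, line_tolerance=3):
    """glyphs: iterable of (char, x, y). Returns text in reading order."""
    lines = []
    for ch, x, y in glyphs:
        for line in lines:
            if abs(line["y"] - y) <= line_tolerance:
                line["chars"].append((x, ch))
                break
        else:
            lines.append({"y": y, "chars": [(x, ch)]})
    lines.sort(key=lambda l: l["y"])
    return "\n".join(
        "".join(ch for _, ch in sorted(line["chars"])) for line in lines
    )

# Glyphs arrive in arbitrary content-stream order, not reading order.
glyphs = [
    ("l", 24, 10), ("o", 12, 22), ("w", 6, 22), ("H", 6, 11), ("d", 30, 22),
    ("e", 12, 10), ("l", 18, 10), ("r", 18, 22), ("o", 30, 10), ("l", 24, 22),
]
print(glyphs_to_text(glyphs))
```

Doing this per glyph, per page, with allocations for each character, is exactly the kind of work that separates the 10-pages/second libraries from the 1000-pages/second ones.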

Rendering is similar - some of them are dominated by stupid shit you notice instantly with a profiler. For example, one of the .NET libraries renders to PNG-encoded bitmaps by default, and between it and Windows, it spends 300ms encoding/decoding for display - 10x longer than rasterizing took. If I switch it to render to BMP instead, the encode/decode takes 5ms (for dumb reasons, the MAUI APIs require streams to create drawable images). The difference is very noticeable when I browse through search results with the up/down keys.

Anyway, hopefully this helps answer your question and some related ones.

rkagerer|7 months ago

So you've outsourced the parsing to whatever software you're using to render the PDF as an image.

bee_rider|7 months ago

Seems like a fairly reasonable decision given all the high quality implementations out there.

rafram|7 months ago

This has close to zero relevance to the OP.

lovelearning|7 months ago

I think it's a useful insight for people working on RAG using LLMs.

Devs working on RAG have to decide between parsing PDFs or using computer vision or both.

The author of the blog works on PdfPig, a framework to parse PDFs. For its document understanding APIs, it uses a hybrid approach that combines basic image understanding algorithms with PDF metadata. https://github.com/UglyToad/PdfPig/wiki/Document-Layout-Anal...

GP's comment says a pure computer vision approach may be more effective in many real-world scenarios. It's an interesting insight, since many devs would assume that pure computer vision is the less capable but also more complex approach.

As for the other comments that suggest directly using a parsing library's rendering APIs instead of rasterizing the end result: detecting high-level visual objects (like tables, headings, and illustrations) and getting their coordinates is far easier with vision models than trying to infer those structures by examining hundreds of low-level PDF line, text, glyph, and other objects. I suspect those commenters have never tried to extract high-level structures from PDF object models. Try it once using PDFBox, fitz, etc. to understand the difficulty. PDF really is a terrible format!

Alex3917|7 months ago

> This is exactly the reason why Computer Vision approaches for parsing PDFs works so well in the real world.

One of the biggest benefits of PDFs though is that they can contain invisible data. E.g. the spec allows me to embed cryptographic proof that I've worked at the companies I claim to have worked at within my resume. But a vision-based approach obviously isn't going to be able to capture that.
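The PDF spec does allow embedded file attachments and digital signatures, so a claim like this could ride along invisibly inside the document. A toy sketch of the signing idea, using stdlib HMAC only to stay dependency-free (a real system would use asymmetric signatures, e.g. PDF's digital-signature machinery or verifiable credentials; all names here are placeholders):

```python
# Toy sketch: a signed employment claim that could be embedded as a PDF
# attachment. HMAC (shared secret) stands in for a real asymmetric signature.
import hashlib, hmac, json

EMPLOYER_KEY = b"demo-secret-known-only-to-employer"  # placeholder

def sign_claim(claim: dict, key: bytes) -> dict:
    msg = json.dumps(claim, sort_keys=True).encode()
    return {"claim": claim,
            "sig": hmac.new(key, msg, hashlib.sha256).hexdigest()}

def verify_claim(signed: dict, key: bytes) -> bool:
    msg = json.dumps(signed["claim"], sort_keys=True).encode()
    expected = hmac.new(key, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(signed["sig"], expected)

signed = sign_claim(
    {"employee": "Jane Doe", "employer": "Acme", "role": "SWE"}, EMPLOYER_KEY)
assert verify_claim(signed, EMPLOYER_KEY)

# Tampering with the claim invalidates the signature.
forged = {**signed, "claim": {**signed["claim"], "role": "CTO"}}
assert not verify_claim(forged, EMPLOYER_KEY)
```

And as the comment notes, a rasterize-and-OCR pipeline never sees any of this, because it only ever looks at the rendered pixels.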

throwaway4496|7 months ago

Cryptographic proof of job experience? Please explain more. Sounds interesting.

bzmrgonz|7 months ago

What software can be used to write and read this invisible data? I want to document continuous edits to published documents which cannot show these edits until they are reviewed, compiled and revised. I was looking at doing this in word, but we keep word and PDF versions of these documents.

cylemons|6 months ago

If that stuff is stored as structured metadata, extracting it should be trivial.

diptanu|7 months ago

Yeah we don't handle this yet.

MartinMond|6 months ago

Nutrient.io co-founder here: we've been doing PDF for over 10 years. PDF viewers, like web browsers, have to be liberal in what they accept, because PDF has been around for so long, and as with HTML, people generating files often just iterate until they have something that displays correctly in the one viewer they are testing with.

That's why we built our AI Document Processing SDK (for PDF files) - basically a REST API service: PDF in, structured data out as JSON. With our experience pre-/post-processing all kinds of PDF files on a structural, not just visual, basis, we can beat purely vision-based approaches on cost/performance: https://www.nutrient.io/sdk/ai-document-processing

If you don’t want to suffer the pain of having to deal with figuring this out yourself and instead focus on your actual use case, that’s where we come in.

hobs|6 months ago

Looks super interesting, except there's no pricing on the page that I could find, just "contact sales" - I totally understand wanting a higher-touch sales process, but that's going to bounce some % of eng types who want to try things out but have been bamboozled before.

throwaway4496|7 months ago

This is a parallel to some of the dot-com-peak absurdities. We are at the AI peak now.

spankibalt|7 months ago

> "This is exactly the reason why Computer Vision approaches for parsing PDFs works so well in the real world."

Well, to be fair, in many cases there's no way around it anyway, since the documents in question are only scanned images. The hardest problems I've seen there are narrative-typography art books, department store catalogs with complex text and photo blending, and old city maps.

BrandiATMuhkuh|6 months ago

I started treating everything as images when multimodal LLMs appeared. Even emails. It's so much more robust. Emails especially are often used as a container for a PDF (e.g. a contract) that itself contains an image of a contract that was printed. Very, very common.

I have just moved my company's RAG indexing to images and multimodal embedding. Works pretty well.

hermitcrab|6 months ago

I would like to add the ability to import data tables from PDF documents to my data wrangling software (Easy Data Transform). But I have no intention of coding it myself. Does anyone know of a good library for this? Needs to be:

-callable from C++

-available for Windows and Mac

-free or reasonable 1-time fee

doe88|6 months ago

I was wondering: does your method ultimately produce better parsing than the program you used to initially parse and display the PDF? Or is the value in unifying the parsing across different input parsers?

jiveturkey|7 months ago

Doesn't rendering to an image require proper parsing of the PDF?

cylemons|6 months ago

PDF is more like a glorified SVG format than a Word format.

It only contains info on how the document should look, but no semantic information like sentences, paragraphs, etc. Just a bag of characters positioned in certain places.
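That "bag of positioned characters" is visible in the page's content stream: just drawing operators that set a font (Tf), move the cursor (Td), and show a string (Tj). A naive extractor that grabs Tj operands recovers the characters but none of the structure (this tiny regex ignores escapes, encodings, and the TJ array operator, which real extractors must handle):

```python
# What a PDF page actually stores: positioning operators, not paragraphs.
# A naive extractor pulls the literal strings passed to the Tj operator.
import re

content_stream = """
BT /F1 12 Tf 72 700 Td (Quarterly) Tj ET
BT /F1 12 Tf 140 700 Td (Report) Tj ET
BT /F1 10 Tf 72 660 Td (Revenue grew) Tj ET
"""

strings = re.findall(r"\((.*?)\)\s*Tj", content_stream)
print(strings)
```

Nothing in the stream says that "Quarterly" and "Report" form one heading, or that "Revenue grew" starts a new paragraph; that has to be inferred from the coordinates.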

throwaway4496|6 months ago

Yes, and don't for a second think this approach of rasterizing and OCRing is sane, let alone a reasonable choice. It is outright absurd.

nurettin|6 months ago

It sounds like a trap Wile E. Coyote would use to catch the Road Runner. Does it really have to be so convoluted?

retinaros|6 months ago

I do the same, but for document search: ColQwen + a VLM like Claude.

jlarocco|6 months ago

How ridiculous.

`mutool convert -o <some-txt-file-name.txt> -F text <somefile.pdf>`

Disclaimer: I work at a company that generates and works with PDFs.

throwaway4496|7 months ago

So you parse PDFs, but also OCR images, to somehow get better results?

Do you know you could just use the parsing engine that renders the PDF to get the output? I mean, why rasterize it, OCR it, and then use AI? Sounds like creating a problem in order to use AI to solve it.

daemonologist|7 months ago

Yes, but a lot of the improvement is coming from layout models and/or multimodal LLMs operating directly on the raster images, as opposed to via classical OCR. This gets better results because the PDF format does not necessarily impart reading order or semantic meaning; the only way to be confident you're reading it like a human would is to actually do so - to render it out.

Another thing is that most document parsing tasks are going to run into a significant volume of PDFs which are actually just a bunch of scans/images of paper, so you need to build this capability anyways.

TL;DR: PDFs are basically steganography

yxhuvud|6 months ago

Well, you clearly haven't parsed a wide variety of PDFs. If you had, you would have been exposed to PDFs that contain only images, or ones that contain embedded text where that text is utter nonsense and doesn't match what is shown on the page when rendered.

And that is before we even get into text structure, because as everyone knows, reading text is easier if things like paragraphs, columns, and tables are preserved in the output. And guess what: if you just use the parsing engine for that, what you get out is a garbled mess.

diptanu|7 months ago

We parse PDFs to convert them to text in a linearized fashion. The content then feeds downstream use cases - search engines, structured extraction, etc.

creatonez|6 months ago

While you're doing this, please also tell people to stop producing PDF files in the first place, so that eventually the number of new PDFs drops to zero. There's no hope for the format ever since manager types decided it is "a way to put paper in the computer" rather than the publishing intermediate format it was actually supposed to be. A vague facsimile of digitization that should never have taken off the way it did.

jeroenhd|6 months ago

PDFs serve their purpose well. Except in some niche open-source Linux tools, they render the same way in every application you open them in, in practically every version of that application - unlike document formats like docx/odf/tex/whatever, which reformat themselves depending on the mood of the computer on the day you open them. And unlike raw image files, you can actually comfortably zoom in and read the text.