item 37589177

gettalong | 2 years ago

So, I don't think that the first two steps (converting the PDF page to an image and extracting the text from it) are necessary. One could just use the basic information in the PDF content stream to get the bounding box for each character. The resulting information can then still be analyzed using the layout analysis algorithm mentioned as step 3.
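For illustration, here is a toy sketch (not the commenter's actual tooling) of where those positions live in a content stream. The fragment below is hand-written and the regex is deliberately naive; real PDFs need a full parser plus font metrics to turn text origins into per-character bounding boxes (libraries such as pdfminer.six expose exactly that).

```python
import re

# Hand-written toy content stream: each BT/ET text block sets a text
# matrix (Tm) whose last two numbers are the text origin in points,
# then shows a string with the Tj operator.
stream = b"""
BT /F1 12 Tf 1 0 0 1 72 700 Tm (Hello) Tj ET
BT /F1 12 Tf 1 0 0 1 72 686 Tm (World) Tj ET
"""

# Naive pattern for this toy input only; real streams use many more
# operators, compressed streams, and non-trivial string encodings.
pattern = re.compile(rb"1 0 0 1 (\d+) (\d+) Tm \((.*?)\) Tj")

for x, y, text in pattern.findall(stream):
    # Origin in PDF user space (1/72 inch units): exact, no rasterizing.
    print(int(x), int(y), text.decode("latin-1"))
```

The point is that the coordinates are already present as numbers in the stream, so no rendering step is needed to recover them.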

The information would also be more exact, since character positions extracted from an image depend on how the PDF was rendered (e.g. whether an A4 page was rasterized at 300 ppi, 600 ppi, or higher).
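To make the precision point concrete, here is a small sketch (my own hypothetical numbers, not from the thread): PDF user space is measured in points (1/72 inch), and two glyph origins that are distinct in the content stream can land on the same pixel at 300 ppi while remaining distinguishable at 600 ppi.

```python
def pt_to_px(x_pt: float, ppi: int) -> int:
    """Map a PDF user-space coordinate (points, 1/72 inch) to a pixel index."""
    return round(x_pt * ppi / 72)

# Two glyph origins 0.1 pt apart (hypothetical values):
print(pt_to_px(100.3, 300), pt_to_px(100.4, 300))  # same pixel at 300 ppi
print(pt_to_px(100.3, 600), pt_to_px(100.4, 600))  # distinct pixels at 600 ppi
```

Working from the content stream keeps the original sub-pixel coordinates instead of whatever the rasterizer quantized them to.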

dubbid | 2 years ago

The idea is to generally handle scanned documents as well. Besides, text boxes can sometimes get so distorted with whitespace that they look very different to a computer than they do in real life.

In practice, you are right that this would be more efficient in many cases (not scanned, no weird whitespace). But the cost of OCR is so low compared to the LLM costs, and the relative consistency of OCR output helps so much, that I don't try to handle the PDF object extraction at all.

gettalong | 2 years ago

Fair point :) And yes, some PDFs use weird ways to represent the spacing between words.