top | item 43285060

(no title)

revnode | 1 year ago

So an invoice attached to an email as a PDF is sent digitally ... those unfamiliar with PDF will think text and data extraction is trivial then, but this isn't true. You can have a fully digital, non-image PDF that is vector based and has what looks like text, but doesn't have a single piece of extractable text in it. It's all about how the PDF was generated. Tables can be formatted in a million ways, etc.

Your best bet is to always convert it to an image and OCR it to extract structured data.

discuss

merb|1 year ago

This is simply not true. Maybe it’s easier and you do not need 100% precision. But it is actually possible to extract text and layout of digital pdfs. Else it would be impossible to display it. Of course some people still add image fragments to a pdf, but that practice is basically dying. I did not see a single pdf the last year we‘re it was impossible to extract the layout.