So an invoice attached to an email as a PDF is sent digitally ... those unfamiliar with PDF will think text and data extraction is trivial then, but this isn't true. You can have a fully digital, non-image PDF that is vector based and has what looks like text, but doesn't have a single piece of extractable text in it. It's all about how the PDF was generated. Tables can be formatted in a million ways, etc.Your best bet is to always convert it to an image and OCR it to extract structured data.
merb|1 year ago