Needing OCR, a VLM, or an LLM for such important use cases seems like a problem we should not have in 2025.

The real solution would be to have machine readable data embedded in those PDFs, and have the table be built around that data.

We could then have actual machine-readable financial statements or reports, much like our passports.
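
To make the passport comparison concrete: the machine-readable zone on a passport protects individual fields with a check digit, so a reader can detect a misread character without any model in the loop. Below is a minimal sketch of that rule as I understand it from ICAO Doc 9303 (weights 7, 3, 1 repeating; digits keep their value, A-Z map to 10-35, the filler `<` counts as 0). The function name `mrz_check_digit` is my own, and the sample values are the ones I recall from the ICAO specimen document.

```python
def mrz_check_digit(field: str) -> int:
    """Check digit for one MRZ field: weighted sum (7, 3, 1) modulo 10."""
    weights = (7, 3, 1)
    total = 0
    for i, ch in enumerate(field):
        if "0" <= ch <= "9":
            value = int(ch)
        elif "A" <= ch <= "Z":
            value = ord(ch) - ord("A") + 10  # A=10 ... Z=35
        elif ch == "<":
            value = 0  # filler character
        else:
            raise ValueError(f"invalid MRZ character: {ch!r}")
        total += value * weights[i % 3]
    return total % 10


print(mrz_check_digit("L898902C3"))  # document number -> 6
print(mrz_check_digit("740812"))     # date of birth   -> 2
print(mrz_check_digit("120415"))     # date of expiry  -> 9
```

A financial statement with its figures embedded in a similarly self-checking form would let a reader verify the data instead of guessing at it.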

bayindirh|1 year ago

The problem is that these PDFs start out as paper, and OCR is the step where you add that data.

While the world has become much more digitized (for example, for any sale, I get a PDF and an XML version of my receipt, which is great), not everything originates on a computer, and much of it is made for humans to read.
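
To show why the XML copy is so convenient, here is a rough sketch of reading one with nothing but the Python standard library. The receipt layout (`receipt`, `item`, `total` and their attributes) is entirely made up for illustration; real e-invoice schemas are far richer.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical receipt layout, invented for this example.
RECEIPT_XML = """\
<receipt currency="EUR">
  <item name="Coffee" quantity="2" unit_price="3.50"/>
  <item name="Bagel" quantity="1" unit_price="2.25"/>
  <total>9.25</total>
</receipt>
"""


def read_receipt(xml_text: str):
    """Return (line items, stated total) after checking that they agree."""
    root = ET.fromstring(xml_text)
    items = [
        (item.get("name"), int(item.get("quantity")), Decimal(item.get("unit_price")))
        for item in root.findall("item")
    ]
    stated_total = Decimal(root.findtext("total"))
    computed = sum(qty * price for _, qty, price in items)
    if computed != stated_total:
        raise ValueError(f"total mismatch: {computed} != {stated_total}")
    return items, stated_total


items, total = read_receipt(RECEIPT_XML)
print(total)  # 9.25
```

No OCR, no layout guessing, and the totals can be checked exactly because the numbers arrive as numbers.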

We have handwritten notes, printed documents, etc., and OCR has to solve this. On the other hand, desktop OCR applications like Prizmo and the latest versions of macOS already have much better output quality than these models. There are also specialized free applications to extract tables from PDF files (PDF files are mostly a bunch of fonts and pixels; unless they are tagged, they carry no information about layout, tables, etc.).
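
Since a page like that is essentially positioned text, table extractors have to infer rows and columns from coordinates. This is a toy sketch of the idea, not how any particular tool works: it assumes words arrive as `(x, y, text)` tuples with `y` growing downward, and the function name `rows_from_words` and the `y_tolerance` parameter are my own.

```python
def rows_from_words(words, y_tolerance=2.0):
    """Group (x, y, text) tuples into rows by similar y, then order each row by x."""
    rows = []  # each entry: (y of the row's first word, [(x, text), ...])
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        if rows and abs(y - rows[-1][0]) <= y_tolerance:
            rows[-1][1].append((x, text))  # same visual line
        else:
            rows.append((y, [(x, text)]))  # start a new line
    return [[text for _, text in sorted(cells)] for _, cells in rows]


# Words in arbitrary order, with the slight vertical jitter typical of real pages.
words = [
    (50, 100.0, "Item"), (200, 100.4, "Amount"),
    (200, 120.3, "7.00"), (50, 120.0, "Coffee"),
    (50, 140.0, "Bagel"), (200, 139.8, "2.25"),
]
table = rows_from_words(words)
print(table)  # [['Item', 'Amount'], ['Coffee', '7.00'], ['Bagel', '2.25']]
```

Real extractors also have to deal with merged cells, wrapped text and ruled lines, which is why dedicated tools exist for this.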

We have these tools, and they work well. There's even the venerable Tesseract, built to OCR scanned paper, which has had a neural network layer for years. Yet we still try to throw LLMs at everything, cheer like 5-year-olds when they do 20% of what these systems do, and act as if this technology hasn't existed for two decades.

helloguillecl|1 year ago

The funny thing is that sometimes we need to machine-read documents produced by humans on machines, but the actual source is almost always machine-readable data.

Agree on the hand-written part.

advisedwang|1 year ago

A lot of times you are OCRing documents from people who do not care about how easy it is for the reader to extract data. A common example is regulatory filings - the goal is to comply with the law, not help people read your data. Or perhaps it's from a source that sells the data or holds the copyright and doesn't want to make it easy for other people to use it in ways they didn't intend, and so on.