
aliljet|18 days ago

This is actually the thing I really desperately need. I'm routinely analyzing contracts that were faxed to me, scanned with monstrously poor resolution, wet signed, all kinds of shit. The big LLM providers choke on this raw input and I burn up the entire context window for 30 pages of text. Understandable evals of the quality of these OCR systems (which are moving wicked fast) would be helpful...

And here's the kicker. I can't afford mistakes. Missing a single character or misinterpreting it could be catastrophic. 4 units vacant? 10 days to respond? Signature missing? Incredibly critical things. I can't find an eval that gives me confidence around this.


daveguy|18 days ago

If your needs are that sensitive, I doubt you'll find anything anytime soon that doesn't require a human in the loop. Even SOTA models only average around 95% accuracy on messy inputs. If that's per-character accuracy (which is how OCR is generally measured), that's 5 errors per 100 characters, i.e. 25+ errors on a page of 100+ words (roughly 500+ characters). If you really can't afford mistakes, you have to treat the OCR output as inaccurate. If you have key fields like "days to respond" and "units vacant", you need to detect their presence specifically, with a bias toward false positives over false negatives, and have a human confirm the OCR output against the source.
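The arithmetic above, plus the "flag key fields, bias toward false positives" idea, can be sketched like this. The accuracy and page-length figures are illustrative, and the key phrases are just examples, not a real contract vocabulary:

```python
import re

# Back-of-envelope: expected character errors at a given OCR accuracy.
def expected_char_errors(accuracy: float, chars_per_page: int) -> float:
    """Expected number of wrong characters on a page."""
    return (1 - accuracy) * chars_per_page

# A page of ~100 words is roughly 500 characters.
errors = expected_char_errors(0.95, 500)
print(round(errors, 1))  # 25.0 -- dozens of errors, not a handful

# Bias toward false positives: flag the page for human review whenever
# a critical phrase appears, rather than trying to decide automatically.
KEY_PATTERNS = [
    r"days?\s+to\s+respond",
    r"units?\s+vacant",
    r"signature",
]

def needs_human_review(ocr_text: str) -> bool:
    """True if any critical field shows up in the OCR output."""
    return any(re.search(p, ocr_text, re.IGNORECASE) for p in KEY_PATTERNS)
```

A looser matcher (fuzzy matching, common OCR confusions like `0`/`O` and `1`/`l`) would push the false-positive rate up further, which is the direction you want here.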

kergonath|18 days ago

> If you really can't afford mistakes you have to consider the OCR inaccurate.

Isn’t this close to the error rate of human transcription for messy input, though? I seem to remember a figure in that ballpark. I think if your use case is this sensitive, then any transcription is suspicious.

coder543|18 days ago

If you want OCR with the big LLM providers, you should probably be passing one page per request. Having the model focus on OCR for only a single page at a time seemed to help a lot in my anecdotal testing a few months ago. You can even pass all the pages in parallel in separate requests, and get the better quality response much faster too.
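A minimal sketch of the one-page-per-request approach, fanning the pages out in parallel. The `ocr_page` function is a stub standing in for whatever single-page vision/OCR request you'd actually make to a provider:

```python
from concurrent.futures import ThreadPoolExecutor

def ocr_page(page: bytes) -> str:
    # Placeholder for a real API call carrying one page image
    # and an OCR prompt; stubbed so the example is self-contained.
    return page.decode("utf-8").upper()

def ocr_document(pages: list[bytes], max_workers: int = 8) -> list[str]:
    """OCR each page in its own request, in parallel, preserving page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so output page N
        # always corresponds to input page N.
        return list(pool.map(ocr_page, pages))

pages = [b"page one", b"page two", b"page three"]
print(ocr_document(pages))
```

Each request only has to hold one page of context, and wall-clock time is roughly one request's latency instead of thirty.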

But, as others said, if you can't afford mistakes, then you're going to need a human in the loop to take responsibility.

staticman2|18 days ago

Gemini Pro 3 seems to be built for handling multiple page PDFs.

I can feed it a multi-page PDF and tell it to convert it to markdown, and it does this well. I don't need to load the pages one at a time as long as I use the PDF format. (This was tested in AI Studio, but I think the API works the same way.)

HPsquared|18 days ago

You could maybe then do a second pass on the whole text (as plain text, not OCR) to look for likely mistakes.

renewiltord|18 days ago

I'm sure you've tried all this, but have you tried inter-rater agreement via multiple attempts on the same LLM vs. different LLMs? Perhaps your system would work better if you ran it through 5 models 3 times each and then highlighted the diffs for a human to adjudicate.
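A rough sketch of the diff-highlighting step, assuming you've already collected the transcription attempts. It compares word-by-word, which only works when the attempts tokenize to the same length; a real pipeline would need an alignment step (e.g. `difflib.SequenceMatcher`) first:

```python
def disagreements(transcripts: list[str]) -> list[tuple[int, list[str]]]:
    """Word positions where the transcription attempts don't all agree.

    Returns (position, [word from each attempt]) for each disputed slot,
    so a human reviewer only has to look at the disagreements.
    """
    split = [t.split() for t in transcripts]
    n = min(len(s) for s in split)  # naive: assumes near-identical tokenization
    out = []
    for i in range(n):
        words = [s[i] for s in split]
        if len(set(words)) > 1:
            out.append((i, words))
    return out

attempts = [
    "4 units vacant 10 days to respond",
    "4 units vacant 10 days to respond",
    "4 units vacant 16 days to respond",  # one attempt misread "10"
]
for pos, words in disagreements(attempts):
    print(pos, words)
```

With 15 attempts (5 models x 3 runs), unanimous slots are probably safe and anything disputed goes to the human chooser.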

chrsw|18 days ago

I'm keeping my eye on progress in this area as well. I need to free engineering design data from tens of thousands of PDF pages and make them easily and quickly accessible to LLMs.

aliljet|18 days ago

All of healthcare is crying. Trust me.

cinntaile|18 days ago

Deciphering fax messages? What is this, the 90s?

kergonath|18 days ago

We have decades of internal reports on film that we’d like to make accessible and searchable. We don’t do it with new documents, but we have a huge backlog.

xyproto|18 days ago

Fax is still hard to hack, so some organizations have kept it alive for security.