hamsterbooster | 1 year ago
There are some elements that might resemble Dagster, but I believe the challenging part is building validation systems that ensure high accuracy and correct schemas across all kinds of complex PDFs and document edge cases. Over the past few weeks, our engineering team has spent a lot of time developing a vision model robust enough to extract nested tables from documents.
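The kind of schema check the comment describes can be sketched minimally. The field names and rules below are purely illustrative assumptions, not from any real extraction pipeline:

```python
# Hypothetical schema check for fields extracted from an invoice PDF.
# Field names and rules are illustrative, not from a real pipeline.
REQUIRED_FIELDS = {"invoice_number": str, "total": float, "currency": str}

def validate_extraction(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}, "
                          f"got {type(record[field]).__name__}")
    # Domain rule: totals should be non-negative.
    if isinstance(record.get("total"), float) and record["total"] < 0:
        errors.append("total: negative amount")
    return errors

print(validate_extraction({"invoice_number": "INV-17", "total": 99.5, "currency": "EUR"}))  # []
print(validate_extraction({"invoice_number": "INV-17", "total": "99.5"}))  # type + missing-field errors
```

A real system would add per-field formats, cross-field consistency checks (line items summing to the total), and confidence thresholds that route failures to human review.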
visarga | 1 year ago
In critical scenarios, companies won't risk 100% automation; the human stays in the loop, so the cost doesn't go down much.
I work on LLM-based information extraction and use my own evaluation sets; that's how I obtained the 90% score, tested across many document types. It looks like magic when you try an invoice in GPT-4o and skim the outputs, but if you spend 15 minutes you find issues.
Can you risk an OCR error that confuses a dot for a comma sending 1000x more money in a bank transfer, or a medical data extraction going wrong so that someone suffers, because there was no human in the document ingestion pipeline to see what is happening?