cxie | 1 year ago
Has anyone tried this on specialized domains like medical or legal documents? The benchmarks are promising, but OCR has always faced challenges with domain-specific terminology and formatting.
Also interesting to see the pricing model ($1/1000 pages) in a landscape where many expected this functionality to eventually be bundled into base LLM offerings. This feels like a trend where previously encapsulated capabilities are being unbundled into specialized APIs with separate pricing.
I wonder if this is the beginning of the componentization of AI infrastructure - breaking monolithic models into specialized services that each do one thing extremely well.
themanmaran | 1 year ago
From our last benchmark run, some of these numbers from Mistral seem a little bit optimistic. Side by side of a few models:
model  | omni | mistral
gemini | 86%  | 89%
azure  | 85%  | 89%
gpt-4o | 75%  | 89%
google | 68%  | 83%
Currently adding the Mistral API and we'll get results out today!
[1] https://github.com/getomni-ai/benchmark
[2] https://huggingface.co/datasets/getomni-ai/ocr-benchmark
themanmaran | 1 year ago
Mistral OCR:
- 72.2% accuracy
- $1/1000 pages
- 5.42s / page
Which is a pretty far cry from the 95% accuracy they were advertising from their private benchmark. The biggest thing I noticed is how it skips anything it classifies as an image/figure. So charts, infographics, some tables, etc. all get lifted out and returned as [image](image_002). Compare that to the other VLMs, which are able to interpret those images into a text representation.
https://github.com/getomni-ai/benchmark
https://huggingface.co/datasets/getomni-ai/ocr-benchmark
https://getomni.ai/ocr-benchmark
epolanski | 1 year ago
We have millions and millions of pages of documents, and even a 1% error rate compounds with the AI's own errors, which compound with the documentation itself being incorrect at times. That leaves the whole thing nowhere near production ready (and indeed the project has never been released).
We simply cannot afford to give our customers incorrect information.
We have set up a back-office app so that when users have questions, the question is sent to our workers along with the response given by our AI application, and a person can review it and, ideally, correct the OCR output.
Honestly, after a year of working on this, it feels like AI right now is only useful when supervised all the time (such as when coding). Otherwise I just find LLMs too unreliable for anything beyond basic tasks.
PeterStuer | 1 year ago
If nobody supervised buildings all the time during construction, every house would be a pile of rubble. And even when you do, stuff still creeps in and has to be redone, often more than once.
janalsncm | 1 year ago
It would almost be easier to switch everyone to a common format and spell out important entities (names, numbers) multiple times, similar to how cheques do.
The utility of the system really depends on the makeup of that last 5%. If problematic documents are consistently predictable, it’s possible to do a second pass with humans. But if they’re random, then you have to do every doc with humans and it doesn’t save you any time.
kbyatnal | 1 year ago
IMO there's still a large gap for businesses in going from raw OCR outputs -> document processing deployed in prod for mission-critical use cases.
e.g. you still need to build and label datasets, orchestrate pipelines (classify -> split -> extract), detect uncertainty and correct with human-in-the-loop, fine-tune, and a lot more. You can certainly get close to full automation over time, but it's going to take time and effort.
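A minimal sketch of that kind of pipeline, with a confidence gate routing low-confidence extractions to human review. All function names and the keyword-based classifier are hypothetical stand-ins, not a real API:

```python
# Hypothetical classify -> split -> extract pipeline with a
# human-in-the-loop gate on low-confidence extractions.
from dataclasses import dataclass

@dataclass
class Extraction:
    fields: dict
    confidence: float

def classify(page: str) -> str:
    # stand-in: route by keyword instead of a trained classifier
    return "invoice" if "invoice" in page.lower() else "other"

def extract(page: str) -> Extraction:
    # stand-in: a real system would call an OCR/LLM extractor here
    return Extraction(fields={"raw": page}, confidence=0.8)

def process(pages: list[str], review_threshold: float = 0.9):
    results, needs_review = [], []
    for page in pages:
        doc_type = classify(page)
        result = extract(page)
        # anything below the threshold goes to a human reviewer
        (results if result.confidence >= review_threshold
         else needs_review).append((doc_type, result))
    return results, needs_review

done, queued = process(["Invoice #123 ...", "misc scan"])
```

The threshold is the tuning knob: raise it and more documents hit the review queue but fewer errors slip through.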
But for RAG and other use cases where the error tolerance is higher, I do think these OCR models will get good enough to just solve that part of the problem.
Disclaimer: I started a LLM doc processing company to help companies solve problems in this space (https://extend.app/)
kergonath | 1 year ago
I’ll try it on a whole bunch of scientific papers ASAP. Quite excited about this.
janis1234 | 1 year ago
Rent and Reserve NVIDIA A100 GPU 80GB - Pricing Starts from $1.35/hour
I just don't know if, in 1 hour with an A100, I can process more than 1000 pages. I'm guessing yes.
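Back-of-envelope, if you assume the 5.42 s/page API latency quoted upthread also holds for a self-hosted A100 (a big assumption; local throughput could differ either way):

```python
# Can a rented A100 beat the $1 / 1000-pages API price?
# Assumes the 5.42 s/page latency quoted upthread; a self-hosted
# setup with batching could be faster or slower.
gpu_cost_per_hour = 1.35   # USD, A100 80GB rental
seconds_per_page = 5.42    # per-page latency from the benchmark above

pages_per_hour = 3600 / seconds_per_page
cost_per_1000_pages = gpu_cost_per_hour / pages_per_hour * 1000

print(f"{pages_per_hour:.0f} pages/hour")            # ~664
print(f"${cost_per_1000_pages:.2f} per 1000 pages")  # ~$2.03
```

Under that assumption the API is actually cheaper, so you'd need batching to push well past one page per 5.42 s for self-hosting to win on cost alone.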
amelius | 1 year ago
There are about 47 characters on average in a sentence. So does this mean it gets around 2 or 3 mistakes per sentence?
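Assuming the accuracy figure is per character (the thread doesn't say; benchmarks often measure edit distance or field-level accuracy instead), the arithmetic would be:

```python
# If "95% accurate" means 95% per character (an assumption),
# expected errors per sentence is just chars * error rate.
chars_per_sentence = 47
char_error_rate = 0.05  # 1 - 0.95

expected_errors = chars_per_sentence * char_error_rate
print(expected_errors)  # 2.35 errors per sentence on average
```

So yes, roughly 2-3 mistakes per sentence under that reading, which is why character-level accuracy numbers in the 90s can still be unusable for exact-transcription use cases.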
cxie | 1 year ago
Instead of one massive model trying to do everything, you'd have specialized models for OCR, code generation, image understanding, etc. Then a "router LLM" would direct queries to the appropriate specialized model and synthesize responses.
The efficiency gains could be substantial - why run a 1T parameter model when your query just needs a lightweight OCR specialist? You could dynamically load only what you need.
The challenge would be in the communication protocol between models and managing the complexity. We'd need something like a "prompt bus" for inter-model communication with standardized inputs/outputs.
Has anyone here started building infrastructure for this kind of model orchestration yet? This feels like it could be the Kubernetes moment for AI systems.
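A toy version of that routing layer might look like the sketch below. Everything here is hypothetical: the handler names are made up, and a crude keyword match stands in for what would really be a small classifier or LLM doing the routing:

```python
# Toy "router" dispatching a query to a specialized handler.
# The routing heuristic is a keyword match purely for illustration;
# a real router would itself be a lightweight model.
from typing import Callable

def ocr_specialist(query: str) -> str:
    return f"[ocr model] {query}"

def code_specialist(query: str) -> str:
    return f"[code model] {query}"

def general_model(query: str) -> str:
    return f"[general model] {query}"

ROUTES: dict[str, Callable[[str], str]] = {
    "ocr": ocr_specialist,
    "code": code_specialist,
}

def route(query: str) -> str:
    # crude keyword routing standing in for a learned router
    for keyword, handler in ROUTES.items():
        if keyword in query.lower():
            return handler(query)
    return general_model(query)

print(route("Run OCR on this scanned invoice"))
```

The hard parts the comment points at, a standardized inter-model protocol and response synthesis, are exactly what this sketch leaves out.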