cxie | 1 year ago
Has anyone tried this on specialized domains like medical or legal documents? The benchmarks are promising, but OCR has always faced challenges with domain-specific terminology and formatting.
Also interesting to see the pricing model ($1/1000 pages) in a landscape where many expected this functionality to eventually be bundled into base LLM offerings. This feels like a trend where previously encapsulated capabilities are being unbundled into specialized APIs with separate pricing.
I wonder if this is the beginning of the componentization of AI infrastructure - breaking monolithic models into specialized services that each do one thing extremely well.
themanmaran | 1 year ago
From our last benchmark run, some of these numbers from Mistral seem a little bit optimistic. Side by side of a few models:
model  | omni | mistral
gemini | 86%  | 89%
azure  | 85%  | 89%
gpt-4o | 75%  | 89%
google | 68%  | 83%
Currently adding the Mistral API and we'll get results out today!
[1] https://github.com/getomni-ai/benchmark
[2] https://huggingface.co/datasets/getomni-ai/ocr-benchmark
themanmaran | 1 year ago
Mistral OCR:
- 72.2% accuracy
- $1/1000 pages
- 5.42s / page
Which is a pretty far cry from the 95% accuracy they were advertising from their private benchmark. The biggest thing I noticed is how it skips anything it classifies as an image/figure. So charts, infographics, some tables, etc. all get lifted out and returned as [image](image_002). Compare that to the other VLMs, which are able to interpret those images into a text representation.
https://github.com/getomni-ai/benchmark
https://huggingface.co/datasets/getomni-ai/ocr-benchmark
https://getomni.ai/ocr-benchmark
epolanski | 1 year ago
We have millions and millions of pages of documents, and even a 1% error rate compounds with the AI's own errors, which compound with the documentation itself being incorrect at times. That leaves the whole thing nowhere near production ready (and indeed the project has never been released).
We simply cannot afford to give our customers incorrect information.
We have set up a back-office app so that when users have questions, the question is sent to our workers along with the response given by our AI application, and a person can review it and, ideally, correct the OCR output.
Honestly, after a year of working on this, it feels like AI right now is only useful when supervised all the time (such as when coding). Otherwise I just find LLMs too unreliable for anything beyond basic tasks.
PeterStuer | 1 year ago
If nobody supervised buildings all the time during construction, every house would be a pile of rubble. And even when you do, stuff still creeps in and has to be redone, often more than once.
janalsncm | 1 year ago
It would almost be easier to switch everyone to a common format and spell out important entities (names, numbers) multiple times, similar to how cheques do.
The utility of the system really depends on the makeup of that last 5%. If problematic documents are consistently predictable, it’s possible to do a second pass with humans. But if they’re random, then you have to do every doc with humans and it doesn’t save you any time.
kbyatnal | 1 year ago
IMO there's still a large gap for businesses in going from raw OCR outputs -> document processing deployed in prod for mission-critical use cases.
e.g. you still need to build and label datasets, orchestrate pipelines (classify -> split -> extract), detect uncertainty and correct with human-in-the-loop, fine-tune, and a lot more. You can certainly get close to full automation over time, but it's going to take time and effort.
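A minimal sketch of that kind of pipeline, with a confidence gate routing low-confidence extractions to human review. All function names and the keyword-based classifier are hypothetical stand-ins, not a real API:

```python
# Hypothetical classify -> split -> extract pipeline with a
# human-in-the-loop gate on low-confidence extractions.
from dataclasses import dataclass

@dataclass
class Extraction:
    fields: dict
    confidence: float

def classify(page: str) -> str:
    # stand-in: route by keyword instead of a trained classifier
    return "invoice" if "invoice" in page.lower() else "other"

def extract(page: str) -> Extraction:
    # stand-in: a real system would call an OCR/LLM extractor here
    return Extraction(fields={"raw": page}, confidence=0.8)

def process(pages: list[str], review_threshold: float = 0.9):
    results, needs_review = [], []
    for page in pages:
        doc_type = classify(page)
        result = extract(page)
        # anything below the threshold goes to a human reviewer
        (results if result.confidence >= review_threshold
         else needs_review).append((doc_type, result))
    return results, needs_review

done, queued = process(["Invoice #123 ...", "misc scan"])
```

The threshold is the tuning knob: raise it and more documents hit the review queue but fewer errors slip through.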
But for RAG and other use cases where the error tolerance is higher, I do think these OCR models will get good enough to just solve that part of the problem.
Disclaimer: I started a LLM doc processing company to help companies solve problems in this space (https://extend.app/)
kergonath | 1 year ago
I’ll try it on a whole bunch of scientific papers ASAP. Quite excited about this.
janis1234 | 1 year ago
Rent and Reserve NVIDIA A100 GPU 80GB - Pricing Starts from $1.35/hour
I just don't know if, in 1 hour with an A100, I can process more than 1000 pages. I'm guessing yes.
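Back-of-envelope, if you assume the 5.42 s/page API latency quoted upthread also holds for a self-hosted A100 (a big assumption; local throughput could differ either way):

```python
# Can a rented A100 beat the $1 / 1000-pages API price?
# Assumes the 5.42 s/page latency quoted upthread; a self-hosted
# setup with batching could be faster or slower.
gpu_cost_per_hour = 1.35   # USD, A100 80GB rental
seconds_per_page = 5.42    # per-page latency from the benchmark above

pages_per_hour = 3600 / seconds_per_page
cost_per_1000_pages = gpu_cost_per_hour / pages_per_hour * 1000

print(f"{pages_per_hour:.0f} pages/hour")            # ~664
print(f"${cost_per_1000_pages:.2f} per 1000 pages")  # ~$2.03
```

Under that assumption the API is actually cheaper, so you'd need batching to push well past one page per 5.42 s for self-hosting to win on cost alone.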
amelius | 1 year ago
There are about 47 characters on average in a sentence. So does this mean it gets around 2 or 3 mistakes per sentence?
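Assuming the accuracy figure is per character (the thread doesn't say; benchmarks often measure edit distance or field-level accuracy instead), the arithmetic would be:

```python
# If "95% accurate" means 95% per character (an assumption),
# expected errors per sentence is just chars * error rate.
chars_per_sentence = 47
char_error_rate = 0.05  # 1 - 0.95

expected_errors = chars_per_sentence * char_error_rate
print(expected_errors)  # 2.35 errors per sentence on average
```

So yes, roughly 2-3 mistakes per sentence under that reading, which is why character-level accuracy numbers in the 90s can still be unusable for exact-transcription use cases.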
cxie | 1 year ago
Instead of one massive model trying to do everything, you'd have specialized models for OCR, code generation, image understanding, etc. Then a "router LLM" would direct queries to the appropriate specialized model and synthesize responses.
The efficiency gains could be substantial - why run a 1T parameter model when your query just needs a lightweight OCR specialist? You could dynamically load only what you need.
The challenge would be in the communication protocol between models and managing the complexity. We'd need something like a "prompt bus" for inter-model communication with standardized inputs/outputs.
Has anyone here started building infrastructure for this kind of model orchestration yet? This feels like it could be the Kubernetes moment for AI systems.
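A toy version of that routing layer might look like the sketch below. Everything here is hypothetical: the handler names are made up, and a crude keyword match stands in for what would really be a small classifier or LLM doing the routing:

```python
# Toy "router" dispatching a query to a specialized handler.
# The routing heuristic is a keyword match purely for illustration;
# a real router would itself be a lightweight model.
from typing import Callable

def ocr_specialist(query: str) -> str:
    return f"[ocr model] {query}"

def code_specialist(query: str) -> str:
    return f"[code model] {query}"

def general_model(query: str) -> str:
    return f"[general model] {query}"

ROUTES: dict[str, Callable[[str], str]] = {
    "ocr": ocr_specialist,
    "code": code_specialist,
}

def route(query: str) -> str:
    # crude keyword routing standing in for a learned router
    for keyword, handler in ROUTES.items():
        if keyword in query.lower():
            return handler(query)
    return general_model(query)

print(route("Run OCR on this scanned invoice"))
```

The hard parts the comment points at, a standardized inter-model protocol and response synthesis, are exactly what this sketch leaves out.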