top | item 46976224

(no title)

coder543 | 18 days ago

There are a bunch of new OCR models.

I’ve also heard very good things about these two in particular:

- LightOnOCR-2-1B: https://huggingface.co/lightonai/LightOnOCR-2-1B

- PaddleOCR-VL-1.5: https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.5

The OCR leaderboards I’ve seen leave a lot to be desired.

With the rapid release of so many of these models, I wish there were a better way to know which ones are actually the best.

I also feel like most/all of these models don’t handle charts, other than to maybe include a link to a cropped image. It would be nice for the OCR model to also convert charts into markdown tables, but this is obviously challenging.

discuss

philipkglass|18 days ago

I have been trying to catch up with recent OCR developments too. My documents have enough special requirements that public benchmarks didn't tell me enough to decide. Instead I'm building a small document OCR project with visualization tools for comparing bounding boxes, extracted text, region classification, etc. GLM-OCR is my favorite so far [1]. Apple's VisionKit is very good at text recognition, and fast, but it doesn't do high level layout detection and it only works on Apple hardware. It's another useful source of data for cross-validation if you can run it.

This project has been pretty easy to build with agentic coding. It's a Frankenstein monster of glue code and handling my particular domain requirements, so it's not suitable for public release. I'd encourage some rapid prototyping after you've spent an afternoon catching up on what's new. I did a lot of document OCR and post-processing with commercial tools and custom code 15 years ago. The advent of small local VLMs has made it practical to achieve higher accuracy and more domain customization than I would have previously believed.

[1] If you're building an advanced document processing workflow, be sure to read the post-processing code in the GLM code repo. They're doing some non-trivial logic to fuse layout areas and transform text for smooth reading. You probably want to store the raw model results and customize your own post-processing for uncommon languages or uncommon domain vocabulary. Layout is also easier to validate if you bypass their post-processing; it can make some combined areas "disappear" from the layout data.

dweekly|18 days ago

I'm going to be the obnoxious person who asks you to please create this leaderboard because you care and have a modicum of knowledge in this space.

StableAlkyne|18 days ago

How do these compare to something like Tesseract?

I remember that one clearing the scoreboard for many years, and usually it's the one I grab for OCR needs due to its reputation.

kergonath|18 days ago

Tesseract does not understand layout. It’s fine for character recognition, but if I still have to pipe the output to a LLM to make sense of the layout and fix common transcription errors, I might as well use a single model. It’s also easier for a visual LLM to extract figures and tables in one pass.

chaps|18 days ago

Tesseract v4 when it was released was exceptionally good and blew everything out of the water. Have used it to OCR millions of pages. Tbh, I miss the simplicity of tesseract.

The new models are similarly better compared to tesseract v4. But what I'll say is that don't expect new models to be a panacea for your OCR problems. The edge case problems that you might be trying to solve (like, identifying anchor points, or identifying shared field names across documents) are still pretty much all problematic still. So you should still expect things like random spaces or unexpected characters to jam up your jams.

Also some newer models tend to hallucinate incredibly aggressively. If you've ever seen an LLM get stuck in an infinite, think of that.

unknown|18 days ago

[deleted]

sgc|18 days ago

The best leader board I have used is ocrarena.ai. I agree it is not detailed enough. I wish people could rate what part of the ocr went well or bad (layout, text recognition, etc). However, my more specific results using custom prompts and my own images on their playground page are relatively closely aligned with the rankings as others have voted.

What more are you looking for?

mixedmath|18 days ago

Are there leaderboards that you follow or trust?

Also, do you have preferred OCR models in your experience? I've had some success with dots.OCR, but I'm only beginning to need to work with OCR.

coder543|18 days ago

> Are there leaderboards that you follow or trust?

Not for OCR.

Regardless of how much some people complain about them, I really do appreciate the effort Artificial Analysis puts into consistently running standardized benchmarks for LLMs, rather than just aggregating unverified claims from the AI labs.

I don't think LMArena is that amazing at this point in time, but at least they provide error bars on the ELO and give models the same rank number when they're overlapping.

> Also, do you have preferred OCR models in your experience?

It's a subject I'm interested in, but I don't have enough experience to really put out strong opinions on specific models.

noahjohannessen|18 days ago

is https://www.ocrarena.ai/ not accurate?

fzysingularity|18 days ago

ELO scores for OCR don't really make much sense - it's trying to reduce accuracy to a single voting score without any real quality-control on the reviewer/judge.

I think a more accurate reflection of the current state of comparisons would be a real-world benchmark with messy/complex docs across industries, languages.

coder543|18 days ago

It is missing both models that I mentioned, so yes, I would say one reason it is not accurate is because it is so incomplete.

It also doesn't provide error bars on the ELO, so models that only have tens of battles are being listed alongside models that have thousands of battles with no indication of how confident those ELOs are, which I find rather unhelpful.

A lot of these models are also sensitive to how they are used, and offer multiple ways to be used. It's not clear how they are being invoked.

That leaderboard is definitely one of the ones that leaves a lot to be desired.