The OCR leaderboards I’ve seen leave a lot to be desired.
With the rapid release of so many of these models, I wish there were a better way to know which ones are actually the best.
I also feel like most, if not all, of these models don’t handle charts beyond maybe including a link to a cropped image. It would be nice for an OCR model to also convert charts into markdown tables, but that is obviously challenging.
I have been trying to catch up with recent OCR developments too. My documents have enough special requirements that public benchmarks didn't tell me enough to decide. Instead I'm building a small document OCR project with visualization tools for comparing bounding boxes, extracted text, region classification, etc. GLM-OCR is my favorite so far [1]. Apple's VisionKit is very good at text recognition, and fast, but it doesn't do high level layout detection and it only works on Apple hardware. It's another useful source of data for cross-validation if you can run it.
This project has been pretty easy to build with agentic coding. It's a Frankenstein monster of glue code and handling my particular domain requirements, so it's not suitable for public release. I'd encourage some rapid prototyping after you've spent an afternoon catching up on what's new. I did a lot of document OCR and post-processing with commercial tools and custom code 15 years ago. The advent of small local VLMs has made it practical to achieve higher accuracy and more domain customization than I would have previously believed.
[1] If you're building an advanced document processing workflow, be sure to read the post-processing code in the GLM code repo. They're doing some non-trivial logic to fuse layout areas and transform text for smooth reading. You probably want to store the raw model results and customize your own post-processing for uncommon languages or uncommon domain vocabulary. Layout is also easier to validate if you bypass their post-processing; it can make some combined areas "disappear" from the layout data.
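For illustration, here is a minimal sketch of the kind of custom post-processing I mean, assuming the raw layout output is a list of regions with bounding boxes and extracted text (the field names and tolerance are hypothetical, not GLM's actual schema):

```python
# Sketch of custom reading-order post-processing over raw OCR layout
# output. Assumes each region is a dict with a bounding box
# (x0, y0, x1, y1) and its extracted text; field names are made up.

def reading_order(regions, line_tolerance=20):
    """Sort layout regions top-to-bottom, then left-to-right, grouping
    regions whose vertical centers fall within `line_tolerance` pixels
    into the same visual row."""
    def center_y(r):
        return (r["bbox"][1] + r["bbox"][3]) / 2

    rows = []
    for region in sorted(regions, key=center_y):
        for row in rows:
            if abs(center_y(region) - center_y(row[0])) <= line_tolerance:
                row.append(region)
                break
        else:
            rows.append([region])

    ordered = []
    for row in rows:
        ordered.extend(sorted(row, key=lambda r: r["bbox"][0]))
    return ordered

regions = [
    {"bbox": (300, 10, 500, 40), "text": "Column B"},
    {"bbox": (10, 12, 200, 42), "text": "Column A"},
    {"bbox": (10, 100, 500, 130), "text": "Body text"},
]
print([r["text"] for r in reading_order(regions)])
# → ['Column A', 'Column B', 'Body text']
```

Owning this step yourself means no region can silently "disappear" in a fusion pass you can't inspect.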
The best leaderboard I have used is ocrarena.ai. I agree it is not detailed enough; I wish people could rate which parts of the OCR went well or badly (layout, text recognition, etc.). That said, my more specific results using custom prompts and my own images on their playground page align fairly closely with the rankings as others have voted.
So many OCR models have been released in the past few months, all VLMs, and yet none of them handle Korean well. Every time I try with a random screenshot (not an A4 document), they just fail at a "simple" task. Funnily enough, Qwen3 8B VL is the model that usually gets it right (although I couldn't get the bounding boxes quite right). Even funnier, whatever runs locally on an iPhone's CPU is insanely good, as is Google's OCR API. I don't know why we don't get more of the traditional OCR stuff; PaddlePaddle v5 is the closest I could find. At this point, I feel like I might be doing something wrong with these VLMs.
Chrome ships a local OCR model for text extraction from PDFs that is better than any of the VLM or open-source OCR models I've tried. I had a few hundred gigs of old newspaper scans, and after trying all the other options I ended up building a wrapper around the DLL it uses to get the text and bboxes. Performance and accuracy are on another level compared to Tesseract, and while VLMs sometimes produced good results, they just seemed unreliable.
I've thought of open-sourcing the wrapper but haven't gotten around to it yet. I bet Claude Code can build a functioning prototype if you just point it to the "screen_ai" dir under Chrome's user data.
I remember someone building a meme search engine for millions of images using a cluster of used iPhone SEs because of Apple's very good and fast OCR capabilities.
Quite an interesting read as well:
https://news.ycombinator.com/item?id=34315782
This is actually the thing I really desperately need. I'm routinely analyzing contracts that were faxed to me, scanned at monstrously poor resolution, wet-signed, all kinds of shit. The big LLM providers choke on this raw input, and I burn up the entire context window for 30 pages of text. Understandable evals of the quality of these OCR systems (which are moving wicked fast) would be helpful...
And here's the kicker. I can't afford mistakes. Missing a single character or misinterpreting it could be catastrophic. 4 units vacant? 10 days to respond? Signature missing? Incredibly critical things. I can't find an eval that gives me confidence around this.
If your needs are that sensitive, I doubt you'll find anything anytime soon that doesn't require a human in the loop. Even SOTA models only average 95% accuracy on messy inputs. If that's per-character accuracy (which is how OCR is generally measured), a 100+ word page of 500+ characters will contain 25+ errors. If you really can't afford mistakes, you have to treat the OCR as inaccurate. If you have key fields like "days to respond" and "units vacant", you need to detect their presence specifically, with a bias toward false positives (over false negatives), and have a human confirm the OCR against the source.
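Something like this sketch, with made-up field names and deliberately loose patterns, is the shape of what I mean by biasing toward false positives: over-match (even tolerating common OCR digit confusions) and flag anything missing for human review.

```python
import re

# Hypothetical sketch: flag critical fields in OCR'd contract text for
# human review. Patterns are deliberately loose so they over-match
# (false positives) rather than silently miss (false negatives).

CRITICAL_PATTERNS = {
    # Loose digit matching tolerates common OCR confusions (O/0, l/1).
    "days_to_respond": re.compile(
        r"(\d+|[OolI]\d*|\d*[OolI])\s*days?\s+to\s+respond", re.I),
    "units_vacant": re.compile(
        r"(\d+|[OolI]\d*|\d*[OolI])\s*units?\s+vacant", re.I),
    "signature": re.compile(r"sign(ature|ed)?", re.I),
}

def flag_for_review(page_text):
    """Return every critical-pattern hit, plus an explicit marker when
    an expected field is absent (a possible OCR miss)."""
    findings = []
    for field, pattern in CRITICAL_PATTERNS.items():
        hits = pattern.findall(page_text)
        if hits:
            findings.append((field, "FOUND", len(hits)))
        else:
            findings.append((field, "MISSING - verify against source", 0))
    return findings

text = "Tenant has 1O days to respond. 4 units vacant. Signature: ______"
for field, status, count in flag_for_review(text):
    print(field, status, count)
```

Note that "1O" (letter O) is still caught; every number a pattern surfaces should then be re-read by a human against the scanned source.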
If you want OCR with the big LLM providers, you should probably pass one page per request. Having the model focus on OCR for only a single page at a time seemed to help a lot in my anecdotal testing a few months ago. You can even send all the pages in parallel as separate requests and get the better-quality response much faster, too.
But, as others said, if you can't afford mistakes, then you're going to need a human in the loop to take responsibility.
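A rough sketch of the per-page fan-out, with a stub standing in for whatever LLM vision endpoint you actually call (the function and its prompt are placeholders, not a real API):

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of per-page OCR fan-out: one page per request, issued in
# parallel, results reassembled in page order. `ocr_one_page` stands
# in for a real call to an LLM vision endpoint (hypothetical here).

def ocr_one_page(page):
    # Placeholder for a real API call that sends a single page image
    # with a focused "transcribe this page" prompt.
    page_number, image_bytes = page
    return page_number, f"<text of page {page_number}>"

def ocr_document(pages, max_workers=8):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(ocr_one_page, pages))
    # pool.map preserves input order, so pages come back in sequence.
    return [text for _, text in results]

pages = [(i, b"") for i in range(1, 4)]
print(ocr_document(pages))
# → ['<text of page 1>', '<text of page 2>', '<text of page 3>']
```

The human-in-the-loop step then only has to review per-page outputs rather than one giant blob.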
I’m sure you’ve tried all this, but have you tried inter-rater agreement via multiple attempts on the same LLM vs. different LLMs? Perhaps your system would work better if you ran it through 5 models 3 times each and then highlighted the diffs for a human to choose from.
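A minimal diff-and-highlight pass over multiple transcripts might look like this (pure stdlib, using difflib; word-level granularity is an arbitrary choice):

```python
import difflib

# Sketch of the cross-model check: run the same page through several
# models (or several passes of one model), then surface only the
# spans where transcriptions disagree, for a human to adjudicate.

def disagreements(transcripts):
    """Diff each transcript against the first and collect differing
    word spans; identical outputs produce no findings."""
    baseline = transcripts[0].split()
    findings = []
    for other in transcripts[1:]:
        words = other.split()
        matcher = difflib.SequenceMatcher(None, baseline, words)
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op != "equal":
                findings.append((" ".join(baseline[i1:i2]),
                                 " ".join(words[j1:j2])))
    return findings

runs = [
    "Tenant has 10 days to respond.",
    "Tenant has 1O days to respond.",
    "Tenant has 10 days to respond.",
]
print(disagreements(runs))
# → [('10', '1O')]
```

An empty result is a (weak) signal of agreement; any finding goes straight to the human chooser.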
I'm keeping my eye on progress in this area as well. I need to free engineering design data from tens of thousands of PDF pages and make them easily and quickly accessible to LLMs.
I tested this pretty extensively, and it has a common failure mode that prevents me from using it: extracting footnotes and similar elements from the full text of academic works. For some reason, many of these models are trained in a way that results in these being excluded, despite these document sections often containing important details and context. Both versions of DeepSeek-OCR have the same problem. Of the others I’ve tested, dots.ocr in layout mode works best (but is slow), followed by Datalab’s Chandra model (which is larger and has bad license constraints).
I can get multiple sets of footnotes (critical + content notes) reliably recognized and categorized using gemini-3-flash-preview. It took me 15-20 hours to iterate on my prompt for a specific format; otherwise it would not produce good enough results. It was a slow process because results from batch mode did not mirror what I was getting in chat mode, and you have to wait for batch results while analyzing the last set. There was also a bit of debugging of the batch protocol going on at the same time. Flash is also surprisingly affordable for the results I am getting, 4-5x less than I had anticipated. I gave up on gemini-3-pro pretty quickly because it overthinks and messes things up.
I have been looking for an OCR model that can accurately handle footnotes. It’s essential for processing legal texts in particular, which often have footnotes that break across pages. Sadly I’ve yet to encounter a good solution.
I found Mathpix to be quite good with this type of document, including footnotes, though to be fair my documents did not have that many. It’s also proprietary.
I've been trying different OCR models on what should be very simple: subtitles (simple machine-rendered text). While all models do very well (95+% accuracy), I haven't seen one that doesn't occasionally make very obvious mistakes. Maybe it will take a different approach to get the last 1%...
Is it possible for such a small model to outperform gemini 3 or is this a case of benchmarks not showing the reality? I would love to be hopeful, but so far an open source model was never better than a closed one even when benchmarks were showing that.
Off the top of my head: for a lot of OCR tasks, it’s kind of worse for the model to be smart. I don’t want my OCR to make stuff up or answer questions; I want it to recognize what is actually on the page.
Has anyone experimented with using VLMs to detect "marks"? I'm thinking of pen/pencil markings like underlines, circles, checkmarks... Can these models do it?
None of them do it well, in our experience. We had to write our own custom pipeline with a mixture of legacy CV approaches to handle this (AI contract analysis). We constantly benchmark every new multimodal and VLM model that comes out and are consistently disappointed.
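As a toy example of the kind of legacy CV approach that ends up working, here is a sketch (NumPy only, thresholds invented) that flags blue-ink pixels and reports their bounding box; a real pipeline would add connected-component analysis and shape checks for circles, underlines, and checkmarks:

```python
import numpy as np

# Hedged sketch of a classic CV pass for pen marks: threshold pixels
# whose color looks like blue ink (strong blue channel, weak red and
# green) and report the bounding box of anything found.

def find_blue_ink(image):
    """image: HxWx3 uint8 RGB array. Returns (pixel_count, bbox),
    where bbox is (top, left, bottom, right), or (0, None)."""
    r = image[..., 0].astype(int)
    g = image[..., 1].astype(int)
    b = image[..., 2].astype(int)
    mask = (b > 120) & (b - r > 50) & (b - g > 50)
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return 0, None
    return int(ys.size), (int(ys.min()), int(xs.min()),
                          int(ys.max()), int(xs.max()))

# White page with a blue "underline" drawn across one row.
page = np.full((100, 200, 3), 255, dtype=np.uint8)
page[80, 20:120] = (10, 10, 200)  # RGB blue ink
print(find_blue_ink(page))
# → (100, (80, 20, 80, 119))
```

Dumb as it is, a thresholding pass like this is deterministic and testable, which is more than we can say for the VLMs we benchmarked.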
> Option 1: Zhipu MaaS API (Recommended for Quick Start)
> Use the hosted cloud API – no GPU needed.
...
> Option 2: Self-host with vLLM / SGLang
So, first off, this looks really cool and, given I'm looking for OCR at the moment, I'm pretty interested in this and other OCR models.
With that said, the README implies that option 2 requires a GPU. That's fine but it would be incredibly helpful if the README were explicit about requirements, and especially the amount of memory it needs.
EDIT: Looking at the links under option 3, the docs for macOS setup suggest 8GB of unified memory is enough to run the model, which is pretty modest, so I'd imagine Option 2 is similar. Ollama also offers a CPU-only option (no idea how that will perform; not amazingly, I'm guessing). That suggests that if your volume requirements are low, you can't shell out for or source a beefy enough GPU, and you don't want to pay the sometimes exorbitant hire costs, you should be able to punt it to a machine with enough memory to run the model without too much difficulty.
coder543|19 days ago
I’ve also heard very good things about these two in particular:
- LightOnOCR-2-1B: https://huggingface.co/lightonai/LightOnOCR-2-1B
- PaddleOCR-VL-1.5: https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.5
StableAlkyne|19 days ago
I remember that one clearing the scoreboard for many years, and usually it's the one I grab for OCR needs due to its reputation.
sgc|19 days ago
What more are you looking for?
mixedmath|19 days ago
Also, do you have preferred OCR models in your experience? I've had some success with dots.OCR, but I'm only beginning to need to work with OCR.
mikae1|19 days ago
EDIT: https://github.com/overcuriousity/pdf2epub looks interesting.
TZubiri|19 days ago
That doesn't sound great