
Our search for the best OCR tool (2019)

91 points | longrod | 3 years ago | source.opennews.org | reply

31 comments

[+] simonw|3 years ago|reply
I had spectacular results from AWS Textract recently - which when this article was written (2019) wasn't yet openly available.

I fed it thousands of pages of historical scanned documents - including handwritten journals from the 1800s - and it could read them better than I could!

I built a tool to use it (since running it in bulk against PDFs in a bucket took a few too many steps) and wrote about my experiences with it here: https://simonwillison.net/2022/Jun/30/s3-ocr/
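For anyone curious what working with Textract output looks like: `detect_document_text` returns JSON with a flat list of `Blocks`, and pulling out the recognized lines is a short loop. A minimal sketch (the `sample_response` below is a hand-made stand-in for a real response from boto3's Textract client, not actual output):

```python
# Sketch: extract plain text lines from an Amazon Textract
# DetectDocumentText response. Textract returns PAGE, LINE, and WORD
# blocks; the LINE blocks carry the per-line recognized text.

def lines_from_textract(response: dict) -> list[str]:
    """Collect the text of every LINE block, in order."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block["BlockType"] == "LINE"
    ]

# Hand-made stand-in for what
# textract.detect_document_text(Document={"Bytes": ...}) returns.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Dear Sir,"},
        {"BlockType": "WORD", "Text": "Dear"},
        {"BlockType": "LINE", "Text": "Thank you for your letter."},
    ]
}

print(lines_from_textract(sample_response))
```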

[+] amelius|3 years ago|reply
Great that it works for you, but I'm not too happy about big companies assuming that my product is connected to the internet.
[+] driscoll42|3 years ago|reply
For some comparison, I recently did an OCR comparison for some work for a professor. To set some context, all documents were 1960s era typed or handwritten documents in English, specifically from this archive - http://allenarchive.iac.gatech.edu/. I hand transcribed 50 documents to use as a base comparison and ran them through the various OCR engines getting the results below.

                             Overall          Typed          Handwritten
  OCR Engine          Leven   Cosine  Leven   Cosine  Leven   Cosine
  Amazon Textract     91.63%  98.14%  92.07%  98.76%  87.99%  92.10%
  Google Vision       93.05%  97.97%  93.84%  98.99%  85.86%  88.11%
  Microsoft Azure     80.32%  95.61%  80.65%  96.20%  79.14%  90.21%
  TrOCR               78.66%  93.97%  80.64%  96.65%  59.96%  67.89%
  PaddleOCR           84.82%  90.73%  88.60%  96.28%  49.64%  37.58%
  Tesseract           86.67%  89.53%  91.14%  95.63%  44.54%  31.39%
  Easy OCR            81.79%  85.07%  85.50%  91.89%  46.87%  19.23%
  Keras OCR           58.03%  83.57%  59.32%  89.98%  46.08%  21.20%
Leven is Levenshtein distance, expressed as a similarity percentage. Overall is a weighted average of typed vs handwritten, 90/10 if I recall correctly. All results were run on my personal machine with a 5950X, 128 GB RAM, and an RTX 3080.
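The exact metric implementations aren't given above, so here is a guess at the obvious variants: normalized Levenshtein similarity, cosine similarity over word counts, and the 90/10 overall weighting, in plain Python:

```python
# Sketch of the two accuracy metrics in the table above, assuming
# "Leven" is Levenshtein edit distance normalized into a similarity
# percentage and "Cosine" is cosine similarity over word counts.
# (The exact setup used for the table is my guess.)
from collections import Counter
import math

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def leven_similarity(truth: str, ocr: str) -> float:
    """100% means identical; distance is scaled by the longer string."""
    longest = max(len(truth), len(ocr)) or 1
    return 100.0 * (1 - levenshtein(truth, ocr) / longest)

def cosine_similarity(truth: str, ocr: str) -> float:
    """Cosine similarity between word-count vectors, as a percentage."""
    a, b = Counter(truth.split()), Counter(ocr.split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 100.0 * dot / norm if norm else 0.0

def overall(typed: float, handwritten: float) -> float:
    """The 90/10 typed-vs-handwritten weighting mentioned above."""
    return 0.9 * typed + 0.1 * handwritten
```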

From my analysis, Amazon Textract was excellent, the best of all the paid ones. While TrOCR and PaddleOCR were the best FOSS options, the issue with them is that they require a GPU, whereas Tesseract I could run on CPU alone. For instance, the times to OCR all 50 documents:

  Tesseract       1:19
  TrOCR (GPU)    27:33
  TrOCR (CPU)  3:04:22
TrOCR is great if you only need to do a few documents or have GPUs to burn, but Tesseract is by far the better choice if you need "good enough" across a large volume of documents. Since the intent of my project was a software plugin that could be sent to libraries/universities, CPU is king.
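Those wall-clock numbers work out to roughly 1.6 s/doc for Tesseract, 33 s/doc for TrOCR on GPU, and about 221 s/doc for TrOCR on CPU. A tiny sketch of the conversion (timestamps taken straight from the table above):

```python
# Convert the 50-document wall-clock times above into seconds per
# document. Accepts "m:ss" or "h:mm:ss" timestamps.

def to_seconds(stamp: str) -> int:
    secs = 0
    for part in stamp.split(":"):
        secs = secs * 60 + int(part)
    return secs

timings = {"Tesseract": "1:19", "TrOCR (GPU)": "27:33", "TrOCR (CPU)": "3:04:22"}
for engine, stamp in timings.items():
    print(f"{engine:12s} {to_seconds(stamp) / 50:8.2f} s/doc")
```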
[+] llaolleh|3 years ago|reply
This would probably make a good blog post!
[+] bufo|3 years ago|reply
The iOS / Apple OCR Swift API is drastically better than the ones I've tried online (e.g. Microsoft) or the open source ones (Tesseract). Highly recommended. You can get fairly high throughput with M1 chips. The CNN is accelerated by the neural chip and the language model is accelerated by the GPU.
[+] bee_rider|3 years ago|reply
I'm not sure if it is related, but I noticed recently while taking a photo of some really poorly lit text that the iPhone camera managed to pick it up and enhance it into legibility. Impressive feature and nice attention to detail on their part.
[+] pronoiac|3 years ago|reply
I went looking for a similar comparison a few months ago, and saw this: https://research.aimultiple.com/ocr-accuracy/ It compared ABBYY FineReader 15, Amazon Textract, Google Cloud Platform Vision API, Microsoft Azure Computer Vision API, and Tesseract OCR Engine. I ended up using OCRmyPDF / Tesseract out of convenience, but doing a second pass with Google Cloud Vision, AWS Textract, or ABBYY is somewhere on my to-do list.
[+] mcswell|3 years ago|reply
Several years ago, we did a project attempting to develop methods to OCR bilingual dictionaries. We just used Tesseract, because we were trying to develop methods to put stuff into particular fields (headword, part of speech, translations etc.), not compare OCR methods. As you might guess, there were lots of problems. But what really surprised me was that it was completely inaccurate in detecting bold characters--whereas I could detect bolding while standing far enough away from an image that I couldn't make out individual characters. And bold detection was crucial for parsing out some of the fields. (A more recent version of Tesseract doesn't even try to detect bold, afaict.)

We had another project later on aimed at simply detecting bold text, with some success. But there is very little literature on this topic. Anyone know of OCR tools that do detect bolding?
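The "visible from far away" observation suggests bold is largely an ink-density signal rather than a fine shape signal. A toy sketch of that heuristic on binarized glyph bitmaps (purely illustrative; a real page would need per-font calibration and size normalization):

```python
# Toy bold-detection heuristic: bold glyphs have a noticeably higher
# ink density (fraction of dark pixels) than regular glyphs at the
# same size, which is why bolding is visible even when individual
# characters are not legible.

def ink_density(bitmap: list[list[int]]) -> float:
    """Fraction of dark (1) pixels in a binarized glyph bitmap."""
    total = sum(len(row) for row in bitmap)
    dark = sum(sum(row) for row in bitmap)
    return dark / total if total else 0.0

def looks_bold(bitmap: list[list[int]], regular_density: float,
               factor: float = 1.3) -> bool:
    """Flag a glyph as bold when its density clearly exceeds the
    density of the page's regular text."""
    return ink_density(bitmap) > factor * regular_density

regular = [[1, 0, 0], [1, 0, 0], [1, 1, 0]]  # thin strokes: 4/9 dark
bold    = [[1, 1, 0], [1, 1, 0], [1, 1, 1]]  # thick strokes: 7/9 dark
```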

[+] nicodjimenez|3 years ago|reply
For STEM applications, nothing beats Mathpix OCR.

FB Research uses it, the London Stock Exchange uses it, Chegg uses it (in fact, they recently transitioned to Mathpix OCR from Google Vision), and many, many other companies and individuals.

Disclaimer: I'm the founder.

[+] albatrosstrophy|3 years ago|reply
How about foreign languages? I've never had one good enough for Arabic. Three years ago, when I needed it for a project, no OCR I found could read a properly scanned Arabic page. I had to go on Fiverr and pay a transcriber instead.
[+] bhaney|3 years ago|reply
I used ABBYY Finereader around 8 years ago to OCR an old EE textbook, and I was really impressed with the results back then. I haven't heard any mention of the company since then until now, so it's interesting to see that they still seem to have some of the best available OCR tech. I've since tried to use Tesseract for small OCR jobs several times over the last few years, and have never found its results to be even remotely usable (which is a real shame).
[+] iisan7|3 years ago|reply
What do folks think about these document types as a corpus for comparing tools? It's missing images and handwriting samples, but those types of documents might just be too variable to make conclusions about.

I remember Baidu's OCR giving excellent English results, but it looks like their API is deprecated now. Out of curiosity, I ran these samples through easyOCR by JaidedAI. Results at https://pastebin.com/RjzVd5Sf.

[+] sgc|3 years ago|reply
I OCR books, so they are not a good sample. I would want to compare at least 10 pages per sample, with more typical problems such as skewed, rounded pages from photos, artifacts, damaged source pages (tears and creases), etc. They do reproduce some problems with changing fonts and layout, but a big piece of the puzzle is custom dictionaries and layout training. It's fine for a once-over, but not a deep dive.
[+] longrod|3 years ago|reply
Found this comparison while researching OCR. It doesn't have the latest libraries like PaddleOCR but the performance of different OCR libraries is still quite apparent.
[+] ducktective|3 years ago|reply
I think among the easy-to-use FOSS CLI tools, the competition is between Tesseract and PaddlePaddle. I'd like to know how they fare against each other; I'm mainly interested in using them in `ocrmypdf`.
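For the ocrmypdf use case, a small helper to assemble the command line for a batch run. The `-l` and `--skip-text` flags are real ocrmypdf options (Tesseract is its underlying OCR engine), but the helper itself is just an illustration:

```python
# Hypothetical helper that assembles an ocrmypdf command line; run
# the result with subprocess.run(cmd, check=True).

def build_ocrmypdf_cmd(src: str, dst: str, lang: str = "eng",
                       skip_text: bool = True) -> list[str]:
    cmd = ["ocrmypdf", "-l", lang]
    if skip_text:
        # leave pages that already contain a text layer alone
        cmd.append("--skip-text")
    cmd += [src, dst]
    return cmd

print(build_ocrmypdf_cmd("in.pdf", "out.pdf"))
```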
[+] Rochus|3 years ago|reply
Interesting report. As far as I understand, in total none of the systems was really better in all categories (or did I miss something?). A summary would have been helpful. It would also be interesting to know whether the neural-network-based or the traditional Tesseract engine was used. I did similar experiments for a project six years ago and ended up with Tesseract and a custom traineddata file.
[+] nammi|3 years ago|reply
Does anyone use OCR to convert BluRay subtitles (.sup) to plaintext .srt files? I've used tools like SupRip and BDSup2Sub, but they've all required pretty significant cleanup afterwards. 'l', '1', and 'I' especially get mixed up a lot.
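The l/1/I mixups in particular are amenable to a regex post-processing pass over the OCR'd lines. A sketch with a few illustrative heuristics (these rules are examples, not a complete fix):

```python
import re

# Post-OCR cleanup sketch for the classic l/1/I confusions in
# subtitle OCR output. Each rule is a context heuristic.

def fix_l1I(line: str) -> str:
    # a lone lowercase "l" as a word is almost always the pronoun "I"
    line = re.sub(r"\bl\b", "I", line)
    # "I" in the middle of a lowercase word is usually "l" (e.g. "pIease")
    line = re.sub(r"(?<=[a-z])I(?=[a-z])", "l", line)
    # a digit "1" wedged between lowercase letters is usually "l" (e.g. "he1p")
    line = re.sub(r"(?<=[a-z])1(?=[a-z])", "l", line)
    return line

print(fix_l1I("l said he1p me, pIease"))
```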
[+] leokennis|3 years ago|reply
Assuming all these subtitles (at least per movie) are in the same font, isn’t it enough to tell/correct it once that “this is an I” and “this is a 1”, and then it knows for the entire .sup?
[+] solardev|3 years ago|reply
Wouldn't it be easier to just find it on www.opensubtitles.org?