For my use cases, this has beaten all "traditional approaches" for at least a few months now. That's just inferring from when I first stumbled across it; no clue how long it's been a thing.
What would you recommend for classifying documents? Most of the companies I've evaluated market their product as using fancy AI/ML, but instead they have hundreds of people, usually in India, manually classifying the documents.
For documents that are mostly pretty clean, you are probably right. The ceiling for AI/ML is definitely higher, though, and it is very useful right now if you know specifically what type of document you expect to look at but expect it to be messy.
Developments in this space are coming really fast, and reading words is squarely within the capabilities of neural engines. 5 years is a very long time in AI years.
As a developer who has been building IDP solutions, I can assert that although this model is a lot larger (more weights) than a Graph Neural Network on OCR tokens, the industry standard before transformers, it outperforms one given enough data. Depending on how heterogeneous the data is, usually around 200 documents can reach human levels of accuracy, scoring by Levenshtein ratio.
Smaller graph models could get away with using less data. The problem the "traditional" approach had is that the quality of the OCR was the bottleneck for overall model performance. It amazes me how this problem shifted from a node classification problem to an image-to-text problem.
Training on CPU was possible with a GCN but not with Donut.
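Scoring by Levenshtein ratio, as mentioned above, can be sketched as follows. The ratio definition here (1 minus edit distance over the longer string's length) is one common convention; libraries such as python-Levenshtein use a slightly different formula:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def levenshtein_ratio(pred: str, truth: str) -> float:
    """Similarity in [0, 1]: 1.0 means an exact match."""
    if not pred and not truth:
        return 1.0
    return 1.0 - levenshtein(pred, truth) / max(len(pred), len(truth))
```

Averaging this ratio over all extracted fields gives the kind of accuracy number being compared against human performance.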
If you want to train Donut, check out this notebook on Kaggle. It trains Donut to read plots for a competition, and contains the full pipeline for fine-tuning: https://www.kaggle.com/code/nbroad/donut-train-benetech
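For context, Donut fine-tuning pipelines like the one in that notebook serialize each document's ground-truth JSON into a flat token sequence that the decoder learns to emit. A simplified sketch of that conversion (the real pipeline also registers each `<s_k>` tag as a special token in the tokenizer, which is omitted here):

```python
def json2token(obj) -> str:
    """Flatten a ground-truth dict into a Donut-style tag sequence."""
    if isinstance(obj, dict):
        return "".join(
            f"<s_{k}>{json2token(v)}</s_{k}>" for k, v in obj.items()
        )
    if isinstance(obj, list):
        # repeated items are separated by a special <sep/> token
        return "<sep/>".join(json2token(v) for v in obj)
    return str(obj)

# Hypothetical ground truth for a receipt-style document
gt = {"menu": [{"nm": "Americano", "price": "3.00"},
               {"nm": "Latte", "price": "3.50"}]}
target = json2token(gt)
# "<s_menu><s_nm>Americano</s_nm><s_price>3.00</s_price><sep/>..."
```

At inference time the generated sequence is parsed back into JSON, which is why Donut needs no separate OCR step.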
There’s a model for music transcription (audio to MIDI) called MT3 which takes an end-to-end transformer approach and claims SOTA on some datasets. However, from my own research and comparisons with other models, MT3 seems very prone to overfitting, and the real-world results are not as impressive. A similar story seems to be playing out in the comments here.
I want to build an application that scans restaurant and café menus (PDFs, photos, webpages) to identify which items are vegetarian or vegan. Would this work for that? If not, I would love to hear people's ideas and suggestions.
With vegan you can’t estimate it 100% from the menu alone, because the sauce and other minor ingredients can be animal-based.
If you want to do it, using “plant based” is probably better than “vegan”, and it’s always good to make sure your users are aware that the mark can be wrong and they should double-check with the waiter.
As for your question: I didn’t play with Donut, but OCR+GPT, or multimodal GPT-4 once released, should handle this smoothly.
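A rough sketch of the OCR+GPT route: run any OCR over the menu, then ask a chat model to label each item. The prompt and label set below are hypothetical, and the "unclear" label reflects the caveat above about hidden ingredients:

```python
def build_menu_prompt(ocr_text: str) -> list:
    """Build a chat-completion message list asking the model to label items."""
    system = (
        "You label restaurant menu items. For each item, return JSON like "
        '{"item": "...", "label": "vegan" | "vegetarian" | "neither" | "unclear"}. '
        "Use unclear when sauces or minor ingredients are unknown."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": f"Menu OCR text:\n{ocr_text}"},
    ]

messages = build_menu_prompt("Margherita Pizza 9.50\nBeef Burger 12.00")
# pass `messages` to the chat-completion client of your choice
```

Asking for "unclear" explicitly tends to work better than forcing the model to guess a binary label.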
You should look at LayoutLM models for an NER task. Then your pipeline should look like:
- Identify the menu substructure (title, item list, ...)
- Classify each item with 2 labels.
The training process is not hard, but the data gathering / cleaning / labelling can be a little long.
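A toy stand-in for the two-label step above (in a real pipeline this would be the fine-tuned LayoutLM classification head; the keyword lists here are purely illustrative):

```python
# Illustrative keyword rules only; a trained model replaces this logic.
MEAT = {"beef", "chicken", "pork", "bacon", "fish", "shrimp"}
ANIMAL = MEAT | {"cheese", "cream", "butter", "egg", "honey", "milk"}

def classify_item(text: str) -> dict:
    """Return the two labels for one menu item."""
    words = {w.strip(".,()").lower() for w in text.split()}
    return {
        "vegetarian": not (words & MEAT),
        "vegan": not (words & ANIMAL),
    }

classify_item("Grilled Chicken Caesar")  # {'vegetarian': False, 'vegan': False}
classify_item("Hummus with pita")        # {'vegetarian': True, 'vegan': True}
```

The two-label scheme is useful because vegan implies vegetarian but not vice versa, so a single three-way class loses information about uncertainty.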
I will have to investigate this. I am dreaming of a system that can take a PDF scan of a book as input and produce one or more properly formatted (headings, italic, bold, underline, etc.) markdown files.
In my tests, LLMs have proved very good at cleaning up raw OCR, but they need formatting information to get me all the way.
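A sketch of why that formatting information matters: if the OCR layer reported font size and style per line (hOCR output and PDF text layers can carry this; plain-text OCR cannot), the mapping to markdown becomes mechanical. The thresholds and input shape below are made up for illustration:

```python
# Hypothetical input: (text, font_size_pt, is_italic) per OCR line.
def lines_to_markdown(lines, body_size=11):
    out = []
    for text, size, italic in lines:
        if size >= body_size * 1.8:
            out.append(f"# {text}")       # much larger than body: heading 1
        elif size >= body_size * 1.3:
            out.append(f"## {text}")      # somewhat larger: heading 2
        elif italic:
            out.append(f"*{text}*")
        else:
            out.append(text)
    return "\n".join(out)

md = lines_to_markdown([
    ("Chapter 1", 22, False),
    ("The Beginning", 15, False),
    ("It was a dark night.", 11, True),
])
```

An LLM pass could then clean the text inside each line without having to guess the structure.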
It's not ready to take a book, but I'm building an app that takes scans of book chapters/journal articles (which I often receive from my college library) and turns them into well-formatted PDFs (with OCR, consistent margins, rotation, ...): https://fixpdfs.com
This is really cool if it delivers. I tried building an app to scan till receipts. The image-to-text APIs out there really don't perform as well as you'd think. AWS Textract performed far better than the GCP and Azure equivalents and traditional OCR solutions, but it still made some really annoying errors that I had to fix with heuristics.
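The heuristics mentioned are typically of this shape: repair common character confusions, but only inside tokens that already look like prices. Everything here is illustrative and not specific to any one OCR API:

```python
import re

# OCR often confuses O/0, l/1, I/1 and S/5 inside numeric fields.
CONFUSIONS = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5"})

# digits-or-lookalikes, a decimal separator, two digits-or-lookalikes
PRICE_LIKE = re.compile(r"[\dOolIS]+[.,][\dOolIS]{2}")

def fix_price(token: str) -> str:
    """Repair digit confusions, but only in price-shaped tokens."""
    if PRICE_LIKE.fullmatch(token):
        return token.translate(CONFUSIONS).replace(",", ".")
    return token

fix_price("1O.5O")  # '10.50'
fix_price("Oil")    # left alone: not price-shaped
```

Restricting the substitution to price-shaped tokens is the key design choice; applying it globally would mangle ordinary words.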
I've started using Microsoft's TrOCR (another transformer OCR model) to read the cursive in my pocket journal (I have a habit of writing programs there first while I'm out and then typing them in manually; I just focus better that way).
It's surprisingly accurate, although you have to write your own program to segment the image into lines. I think with some fine-tuning I could have the machine read my notebook with minimal corrections.
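Line segmentation of the kind described can be sketched with a horizontal projection profile: sum the "ink" in each pixel row of a binarized image and split wherever a run of rows is empty. Real handwriting needs smoothing and a noise threshold, which this sketch omits:

```python
def segment_lines(image, min_ink=1):
    """image: 2D list of 0 (white) / 1 (ink). Returns (start, end) row spans."""
    row_ink = [sum(row) for row in image]
    lines, start = [], None
    for y, ink in enumerate(row_ink):
        if ink >= min_ink and start is None:
            start = y                      # a text line begins
        elif ink < min_ink and start is not None:
            lines.append((start, y))       # line ends (end is exclusive)
            start = None
    if start is not None:
        lines.append((start, len(image)))
    return lines

page = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],   # first text line
    [0, 0, 0, 0],
    [1, 1, 0, 1],   # second text line
    [1, 0, 1, 0],
]
segment_lines(page)   # [(1, 2), (3, 5)]
```

Each span can then be cropped out and fed to TrOCR one line at a time, since the model expects single-line images.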
AmazingTurtle | 2 years ago
I think the traditional approach to scanning and classifying without AI/ML is the way to go, for the next 5 years at the very least.
jameshart | 2 years ago
This seems like exactly the kind of problem that will see rapid improvements as people point more LLMs at multimodal input.
Right now making predictions for ML capabilities on a five year timeframe seems foolhardy.
tkanarsky | 2 years ago
Author: phew! I'm glad there's an 'n' in there somewhere
DannyBee | 2 years ago
Feels like someone trying to plant a stake in the ground rather than release a quality product, honestly.