For my use cases, this has beaten all "traditional approaches" for at least a few months now. That's just inferring from when I first stumbled across it; no clue how long it's been a thing.
What would you recommend for classifying documents? Most of the companies I've evaluated market their product as using fancy AI/ML, but instead they have hundreds of people, usually in India, manually classifying the documents.
For documents that are mostly pretty clean, you are probably right. The ceiling for AI/ML is definitely higher, though, and it is very useful right now if you know specifically what type of document you expect to look at but expect it to be messy.
Developments in this space are coming really fast, and reading words is squarely within the capabilities of neural engines. 5 years is a very long time in AI years.
As a developer who has been building IDP solutions, I can assert that although this model is a lot larger (more weights) than a Graph Neural Network on OCR tokens, the industry standard before transformers, it outperforms one given enough data. Depending on how heterogeneous the data is, usually around 200 documents can reach human levels of accuracy, scoring by Levenshtein ratio.
Smaller graph models could get away with using less data. The problem the "traditional" approach had is that the quality of the OCR was the bottleneck for overall model performance. It amazes me how this problem shifted from a node classification problem to an image-to-text problem.
Training on CPU was possible with a GCN but not with Donut.
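Scoring by Levenshtein ratio, as mentioned above, can be sketched as follows. The ratio definition here (1 minus edit distance over the longer string's length) is one common convention; libraries such as python-Levenshtein use a slightly different formula:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def levenshtein_ratio(pred: str, truth: str) -> float:
    """Similarity in [0, 1]: 1.0 means an exact match."""
    if not pred and not truth:
        return 1.0
    return 1.0 - levenshtein(pred, truth) / max(len(pred), len(truth))
```

Averaging this ratio over all extracted fields gives the kind of accuracy number being compared against human performance.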
If you want to train Donut, check out this notebook on Kaggle. It trains Donut to read plots for a competition, and contains the full pipeline for fine-tuning: https://www.kaggle.com/code/nbroad/donut-train-benetech
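For context, Donut fine-tuning pipelines like the one in that notebook serialize each document's ground-truth JSON into a flat token sequence that the decoder learns to emit. A simplified sketch of that conversion (the real pipeline also registers each `<s_k>` tag as a special token in the tokenizer, which is omitted here):

```python
def json2token(obj) -> str:
    """Flatten a ground-truth dict into a Donut-style tag sequence."""
    if isinstance(obj, dict):
        return "".join(
            f"<s_{k}>{json2token(v)}</s_{k}>" for k, v in obj.items()
        )
    if isinstance(obj, list):
        # repeated items are separated by a special <sep/> token
        return "<sep/>".join(json2token(v) for v in obj)
    return str(obj)

# Hypothetical ground truth for a receipt-style document
gt = {"menu": [{"nm": "Americano", "price": "3.00"},
               {"nm": "Latte", "price": "3.50"}]}
target = json2token(gt)
# "<s_menu><s_nm>Americano</s_nm><s_price>3.00</s_price><sep/>..."
```

At inference time the generated sequence is parsed back into JSON, which is why Donut needs no separate OCR step.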
There’s a model for music transcription (audio to MIDI) called MT3 which takes an end-to-end transformer approach and claims SOTA on some datasets. However, from my own research and comparisons with other models, MT3 seems very prone to overfitting, and the real-world results are not as impressive. A similar story seems to be playing out in the comments here.
I want to build an application that scans restaurant and café menus (PDFs, photos, webpages) to identify which items are vegetarian or vegan. Would this work for that? If not, I would love to hear people's ideas and suggestions.
With vegan you can’t estimate it 100% from the menu alone, because the sauce and other minor ingredients can be animal-based.
If you want to do it, using “plant based” is probably better than “vegan”, and it’s always good to make sure your users are aware that the mark can be wrong and they should double-check with the waiter.
As for your question: I didn’t play with Donut, but OCR+GPT, or multimodal GPT-4 once released, should handle this smoothly.
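A rough sketch of the OCR+GPT route: run any OCR over the menu, then ask a chat model to label each item. The prompt and label set below are hypothetical, and the "unclear" label reflects the caveat above about hidden ingredients:

```python
def build_menu_prompt(ocr_text: str) -> list:
    """Build a chat-completion message list asking the model to label items."""
    system = (
        "You label restaurant menu items. For each item, return JSON like "
        '{"item": "...", "label": "vegan" | "vegetarian" | "neither" | "unclear"}. '
        "Use unclear when sauces or minor ingredients are unknown."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": f"Menu OCR text:\n{ocr_text}"},
    ]

messages = build_menu_prompt("Margherita Pizza 9.50\nBeef Burger 12.00")
# pass `messages` to the chat-completion client of your choice
```

Asking for "unclear" explicitly tends to work better than forcing the model to guess a binary label.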
You should look at LayoutLM models for an NER task. Then your pipeline should look like:
- Identify the menu substructure (title, item list, ...)
- Classify each item with 2 labels.
The training process is not hard, but the data gathering / cleaning / labelling can be a little long.
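A toy stand-in for the two-label step above (in a real pipeline this would be the fine-tuned LayoutLM classification head; the keyword lists here are purely illustrative):

```python
# Illustrative keyword rules only; a trained model replaces this logic.
MEAT = {"beef", "chicken", "pork", "bacon", "fish", "shrimp"}
ANIMAL = MEAT | {"cheese", "cream", "butter", "egg", "honey", "milk"}

def classify_item(text: str) -> dict:
    """Return the two labels for one menu item."""
    words = {w.strip(".,()").lower() for w in text.split()}
    return {
        "vegetarian": not (words & MEAT),
        "vegan": not (words & ANIMAL),
    }

classify_item("Grilled Chicken Caesar")  # {'vegetarian': False, 'vegan': False}
classify_item("Hummus with pita")        # {'vegetarian': True, 'vegan': True}
```

The two-label scheme is useful because vegan implies vegetarian but not vice versa, so a single three-way class loses information about uncertainty.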
I will have to investigate this. I am dreaming of a system that can take a PDF scan of a book as input and produce one or more properly formatted (headings, italic, bold, underline, etc.) markdown files.
In my tests, LLMs have proved very good at cleaning up raw OCR, but they need formatting information to get me all the way.
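A sketch of why that formatting information matters: if the OCR layer reported font size and style per line (hOCR output and PDF text layers can carry this; plain-text OCR cannot), the mapping to markdown becomes mechanical. The thresholds and input shape below are made up for illustration:

```python
# Hypothetical input: (text, font_size_pt, is_italic) per OCR line.
def lines_to_markdown(lines, body_size=11):
    out = []
    for text, size, italic in lines:
        if size >= body_size * 1.8:
            out.append(f"# {text}")       # much larger than body: heading 1
        elif size >= body_size * 1.3:
            out.append(f"## {text}")      # somewhat larger: heading 2
        elif italic:
            out.append(f"*{text}*")
        else:
            out.append(text)
    return "\n".join(out)

md = lines_to_markdown([
    ("Chapter 1", 22, False),
    ("The Beginning", 15, False),
    ("It was a dark night.", 11, True),
])
```

An LLM pass could then clean the text inside each line without having to guess the structure.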
It's not ready to take a book, but I'm building an app that takes scans of book chapters/journal articles (which I often receive from my college library) and turns them into well-formatted PDFs (with OCR, consistent margins, rotation, ...): https://fixpdfs.com
This is really cool if it delivers. I tried building an app to scan till receipts. The image-to-text APIs out there really don't perform as well as you'd think. AWS Textract performed far better than the GCP and Azure equivalents and traditional OCR solutions, but it still made some really annoying errors that I had to fix with heuristics.
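The heuristics mentioned are typically of this shape: repair common character confusions, but only inside tokens that already look like prices. Everything here is illustrative and not specific to any one OCR API:

```python
import re

# OCR often confuses O/0, l/1, I/1 and S/5 inside numeric fields.
CONFUSIONS = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5"})

# digits-or-lookalikes, a decimal separator, two digits-or-lookalikes
PRICE_LIKE = re.compile(r"[\dOolIS]+[.,][\dOolIS]{2}")

def fix_price(token: str) -> str:
    """Repair digit confusions, but only in price-shaped tokens."""
    if PRICE_LIKE.fullmatch(token):
        return token.translate(CONFUSIONS).replace(",", ".")
    return token

fix_price("1O.5O")  # '10.50'
fix_price("Oil")    # left alone: not price-shaped
```

Restricting the substitution to price-shaped tokens is the key design choice; applying it globally would mangle ordinary words.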
I've started using Microsoft's TrOCR (another transformer OCR model) to read the cursive in my pocket journal (I have a habit of writing programs there first while I'm out and then typing them in manually; I just focus better that way).
It's surprisingly accurate, although you have to write your own program to segment the image into lines. I think with some fine-tuning I could have the machine read my notebook with minimal corrections.
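Line segmentation of the kind described can be sketched with a horizontal projection profile: sum the "ink" in each pixel row of a binarized image and split wherever a run of rows is empty. Real handwriting needs smoothing and a noise threshold, which this sketch omits:

```python
def segment_lines(image, min_ink=1):
    """image: 2D list of 0 (white) / 1 (ink). Returns (start, end) row spans."""
    row_ink = [sum(row) for row in image]
    lines, start = [], None
    for y, ink in enumerate(row_ink):
        if ink >= min_ink and start is None:
            start = y                      # a text line begins
        elif ink < min_ink and start is not None:
            lines.append((start, y))       # line ends (end is exclusive)
            start = None
    if start is not None:
        lines.append((start, len(image)))
    return lines

page = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],   # first text line
    [0, 0, 0, 0],
    [1, 1, 0, 1],   # second text line
    [1, 0, 1, 0],
]
segment_lines(page)   # [(1, 2), (3, 5)]
```

Each span can then be cropped out and fed to TrOCR one line at a time, since the model expects single-line images.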
AmazingTurtle | 2 years ago
I think the traditional approach to scanning and classifying without AI/ML is the way to go, for the next 5 years at the very least.
jameshart | 2 years ago
This seems like exactly the kind of problem that will see rapid improvements as people point more LLMs at multimodal input.
Right now making predictions for ML capabilities on a five year timeframe seems foolhardy.
tkanarsky | 2 years ago
Author: phew! I'm glad there's an 'n' in there somewhere
DannyBee | 2 years ago
Feels like someone trying to plant a stake in the ground rather than release a quality product, honestly.