troysk|1 year ago
In my experience, this works well but doesn't scale to all kinds of documents.
For scientific papers, it can't render formulas; Meta's Nougat is the best model for that.
For invoices and records, Donut works better.
Both of these models fail in some cases, so you end up running an LLM to fix the issues.
Even then, the LLM can't do tables and charts justice, because the details (bold/italic/other nuances) were lost during the OCR process. I'd consider these "classical" methods too.
I have found vision models to be much better, since they work from the original document/image. Clear prompts help, but you still won't get 100% accurate results, as the models tend to wander off on their own paths.
I believe that could be fixed with fine-tuning, but no good vision model offers fine-tuning on images.
Google Gemini seems to have the feature, but I haven't tried it.
Few-shot prompting helps keep the LLM from hallucinating, resists prompt injection, and helps it adhere to the requested format.
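A minimal sketch of what that few-shot setup can look like, assuming a chat-style vision-LLM API. The task (invoice field extraction), the field names, and the example documents are all hypothetical; the point is that the assistant turns pin down the exact output format:

```python
import json

# Hypothetical few-shot examples for extracting invoice fields as JSON.
# Each "shot" pairs an example document with the exact output we want
# the model to imitate.
FEW_SHOTS = [
    (
        "Invoice #1042 from Acme Corp, dated 2023-05-01, total $1,250.00",
        {"invoice_number": "1042", "vendor": "Acme Corp",
         "date": "2023-05-01", "total": "1250.00"},
    ),
    (
        "Invoice #88 from Widget LLC, dated 2023-06-15, total $99.95",
        {"invoice_number": "88", "vendor": "Widget LLC",
         "date": "2023-06-15", "total": "99.95"},
    ),
]

def build_messages(new_document: str) -> list[dict]:
    """Assemble a chat message list: system instruction, then each
    few-shot example as a user/assistant turn, then the new input."""
    messages = [{
        "role": "system",
        "content": "Extract invoice fields. Reply with JSON only, "
                   "using exactly the keys shown in the examples.",
    }]
    for doc, fields in FEW_SHOTS:
        messages.append({"role": "user", "content": doc})
        messages.append({"role": "assistant", "content": json.dumps(fields)})
    messages.append({"role": "user", "content": new_document})
    return messages

msgs = build_messages("Invoice #7 from Foo Inc, dated 2024-01-02, total $10.00")
```

Showing the model literal assistant responses in the requested schema is what keeps it from drifting into free-form prose on the real input.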
jszymborski|1 year ago
1. Segment the document: identify which part is text, which is an image, a formula, a table, etc.
2. For text, do OCR + LLM. You can use an LLM to score the likelihood of the predicted text, and if it is way off, try a ViT or something to OCR instead.
3. For tables, a ViT/CNN can identify the cells to recover positional information, then OCR + LLM recovers the contents of each cell.
4. For formulas (and formulas inside tables), just use a ViT/CNN.
5. For images, a captioning ViT/CNN can caption the photo, if that's desired.
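The steps above can be sketched as a routing loop. Everything here is a stub (the segmenter and the per-type handlers are hypothetical stand-ins for real models, e.g. a layout-analysis model for step 1, Nougat-style models for formulas), but it shows the dispatch structure:

```python
# Hypothetical multi-step pipeline: segment the page, then route each
# region to a type-specific handler. All handlers are stubs.
def segment(page):
    """Stand-in for a layout-analysis model; returns (kind, region) pairs."""
    return [("text", "paragraph-1"), ("table", "table-1"), ("formula", "eq-1")]

def ocr_plus_llm(region):      # step 2: OCR, then LLM cleanup
    return f"text<{region}>"

def table_cells(region):       # step 3: cell detection + per-cell OCR + LLM
    return f"table<{region}>"

def formula_vit(region):       # step 4: formula-specific ViT/CNN
    return f"latex<{region}>"

def caption_image(region):     # step 5: image captioning
    return f"caption<{region}>"

HANDLERS = {
    "text": ocr_plus_llm,
    "table": table_cells,
    "formula": formula_vit,
    "image": caption_image,
}

def process(page):
    """Run the full pipeline over one page."""
    return [HANDLERS[kind](region) for kind, region in segment(page)]

result = process("page-1")
```

The dispatch table makes it easy to swap any single stage (say, a better formula model) without touching the rest of the pipeline.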
ozim|1 year ago
troysk|1 year ago
I prefer to do all of this in one step with an LLM, a good prompt, and a few shots.
With so many passes over the images, the cost and latency add up, and ViTs are slower.
vintermann|1 year ago
What I want to do is read handwritten documents from the 18th century, and I feel like the multi-step approach hits a hard ceiling there. Transkribus is multi-step, but its line detection model is just terrible. Things that should be easy, such as printed schemas, utterly confuse it. You simply need to be smart about context to a much higher degree than you do in OCR of typewritten text.
huijzer|1 year ago
In that case, the model can already do the OCR, and it's getting an order of magnitude cheaper every year.
troysk|1 year ago
ChadNauseam|1 year ago
troysk|1 year ago
Mathpix Markdown, however, is awesome, and I ask LLMs to use it to denote formulas, since raw LaTeX is tricky to render in HTML when delimiters don't match up. I don't know LaTeX well, so I haven't gone deeper on it.
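For reference, Mathpix Markdown mixes ordinary Markdown with LaTeX math delimiters (`\( … \)` inline, `$$ … $$` for display math). A small illustrative sample, not output from any particular model:

```markdown
## Results

The model minimizes the loss \( \mathcal{L}(\theta) \) over parameters \( \theta \):

$$
\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - f_\theta(x_i) \right)^2
$$
```

Because the math stays in explicit delimiters, a renderer can hand just those spans to a math engine and treat everything else as plain Markdown.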
EarlyOom|1 year ago
troysk|1 year ago
aaron695|1 year ago
[deleted]