kergonath | 18 days ago

Tesseract does not understand layout. It’s fine for character recognition, but if I still have to pipe the output to an LLM to make sense of the layout and fix common transcription errors, I might as well use a single model. It’s also easier for a visual LLM to extract figures and tables in one pass.

chaps|18 days ago

For my workflows, layout extraction has been so inconsistent that I've stopped attempting to use it. It's simpler to just throw everything into PostGIS and run intersection checks on size-normalized pages.
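To illustrate the idea of intersection checks on size-normalized pages, here is a minimal sketch in plain Python rather than PostGIS. It is not the setup described above: the `Box` type, the `normalize` and `intersects` helpers, the `header_region` zone, and the sample coordinates are all hypothetical. The rectangle test only approximates what a spatial query such as `ST_Intersects` would do on real geometries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box: (x0, y0) top-left, (x1, y1) bottom-right."""
    x0: float
    y0: float
    x1: float
    y1: float


def normalize(box: Box, page_width: float, page_height: float) -> Box:
    """Scale a box from page units to the unit square so that pages of
    different physical sizes become directly comparable."""
    return Box(
        box.x0 / page_width,
        box.y0 / page_height,
        box.x1 / page_width,
        box.y1 / page_height,
    )


def intersects(a: Box, b: Box) -> bool:
    """True if the two rectangles share any area or boundary."""
    return a.x0 <= b.x1 and b.x0 <= a.x1 and a.y0 <= b.y1 and b.y0 <= a.y1


# Hypothetical zone, defined once in normalized coordinates:
# the top 10% of any page.
header_region = Box(0.0, 0.0, 1.0, 0.1)

# Hypothetical OCR word boxes in points, on a US Letter page (612 x 792)
# and on an A4 page (595 x 842).
word_letter = normalize(Box(72, 36, 200, 50), 612, 792)
word_a4 = normalize(Box(70, 38, 195, 53), 595, 842)
body_word = normalize(Box(72, 400, 200, 414), 612, 792)

for name, box in [("letter", word_letter), ("a4", word_a4), ("body", body_word)]:
    print(name, intersects(box, header_region))
```

Because every box is rescaled to the unit square first, the same zone definition works for both page sizes without any layout model in the loop.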

kergonath|18 days ago

Interesting. What kind of layout do you have?

My documents have one- or two-column layouts, often inconsistently across pages or even within a page (which tripped up older layout detection methods). Most models seem to understand that well enough, so they are good enough for my use case.

fudged71|18 days ago

I don't know how, but PyMuPDF4LLM is based on Tesseract and has GNN-based layout detection.