top | item 41053840

ipkstef | 1 year ago

I think I'm missing something. Why would I pay to OCR the images when I can do it locally for free? Tesseract runs pretty well on just a CPU, so you wouldn't even need anything particularly powerful.
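
For anyone who wants to try the local route, here is a minimal sketch that shells out to the Tesseract CLI from Python. It assumes the `tesseract` binary (version 4 or later, for the `--psm` flag) is on your PATH. The function names, the `eng` default, and the `scan.png` file name are made up for illustration.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, lang="eng", psm=3):
    """Build the argument list for a Tesseract CLI call that prints text to stdout."""
    # Passing "stdout" as the output base makes Tesseract print the text
    # instead of writing it to a file.
    return ["tesseract", str(image_path), "stdout", "-l", lang, "--psm", str(psm)]


def ocr_image(image_path, lang="eng", psm=3):
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract binary not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang, psm),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract exits with an error
    )
    return result.stdout
```

Calling `ocr_image("scan.png")` would return the recognized text as a string, provided the file exists and the requested language data is installed.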

daemonologist|1 year ago

Tesseract works great for pure label-the-characters OCR, which is sufficient for books and other sources with straightforward layouts, but it doesn't handle weird layouts (tables, columns, tables with columns in each cell, etc.). People will do absolutely depraved stuff with Word and PDF documents, and you often need semantic understanding to decipher it.

That said, sometimes no amount of understanding will improve the OCR output because a structure in a document cannot be converted to a one-dimensional string (short of using HTML/CSS or something). Maybe we'll get image -> HTML models eventually.
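
A toy illustration of the layout problem, not a model of how any particular OCR engine works: the sample page and both function names are invented. Every character is recognized correctly in both cases; only the reading order differs.

```python
# A two-column page: each tuple is one visual row (left cell, right cell).
page_rows = [
    ("Alice was born", "Bob was born"),
    ("in Paris.", "in Rome."),
]


def read_line_by_line(rows):
    """Naive reading order: left to right across each visual line."""
    return " ".join(cell for row in rows for cell in row)


def read_column_by_column(rows):
    """Layout-aware reading order: finish the left column, then the right."""
    columns = zip(*rows)  # transpose rows into columns
    return " ".join(cell for column in columns for cell in column)


print(read_line_by_line(page_rows))
# Alice was born Bob was born in Paris. in Rome.
print(read_column_by_column(page_rows))
# Alice was born in Paris. Bob was born in Rome.
```

The first string has perfect characters and is still useless, which is the point: picking the right order requires knowing the layout, and for nested tables there may be no single order that works.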

gregolo|1 year ago

And OpenAI uses Tesseract in the background; for me, it sometimes answers that the Hungarian language is not installed for Tesseract.

s5ma6n|1 year ago

I would be extremely surprised if that's the case. There are "open-source" multimodal LLMs that can extract text from images, which is proof that the idea works.

Probably the model is hallucinating and adding "Hungarian language is not installed for Tesseract" to the response.