top | item 43044769

mometsi | 1 year ago

> How is this different from tesseract and friends?

The workflow is for digitizing historical printed documents. Think conserving old announcements in blackletter typesetting, not extracting info from typewritten business documents.

amelius|1 year ago

I didn't get good results with Tesseract, so I hope this is really different ;)

I was surprised that Tesseract did not work flawlessly even on text captured from the screen. Maybe it was not made for that, but I also had a lot of problems with high-resolution photos. I did not try scanned documents, though.

Moto7451|1 year ago

I have never had to handle handwriting professionally, but I have had great success with Tesseract in the past. I’m sure it’s no longer the best free/cheap option, but with a little bit of image pre-processing to ensure the text pops from the background and isn’t unnecessarily large (i.e. that 1200 dpi scan is overkill), you can have a pretty nice pipeline with good results.
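For anyone wondering what that pre-processing could look like, here is a minimal sketch in plain Python. It is not the exact setup described above: Otsu thresholding is just one common way to make text "pop", the 300 dpi target is an assumed value, and the function names (`otsu_threshold`, `binarize`, `scale_factor`) are made up for illustration.

```python
from typing import List, Sequence


def otsu_threshold(pixels: Sequence[int]) -> int:
    """Pick a global threshold for 8-bit grayscale values (0..255) using
    Otsu's method: maximize the between-class variance of dark vs. light."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels with value <= t
    sum0 = 0    # sum of those pixel values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # no dark pixels yet
        w1 = total - w0
        if w1 == 0:
            break             # no light pixels left
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(pixels: Sequence[int]) -> List[int]:
    """Map every pixel to pure black (0) or pure white (255)."""
    t = otsu_threshold(pixels)
    return [255 if p > t else 0 for p in pixels]


def scale_factor(scan_dpi: int, target_dpi: int = 300) -> float:
    """Factor for shrinking an oversized scan. Never upscales.
    target_dpi=300 is an assumption, not a value from the comment."""
    return min(1.0, target_dpi / scan_dpi)
```

With this, `scale_factor(1200)` gives 0.25, so a 1200 dpi scan would be shrunk to a quarter of its width and height before binarizing and handing it to the OCR engine. In a real pipeline an image library would do the resizing and pixel access; the lists here just keep the sketch dependency-free.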

In the mid-2010s I put Tesseract, OCRad (which is decidedly not state of the art), and aspell into a pretty effective text processing pipeline to transform resumes into structured documents. The commercial solutions we looked at (at the time) were a little slower and about as good. If the spellcheck came back with too low a success rate, I ran the document through OCRad which, while simplistic, sometimes did a better job.
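The fallback logic in that pipeline can be sketched roughly like this. This is a reconstruction under assumptions, not the original code: a plain word set stands in for aspell, the 0.8 cutoff is invented, "keep whichever output scores higher" is my guess at how the two results were compared, and `known_word_ratio` / `ocr_with_fallback` are hypothetical names. The OCR engines are passed in as callables so the logic can be tried without either tool installed.

```python
import re
import subprocess
from typing import Callable, Iterable, Tuple

OcrEngine = Callable[[str], str]  # takes an image path, returns recognized text


def known_word_ratio(text: str, dictionary: Iterable[str]) -> float:
    """Fraction of alphabetic tokens found in the dictionary.
    Stand-in for a spellchecker's success rate."""
    vocab = {w.lower() for w in dictionary}
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words:
        return 0.0
    return sum(w in vocab for w in words) / len(words)


def ocr_with_fallback(
    image_path: str,
    primary: OcrEngine,
    fallback: OcrEngine,
    dictionary: Iterable[str],
    min_ratio: float = 0.8,  # assumed cutoff, tune per corpus
) -> Tuple[str, str]:
    """Run the primary engine; if too few words pass the spellcheck,
    try the fallback and keep whichever output scores higher."""
    vocab = list(dictionary)
    text = primary(image_path)
    score = known_word_ratio(text, vocab)
    if score >= min_ratio:
        return text, "primary"
    alt = fallback(image_path)
    if known_word_ratio(alt, vocab) > score:
        return alt, "fallback"
    return text, "primary"


def tesseract_cli(image_path: str) -> str:
    """One possible primary engine: the tesseract command-line tool,
    with 'stdout' as the output base so the text is printed, not written."""
    result = subprocess.run(
        ["tesseract", image_path, "stdout"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

A second wrapper for OCRad would be passed as `fallback` in the same way. The returned label ("primary" or "fallback") makes it easy to log how often the second engine actually won.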

I expect the results today with more modern projects to be much better, so I probably wouldn’t go down that path again. However, since all of it runs nicely on slow hardware, it likely still has a place on low-power, hobby-grade IoT boards and in other niches.

spigottoday|1 year ago

I have a typewritten manuscript that is interspersed with handwritten edits. Tesseract worked fine until it hit the handwritten parts, then produced garbage. Is there a local solution that anyone can recommend? I have a 16 GB Lenovo laptop and access to a workstation with an RTX 4070 Ti 16 GB card. Thanks.

bonefolder|1 year ago

Tangentially related, but does someone know a resource for high-quality scans of documents in blackletter / Fraktur typesetting? I'm trying to make documents look fraktury in LaTeX and would like any and all documents I can lay my hands on.