lokl | 1 year ago

Tried with a few historical handwritten German documents, accuracy was abysmal.

lysace|1 year ago

Semi-OT (similar language): The national archives in Sweden and Finland published a model for OCR:ing handwritten Swedish text from the 1600s to the 1800s, with what seems to me like a very high level of accuracy given the source material (4% character error rate).

https://readcoop.eu/model/the-swedish-lion-i/

https://www.transkribus.org/success-story/creating-the-swedi...

https://huggingface.co/Riksarkivet

They have also published a fairly large volume of OCR:ed texts (IIRC birth/death notices from church records) using this model online. As a beginner genealogist it's been fun to follow.
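
For context on the 4% figure: character error rate (CER) is commonly defined as the character-level edit distance between a model's transcription and a reference transcription, divided by the length of the reference. Below is a minimal Python sketch of that definition. The function names are illustrative, and this is not the evaluation code used for the model above; how that project normalizes text or handles whitespace before scoring is not stated here.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `hyp` into `ref`."""
    # prev[j] holds the distance between the first i-1 chars of ref
    # and the first j chars of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference transcription must be non-empty")
    return levenshtein(ref, hyp) / len(ref)
```

Under this definition, a 4% CER means roughly four character edits per 100 reference characters, which for handwriting usually translates into a noticeably higher error rate at the word level.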

Thaxll|1 year ago

HTR (Handwritten Text Recognition) is a completely different space than OCR. What exactly were you expecting?

riquito|1 year ago

It fits the "use cases" mentioned in the article:

> Preserving historical and cultural heritage: Organizations and nonprofits that are custodians of heritage have been using Mistral OCR to digitize historical documents and artifacts, ensuring their preservation and making them accessible to a broader audience.

butovchenkoy|11 months ago

For this task, general models will always perform poorly. My company trains custom generative AI models for document understanding. We recently trained a VLM for the German government to recognize documents written in old German handwriting, and it performed with exceptionally high accuracy.

rvnx|1 year ago

They are probably overfitting to the benchmarks, since other users also complain about the low accuracy.

thadt|1 year ago

Also working with historical handwritten German documents. So far Gemini seems to be the least wrong of the ones I've tried - any recommendations?

butovchenkoy|11 months ago

My recommendation is to train a custom model.

anothermathbozo|1 year ago

Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR) are different tasks.