souvik3333 | 8 months ago

We have not trained explicitly on handwriting datasets (completely handwritten documents), but the training data includes many forms that contain handwriting. So do try it on your files. There is a Hugging Face demo where you can test quickly: https://huggingface.co/spaces/Souvik3333/Nanonets-ocr-s

We are currently working on creating completely handwritten document datasets for our next model release.

Eisenstein | 8 months ago

Document:

* https://imgur.com/cAtM8Qn

Result:

* https://imgur.com/ElUlZys

Perhaps it needed more than 1K tokens? But it took about an hour (number 28 in queue) to generate that and I didn't feel like trying again.

How many tokens does it usually take to represent a page of text with 554 characters?

souvik3333 | 8 months ago

Hey, the reason for the long processing time is that lots of people are using it, probably with larger documents. I tested your file locally and it seems to be working correctly: https://ibb.co/C36RRjYs

Regarding the token limit, the number of tokens needed depends on the text. We are using the Qwen2.5-VL tokenizer, in case you are interested in reading about it.
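
To make the "it depends on the text" point concrete, here is a minimal sketch of two ways to get a token count for a page. The first is a rough rule of thumb (about 4 characters per token for English text, which is an assumption and not a property of this model). The second counts tokens with a Hugging Face tokenizer. The checkpoint name `Qwen/Qwen2.5-VL-3B-Instruct` is an assumption chosen for illustration, and the function names are hypothetical. Markdown or table markup in the output will push the real count higher than the plain-text estimate.

```python
import math


def estimate_tokens(num_chars: int, chars_per_token: float = 4.0) -> int:
    """Rough estimate only: assumes ~4 characters per token (a common
    rule of thumb for English text, not a measured value for this model)."""
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    return math.ceil(num_chars / chars_per_token)


def count_tokens_exact(text: str,
                       model_id: str = "Qwen/Qwen2.5-VL-3B-Instruct") -> int:
    """Count tokens with a Hugging Face tokenizer.

    Assumption: model_id is an illustrative Qwen2.5-VL checkpoint; swap in
    whichever checkpoint you actually run. Requires the `transformers`
    package and network access to download the tokenizer files.
    """
    # Imported lazily so the estimate above works without transformers.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    return len(tokenizer.encode(text))


if __name__ == "__main__":
    # The 554-character page from the question above.
    print(estimate_tokens(554))
```

Under that rule of thumb, a 554-character page would fit well inside a 1K-token budget, but the exact count from the tokenizer is the number to trust.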

You can run it very easily in a Colab notebook, which should be faster than the demo: https://github.com/NanoNets/docext/blob/main/PDF2MD_README.m...

There are incorrect words in the extraction, so I would suggest waiting for the handwritten text model's release.