c_moscardi | 1 year ago
1. I believe the IR jargon for getting a JSON of this form is Key Information Extraction (KIE). MS has an out-of-the-box model for this. I just tried it on the screenshot and it did a pretty good (but not perfect) job: it got most of the form fields, but not all of them. MS sort of has a flow for fine-tuning, but it really leaves a lot to be desired IMO. Curious if this would be "good enough" to satisfy the use case.
2. In terms of just OCR (i.e. getting the text/numeric strings correct), MS is known to be the best on typed text at the moment [1]. Handwriting is a different beast... but it looks like MS is doing a very good job there (and SOTA on handwriting is very good). In particular, it got all the numbers in that screenshot correct.
If you want to see the results from MS on the screenshot in this blog post, here's the entire JSON blob. A bit of a behemoth but the key/value stuff is in there: https://gist.github.com/cmoscardi/8c376094181451a49f0c62406e...
[1] https://mindee.github.io/doctr/latest/using_doctr/using_mode...
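For anyone who wants to pull just the key/value stuff out of a blob like that, here is a minimal sketch. It assumes the result has a top-level `keyValuePairs` list whose entries carry `key.content`, `value.content` and `confidence`; those field names are an assumption and should be checked against the actual JSON. The `sample` data is made up for illustration.

```python
import json


def flatten_key_values(result, min_confidence=0.0):
    """Return {key_text: value_text} from a KIE-style result dict.

    Assumed (not confirmed) layout: result["keyValuePairs"] is a list of
    {"key": {"content": ...}, "value": {"content": ...}, "confidence": ...}.
    """
    pairs = {}
    for pair in result.get("keyValuePairs", []):
        # "value" can be absent or null when a field was left blank on the form.
        key = (pair.get("key") or {}).get("content", "").strip()
        value = (pair.get("value") or {}).get("content", "").strip()
        if key and pair.get("confidence", 1.0) >= min_confidence:
            pairs[key] = value
    return pairs


# Made-up sample data, only to show the shape the function expects.
sample = json.loads("""
{
  "keyValuePairs": [
    {"key": {"content": "Name"}, "value": {"content": "Jane Doe"}, "confidence": 0.93},
    {"key": {"content": "Total"}, "value": {"content": "42.50"}, "confidence": 0.41},
    {"key": {"content": "Signature"}, "value": null, "confidence": 0.88}
  ]
}
""")

print(flatten_key_values(sample, min_confidence=0.5))
```

Raising `min_confidence` drops the fields the model was unsure about, which is one cheap way to see where the "not perfect" part shows up.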
simonw | 1 year ago
Sending images through that API and then using an LLM to extract data from the resulting OCR text could be worth exploring.
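A rough sketch of what that second step might look like, assuming the OCR text is already in hand. `call_llm` is a hypothetical placeholder for whatever model client you use (it takes a prompt string and returns the reply string), and the prompt wording and field names are illustrative, not a tested recipe.

```python
import json
from typing import Callable, Optional


def extract_fields(
    ocr_text: str,
    fields: list[str],
    call_llm: Callable[[str], str],
) -> dict[str, Optional[str]]:
    """Ask an LLM to pull the named fields out of raw OCR text.

    call_llm is a hypothetical hook: any function mapping a prompt string
    to the model's reply string.
    """
    prompt = (
        "Extract the following fields from the OCR text of a form. "
        "Reply with a single JSON object and use null for missing fields.\n"
        f"Fields: {', '.join(fields)}\n"
        f"OCR text:\n{ocr_text}"
    )
    reply = call_llm(prompt)
    data = json.loads(reply)  # raises ValueError if the reply is not valid JSON
    # Keep only the requested fields, so extra keys in the reply are ignored.
    return {name: data.get(name) for name in fields}


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, so the sketch runs offline."""
    return '{"Name": "Jane Doe", "Total": "42.50", "Extra": "ignored"}'


result = extract_fields("Name: Jane Doe\nTotal 42.50", ["Name", "Total", "Date"], fake_llm)
print(result)
```

Passing the model call in as a function keeps the OCR and LLM halves swappable, which makes it easier to compare this against the built-in key/value extraction on the same image.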