I am doing OCR on hundreds of PDFs using AWS Textract. It requires me to convert each page of the PDF to an image and then analyze the image, and it works well for converting to Markdown format (which requires custom code). I want to try some vision models, for example Phi-3.5-vision-instruct, and compare how they do.
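One way to set up such a comparison is sketched below, under a few assumptions that are not from the post: each OCR backend is wrapped as a function taking page-image bytes and returning text, a hand-checked reference transcription exists for a few sample pages, and a simple character-level similarity is an acceptable first metric. The helper names (`blocks_to_text`, `textract_backend`, `similarity`, `compare_backends`) are hypothetical. The Textract call uses boto3's `detect_document_text`; a pipeline built on `analyze_document` would need a different wrapper, and the vision-model backend is left as a callable to plug in.

```python
import difflib
import re
from typing import Callable, Dict


def blocks_to_text(response: dict) -> str:
    """Join the LINE blocks of a Textract-style response into plain text."""
    lines = [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]
    return "\n".join(lines)


def textract_backend(page_image: bytes) -> str:
    """Run Textract text detection on one page image (needs AWS credentials)."""
    import boto3  # imported lazily so the comparison helpers work without AWS

    client = boto3.client("textract")
    response = client.detect_document_text(Document={"Bytes": page_image})
    return blocks_to_text(response)


def normalize(text: str) -> str:
    """Collapse whitespace so layout-only differences do not dominate the score."""
    return re.sub(r"\s+", " ", text).strip()


def similarity(candidate: str, reference: str) -> float:
    """Character-level similarity ratio in [0, 1] between two transcriptions."""
    matcher = difflib.SequenceMatcher(
        None, normalize(candidate), normalize(reference), autojunk=False
    )
    return matcher.ratio()


def compare_backends(
    page_image: bytes,
    backends: Dict[str, Callable[[bytes], str]],
    reference: str,
) -> Dict[str, float]:
    """Score every backend's output for one page against the reference text."""
    return {
        name: similarity(run(page_image), reference)
        for name, run in backends.items()
    }
```

A vision model such as Phi-3.5-vision-instruct would be added as another entry in `backends`, for example `{"textract": textract_backend, "phi": my_phi_backend}`, where `my_phi_backend` is a placeholder for whatever inference code is used. Scoring a small sample of pages first keeps the cost down before running all of the PDFs.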