We parse millions of PDFs using Apache Tika and process about 30,000 per dollar of compute cost. However, the structured output leaves something to be desired, and there are a significant number of pages that Tika is unable to parse.
https://tika.apache.org/
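To put the throughput figure in perspective, here is a quick back-of-the-envelope calculation. It only uses the 30,000 PDFs per dollar number quoted above; the function name is made up for illustration, and the result is a rough estimate, not a measured cost.

```python
# Figure quoted in the comment above: about 30,000 PDFs parsed per dollar of compute.
PDFS_PER_DOLLAR = 30_000


def compute_cost_usd(num_pdfs: int, pdfs_per_dollar: int = PDFS_PER_DOLLAR) -> float:
    """Estimated compute cost in dollars for parsing num_pdfs documents."""
    return num_pdfs / pdfs_per_dollar


# Roughly 33 dollars per million PDFs at the quoted rate.
print(f"{compute_cost_usd(1_000_000):.2f}")
```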
rudolph9|1 year ago
https://tesseract-ocr.github.io/tessdoc/
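One way to read this suggestion is to send the pages Tika cannot parse through OCR. Below is a minimal sketch of that fallback pattern, not anyone's actual pipeline. The `primary` and `fallback` callables are hypothetical placeholders: you would wire in your own Tika and Tesseract wrappers, which are not shown here.

```python
from typing import Callable, Tuple


def extract_text(
    page: bytes,
    primary: Callable[[bytes], str],
    fallback: Callable[[bytes], str],
) -> Tuple[str, str]:
    """Return (text, source).

    Try the primary extractor first (hypothetically a Tika wrapper); if it
    raises or returns only whitespace, use the fallback (hypothetically an
    OCR wrapper such as one around Tesseract).
    """
    try:
        text = primary(page)
    except Exception:
        # Treat any extractor failure as "no text" and fall through to OCR.
        text = ""
    if text.strip():
        return text, "primary"
    return fallback(page), "fallback"
```

Recording which extractor produced the text makes it easy to measure how often the fallback is needed, and so how much extra OCR compute it adds.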