
Pdf-extract-API – OCR with marker-pdf and transform with Llama – single API call

5 points| pkarwatka | 1 year ago |github.com

1 comment

I just published a side project: a simple, easy-to-use API around marker-pdf (state-of-the-art OCR) and Ollama LLMs. Everything is set up in docker-compose, so you can run it 100% on-prem and use it for a variety of use cases, for example feeding data to RAG/LLMs.

Features:
- No cloud or external dependencies: PyTorch-based OCR (Marker) and Ollama are shipped and configured via docker-compose, so no data is sent outside your dev/server environment
- PDF to Markdown conversion with very high accuracy, using different OCR strategies including marker, surya-ocr or tesseract
- PDF to JSON conversion using Ollama-supported models (e.g. Llama 3.1)
- LLM-improved OCR results: Llama is pretty good at fixing spelling and text issues in the OCR text
- PII removal: the tool can be used to remove Personally Identifiable Information from PDFs (see examples)
- Distributed queue processing using Celery
- Caching using Redis: the OCR results can be easily cached prior to LLM processing
- CLI tool for sending tasks and processing results
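To illustrate the caching idea from the list above, here is a minimal sketch of caching OCR output before the LLM step. It is not the project's actual code: `cache_key`, `ocr_with_cache` and the `run_ocr` callable are hypothetical names, and a plain dict stands in for Redis so the example runs on its own. The assumption is that the OCR text depends only on the PDF bytes and the chosen strategy, so a hash of both makes a usable key.

```python
import hashlib
from typing import Callable, MutableMapping


def cache_key(pdf_bytes: bytes, strategy: str) -> str:
    # Hypothetical key scheme: OCR strategy name plus a SHA-256 of the file content.
    digest = hashlib.sha256(pdf_bytes).hexdigest()
    return f"ocr:{strategy}:{digest}"


def ocr_with_cache(
    pdf_bytes: bytes,
    strategy: str,
    run_ocr: Callable[[bytes], str],
    cache: MutableMapping[str, str],
) -> str:
    # Return cached OCR text when the same file and strategy were seen before;
    # otherwise run OCR once and store the result for later LLM passes.
    key = cache_key(pdf_bytes, strategy)
    if key in cache:
        return cache[key]
    text = run_ocr(pdf_bytes)
    cache[key] = text
    return text


if __name__ == "__main__":
    calls = []

    def fake_ocr(data: bytes) -> str:
        # Stand-in for a real OCR engine; records each invocation.
        calls.append(data)
        return "# Extracted text"

    store: dict = {}  # an in-memory dict standing in for Redis
    ocr_with_cache(b"%PDF-demo", "marker", fake_ocr, store)
    ocr_with_cache(b"%PDF-demo", "marker", fake_ocr, store)
    print(f"OCR ran {len(calls)} time(s)")
```

With a cache like this, the slow OCR step runs once per file and strategy, and different LLM prompts (JSON conversion, spelling fixes, PII removal) can be retried against the stored text.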

Looking for some motivation to move the needle! Contributions are welcome.