ggnore7452 | 6 months ago
For each page:
- Extract text as usual.
- Capture the whole page as an image (~200 DPI).
- Optionally extract images/graphs within the page and include them in the same LLM call.
- Optionally add a bit of context from neighboring pages.
Then wrap everything with a clear prompt (structured output + how you want graphs handled), and you’re set.
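A minimal sketch of how the per-page pieces above could be bundled into one request. Everything here is an assumption for illustration: `PageInput`, `build_request`, the prompt wording, and the generic text/image "parts" layout are hypothetical and not any provider's actual API. Rendering the page to PNG and extracting the text are left to whatever PDF library you already use.

```python
import base64
from dataclasses import dataclass, field


@dataclass
class PageInput:
    """Everything collected for one page before the LLM call (hypothetical container)."""
    number: int
    text: str                 # text extracted as usual
    page_png: bytes           # whole page rendered as an image (~200 DPI)
    figure_pngs: list[bytes] = field(default_factory=list)  # optional images/graphs
    prev_text: str = ""       # optional context from the previous page
    next_text: str = ""       # optional context from the next page


# Illustrative prompt: structured output plus how graphs should be handled.
PROMPT = (
    "Convert this page to Markdown. Use the extracted text as the primary source "
    "and the page image to recover layout, tables, and reading order. "
    "For each graph or figure, add a short description of what it shows. "
    'Return JSON with the keys "markdown" and "figures".'
)


def to_data_url(png: bytes) -> str:
    """Encode PNG bytes as a base64 data URL."""
    return "data:image/png;base64," + base64.b64encode(png).decode("ascii")


def build_request(page: PageInput, context_chars: int = 500) -> list[dict]:
    """Assemble prompt, text, neighbor context, and images into one list of parts."""
    parts = [{"type": "text", "text": PROMPT}]
    if page.prev_text:
        # Only the tail of the previous page is kept as context.
        tail = page.prev_text[-context_chars:]
        parts.append({"type": "text", "text": "Previous page (context only):\n" + tail})
    parts.append({"type": "text", "text": f"Extracted text of page {page.number}:\n{page.text}"})
    if page.next_text:
        # Only the head of the next page is kept as context.
        head = page.next_text[:context_chars]
        parts.append({"type": "text", "text": "Next page (context only):\n" + head})
    parts.append({"type": "image", "url": to_data_url(page.page_png)})
    for png in page.figure_pngs:
        parts.append({"type": "image", "url": to_data_url(png)})
    return parts
```

You would then map `parts` onto whatever message format your model's SDK expects and send one request per page.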
At this point, models like GPT-5-nano/mini or Gemini 2.5 Flash are cheap and strong enough to make this practical.
Yeah, it’s a bit like using a rocket launcher on a mosquito, but this is actually very easy to implement, and quite flexible and powerful. It works across almost any format, Markdown is both AI- and human-friendly, and the result is surprisingly maintainable.
GaggiX|6 months ago
It all depends on the scale you need; with the API, it's easy to generate millions of tokens without thinking about it.
agentcoops|6 months ago
I can recommend the Mistral OCR API [1] if you have large jobs and don't want to think about it too much.
[1] https://mistral.ai/solutions/document-ai
rdos|6 months ago