ggnore7452 | 6 months ago

I’ve done a similar PDF → Markdown workflow.

For each page:

- Extract text as usual.

- Capture the whole page as an image (~200 DPI).

- Optionally extract images/graphs within the page and include them in the same LLM call.

- Optionally add a bit of context from neighboring pages.

Then wrap everything with a clear prompt (structured output + how you want graphs handled), and you’re set.
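A rough sketch of how the per-page call could be assembled from the steps above. It assumes you already have each page's extracted text and a PNG render of the page (from whatever PDF library you use). The function names, the prompt wording, and the generic `parts` layout are all made up for illustration and are not any provider's actual request schema.

```python
import base64

# Hypothetical prompt; adjust the output rules and graph handling to taste.
PROMPT = (
    "Convert this PDF page to Markdown. Return only Markdown. "
    "Describe each graph or figure in a short blockquote."
)


def neighbor_context(page_texts, index, max_chars=300):
    """Return (tail of the previous page, head of the next page)."""
    prev_tail = ""
    next_head = ""
    if index > 0:
        prev = page_texts[index - 1]
        prev_tail = prev[max(0, len(prev) - max_chars):]
    if index + 1 < len(page_texts):
        next_head = page_texts[index + 1][:max_chars]
    return prev_tail, next_head


def build_page_request(page_texts, index, page_png, figure_pngs=(), max_chars=300):
    """Bundle prompt, extracted text, neighbor context, and images for one page.

    The returned list of dicts is a generic placeholder shape (an assumption);
    map it onto whatever multimodal API you actually call.
    """
    prev_tail, next_head = neighbor_context(page_texts, index, max_chars)
    parts = [
        {"type": "text", "text": PROMPT},
        {"type": "text", "text": "Extracted text:\n" + page_texts[index]},
    ]
    if prev_tail:
        parts.append({"type": "text", "text": "End of previous page:\n" + prev_tail})
    if next_head:
        parts.append({"type": "text", "text": "Start of next page:\n" + next_head})
    # Whole-page render first, then any figures cropped out of the page.
    for png in (page_png, *figure_pngs):
        parts.append({
            "type": "image",
            "media_type": "image/png",
            "data": base64.b64encode(png).decode("ascii"),
        })
    return parts
```

Sending the extracted text alongside the page image gives the model something to check its reading of the image against, and the neighbor snippets help with sentences or tables that run across a page break.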

At this point, models like GPT-5-nano/mini or Gemini 2.5 Flash are cheap and strong enough to make this practical.

Yeah, it’s a bit like using a rocket launcher on a mosquito, but it’s very easy to implement, flexible, and powerful. It works across almost any format, and Markdown is both AI- and human-friendly and surprisingly maintainable.

GaggiX|6 months ago

>are cheap and strong enough to make this practical.

It all depends on the scale you need them at. With the API, it's easy to generate millions of tokens without thinking.

agentcoops|6 months ago

You don't need full reasoning to get accurate results, so even with GPT5 it's still pretty cheap for a one-time job and easy to reason about costs. It's certainly cheaper if you have data where reliability is key and classical OCR will undoubtedly require some manual data cleaning...
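For a back-of-the-envelope estimate, the arithmetic is just pages times tokens per page times price per token. A minimal sketch, where every number in the example call is a placeholder assumption and not a real price or measurement:

```python
def estimate_cost_usd(pages, input_tokens_per_page, output_tokens_per_page,
                      usd_per_million_input, usd_per_million_output):
    """Estimate the total API cost of a one-time conversion job."""
    input_cost = pages * input_tokens_per_page * usd_per_million_input / 1_000_000
    output_cost = pages * output_tokens_per_page * usd_per_million_output / 1_000_000
    return input_cost + output_cost


# Placeholder numbers only: plug in your provider's current prices
# and token counts measured on a sample of your own pages.
total = estimate_cost_usd(
    pages=10_000,
    input_tokens_per_page=1_500,
    output_tokens_per_page=800,
    usd_per_million_input=0.10,
    usd_per_million_output=0.40,
)
print(f"estimated cost: ${total:.2f}")
```

With these made-up inputs the job comes to about $4.70 (15M input tokens plus 8M output tokens), which shows how the total scales linearly with page count.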

I can recommend the Mistral OCR API [1] if you have large jobs and don't want to think about it too much.

[1] https://mistral.ai/solutions/document-ai

rdos|6 months ago

In that case you should run a model locally. This one, for example: https://huggingface.co/ds4sd/docling-models