You can add OCR with Gemini, and presumably that would lead to better results than the OCR model we compared against. However, doing so guarantees that the entire corpus of documents you're processing goes through a large VLM, which can be prohibitively expensive and slow.

There are definitely trade-offs to be made here, but we found this to be the most effective approach in most cases.

serjester|7 months ago

VLMs capable of parsing images with high fidelity are 10-50x cheaper than the frontier models. Any savings from not parsing are quickly going to be wiped out if someone has any actual traffic. Not to mention the massive hits to long-context accuracy and latency.
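
To put a rough shape on the break-even argument, here is a minimal sketch. Every number in it is a made-up placeholder (prices in micro-dollars per page), not a quote from any provider, and `break_even_queries` is a hypothetical helper. It assumes parsing is a one-time cost over the whole corpus and that each query then touches a fixed number of pages.

```python
from typing import Optional


def break_even_queries(
    num_pages: int,
    parse_cost_per_page: int,
    raw_query_cost_per_page: int,
    parsed_query_cost_per_page: int,
    pages_per_query: int,
) -> Optional[int]:
    """Return the number of queries after which parsing upfront is cheaper.

    All costs are integers in micro-dollars (hypothetical values).
    Returns None if querying parsed text is not cheaper per query.
    """
    # One-time cost of running the whole corpus through the cheaper parser.
    upfront = num_pages * parse_cost_per_page

    # Per-query saving from sending parsed text instead of raw page images.
    saving = pages_per_query * (raw_query_cost_per_page - parsed_query_cost_per_page)
    if saving <= 0:
        return None

    # Integer ceiling division: smallest n with n * saving >= upfront.
    return (upfront + saving - 1) // saving


# Placeholder numbers: parsing is 20x cheaper per page than a frontier call
# on the raw image, which sits inside the 10-50x range mentioned above.
print(break_even_queries(
    num_pages=10_000,
    parse_cost_per_page=500,
    raw_query_cost_per_page=10_000,
    parsed_query_cost_per_page=2_000,
    pages_per_query=5,
))  # → 125
```

With these placeholder values the upfront parse pays for itself after 125 queries. The real figure depends entirely on actual prices, corpus size, and how many pages each query pulls in.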