I find it hard to trust a product that talks about "OCR" when I upload a PDF with text in images, query the document after upload, and get a message that says "I can't interpret images".
Then are you actually doing OCR, or are you just extracting embedded text?
I’d imagine their capabilities mirror those of Mistral OCR [1]. Mistral outputs markdown, so the image would have to be convertible to a reasonably useful markdown structure (charts, tables, etc.).
In the same vein: most (all?) LLM integrations for YouTube just scrape the transcript. I -think- Google's AI Studio does more, but I'm unsure.
I mean, I get it: bulk video processing would be crazy expensive. But at least mention that you're only analyzing the transcript, especially if you're a paid product.
Honestly, the vibes aren't great. Gemini is a lot more flexible for handling PDFs - you can prompt it to do a bunch of other things - and Mistral OCR appears to hallucinate if it can't correctly read handwriting, a common problem with vision LLM based OCR tools.
The way Mistral OCR handles images within the text is disappointing - it doesn't attempt to interpret them, just extracts them out as binary blobs. A vision LLM can usually do a great job of describing an image, but with Mistral OCR you have to manually run that as a separate step.
Knowing that you have to do that as a separate step adds a whole extra level of complexity too.
For example, if some documents contain images and some don't, you need to add extra steps to your processing, and you potentially introduce hallucinations.
What are you using for document extraction lately, Simon?
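To make the "separate step" concrete, here is a minimal sketch of what that extra pass could look like. It assumes the OCR output is markdown with image references like `![img-0.jpeg](img-0.jpeg)` and that the extracted image bytes are available in a dict keyed by that reference; both are assumptions for illustration, not a confirmed description of Mistral's output. `describe` is a hypothetical callable standing in for whatever vision model you would call.

```python
import re
from typing import Callable, Dict

# Matches markdown image references such as ![img-0.jpeg](img-0.jpeg).
IMAGE_REF = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def describe_images(
    markdown: str,
    images: Dict[str, bytes],
    describe: Callable[[bytes], str],
) -> str:
    """Append a model-generated description after each image reference.

    `images` maps the reference target (hypothetical key format) to raw bytes.
    `describe` is a hypothetical wrapper around a vision model call.
    """
    if not IMAGE_REF.search(markdown):
        # Text-only page: skip the vision step entirely.
        return markdown

    def _annotate(match: re.Match) -> str:
        data = images.get(match.group(2))
        if data is None:
            # No bytes for this reference: leave it untouched.
            return match.group(0)
        description = describe(data)
        # Label the description so downstream readers know it is generated.
        return f"{match.group(0)}\n\n> Image description (model-generated): {description}"

    return IMAGE_REF.sub(_annotate, markdown)
```

Labelling the generated text keeps it separable from the OCR output, so a hallucinated description can at least be traced back to the vision step rather than the document.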
I have a question about Mistral OCR. If I give the model a PDF that is 90% text, is it actually performing OCR on an image representation of the text? Or is it smart enough to extract the text directly and only use OCR on images?
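The thread doesn't answer this for Mistral OCR, but if you were building the routing yourself, a rough sketch might look like the following. Everything here is hypothetical: `PageInfo`, the thresholds, and the strategy names are illustrative, and the embedded text and image coverage are assumed to come from an upstream PDF library step that is not shown.

```python
from dataclasses import dataclass


@dataclass
class PageInfo:
    embedded_text: str       # text layer from the PDF (hypothetical upstream step)
    image_area_ratio: float  # fraction of the page covered by raster images, 0.0 to 1.0


def choose_strategy(
    page: PageInfo,
    min_chars: int = 50,            # assumed threshold, tune for your documents
    max_image_ratio: float = 0.5,   # assumed threshold, tune for your documents
) -> str:
    """Pick how to process one page: direct extraction, OCR, or both."""
    text_len = len(page.embedded_text.strip())
    has_text_layer = text_len >= min_chars

    if has_text_layer and page.image_area_ratio <= max_image_ratio:
        # Mostly real text: trust the embedded text layer.
        return "extract"
    if has_text_layer:
        # Text layer exists but images dominate: extract text, OCR the images too.
        return "extract+ocr"
    # Little or no text layer (e.g. a scanned page): fall back to OCR.
    return "ocr"
```

For a PDF that is 90% text, most pages would land in the "extract" branch under these assumptions, and only scanned or image-heavy pages would pay the OCR cost.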
[+] [-] sbarre|1 year ago|reply
[+] [-] 0x62|1 year ago|reply
[1] https://mistral.ai/en/news/mistral-ocr
[+] [-] tmpz22|1 year ago|reply
[+] [-] setnone|1 year ago|reply
[+] [-] bilater|1 year ago|reply
[+] [-] simonw|1 year ago|reply
[+] [-] brianjking|1 year ago|reply
[+] [-] bilater|1 year ago|reply
[+] [-] bilater|1 year ago|reply
So cool in fact, I got distracted and ended up building an open source PDF parser and chat app!
Presenting Auntie PDF - your all-knowing guide that unpacks every PDF into clear, actionable insights.
You can upload a PDF or point to a public link, parse it, and then ask questions. All open source and free.
[+] [-] onebitwise|1 year ago|reply
Not working for me on a file like this: https://files.catbox.moe/gii0pu.pdf It says the file is larger than 10MB (it's 7MB), or it fails on the URL.
[+] [-] pogue|1 year ago|reply
[+] [-] jbaudanza|1 year ago|reply
[+] [-] foundzen|1 year ago|reply
[+] [-] t-3|1 year ago|reply
[+] [-] qingcharles|1 year ago|reply
[+] [-] elanning|1 year ago|reply
[+] [-] bilater|1 year ago|reply
[+] [-] JoelJacobson|1 year ago|reply
It would be nice to have a [Download Combined Rendered] button to download a self-contained .html page of the combined rendered output.
[+] [-] bilater|1 year ago|reply
[+] [-] shnpln|1 year ago|reply
[+] [-] daft_pink|1 year ago|reply
[+] [-] mjyoon|1 year ago|reply
[+] [-] yannis|1 year ago|reply
[+] [-] bilater|1 year ago|reply
[+] [-] ab_testing|1 year ago|reply
[+] [-] bilater|1 year ago|reply
[+] [-] triyambakam|1 year ago|reply
[+] [-] bilater|1 year ago|reply
[+] [-] n8m8|1 year ago|reply
[+] [-] bilater|1 year ago|reply
[+] [-] throwaway81348|1 year ago|reply
[+] [-] eastoeast|1 year ago|reply
[+] [-] bilater|1 year ago|reply