
Benchmarking vision-language models on OCR in dynamic video environments

142 points | ashu_trv | 1 year ago | arxiv.org

58 comments


alberto-m|1 year ago

It seems to me that the software is occasionally doing better than the supposed “ground truth” (who annotated that?), and I don't understand why the authors are blindly following the latter, and the reviewers apparently approved that.

In Figure 1 the authors complain that Gemini “misreads 'ss ety!' as 'ness ety!'”, but even a casual look at the image reveals that Gemini's reading is correct.

In Figure 11, they state that Claude is “altering the natural sequence of ideas in the ground truth”, except that the sequence in the ground truth makes no sense, while Claude's order does (only the initial “the” is misplaced).

virgilp|1 year ago

I think the goal here was to convince the AI to actually read chars ("OCR") rather than speculate what might be written on paper/in the image. Hence the ground truth explicitly removes the letters and word parts that are obscured, even when they can be guessed.

TBH, I'm not sure it's a good test. I can somewhat see the argument against "BASELINE" for ground truth - the underlying text might have been BASE(IAKS), for all we know. But, IMO the ground truth should have been "Direction & ess" at the very least. And, more significantly than that - it's a fake scenario that we don't care about in practice. Why use that? Use invoices with IDs that sound like words but are not. Use license plates and stuff like that. Heck, use large prints of random characters, mixed with handwritten gibberish.

For at least some of the images that they used, the expectation from a good text reader is actually to understand context and not blindly OCR. Take "Trader Joe's": we *know* that's an 's', but only from outside context; from OCR, it might've been an 8, there's really no way to tell. Why accept the "s" in ground truth, but reject the full word "Coconut" (which is obviously what is written on the can, even if partially obscured)? Furthermore, a human would know what kind of products are sold by Trader Joe's, and coupling that with the tops of the letters "M I L" that are visible, would deduce that's Coconut Milk. So really, Claude nailed that one.

8organicbits|1 year ago

I think there are multiple possible goals we could imagine in text recognition tasks. Should the AI guess the occluded text? That could be really helpful in some instances. But if the goal is OCR, then it should only recognize characters optically, and any guessing at occluded characters is undesired.

bufferoverflow|1 year ago

In the very first example (occluded text) the "ground truth" is just incorrect.

dimatura|1 year ago

Re: reviewers, I don't see any mention of this being accepted into a peer-reviewed venue. Peer review isn't necessary for arxiv submissions.

Vt71fcAqt7|1 year ago

>reviewers apparently approved that.

What reviewers?

spwa4|1 year ago

This looks a lot like "compared to a bunch of people who are 10 years behind (non-transformer, vision-only models), and people who aren't trying (aren't optimizing for OCR) Google is doing real well"

EasyOCR is LSTM-CTC from 2007; RapidOCR is a ConvNet approach from 2021; both are focused on speed. Both will vastly outperform almost any transformer model, and certainly a big one, on speed and memory usage, but they aren't state of the art on accuracy. This has been well known for a decade at this point; two decades for LSTM-CTC.

Plus, I must say the GPT-4o results look a lot saner. "COCONUT" (GPT-4o) vs "CONU CNBC" (Gemini) vs Ground Truth "C CONU CNBC". And, obviously the ground truth should be "COCONUT MILK" (the word milk is almost entirely out of the picture, but is still the right answer that a human would give). The "C CONU" comes from the first O of COCONUT being somewhat obscured by a drawing of ... I don't know what the hell that is. It's still very obvious it's meant to be "COCONUT MILK", so the GPT-4o answer is still not quite perfect, but heaps better than all the others.

Now, this looks very much like it might be temperature-related, and I can find nothing in the paper about changing the temperature, which is imho a very big gap. (Temperature gives transformer models more freedom to choose more creative answers. The better performance of GPT-4o might well be the result of such a more creative choice, and it might also explain why Gemini tries so hard to stay so very close to the ground truth. It's still quite the accomplishment to succeed, but GPT-4o is still better.)

michaelt|1 year ago

> And, obviously the ground truth should be "COCONUT MILK" (the word milk is almost entirely out of the picture, but is still the right answer that a human would give).

Maybe? Seems application-dependent to me.

If you're OCRing checks or invoices or car license plates or tables in PDF documents, you might prefer a model that's more conservative when it comes to filling in the blanks!

And even when recognising packaged coconut products, you've also got your organic coconut oil, organic coconut milk with reduced fat, organic coconut cream, organic coconut flakes, organic coconut desiccated chips, organic coconut and strawberry bites, organic coconut milk powder, organic coconut milk block, organic coconut milk 9% fat, organic coconut yoghurt, organic coconut milk long life barista-style drink, organic coconut kefir, organic coconut banana and pear baby food pouches, organic coconut banana and pineapple smoothie, organic coconut scented body wash and so on.

dylan604|1 year ago

>The "C CONU" comes from the first O of COCONUT being somewhat obscured by a drawing of ... I don't know what the hell that is.

It's clearly the stem from the bell pepper in front of the can. You're complaining that the software is lesser than a human, yet it appears your human needs better training in understanding context too.

pilooch|1 year ago

The question is: what is OCR for? If it's to answer questions and work with a document, then VLMs do actually contain self-correcting mechanisms. That is, the end-to-end image + text input to text output is statistically grounded, by training. So the question to ask is: what do you need OCR for? Feeding an LLM? Then feed it to the VLM instead. Some other usage? Well, to be decided. But as of now, CTC and LSTMs are done with, because VLMs do everything: finding the area to read, reading, embedding, and answering. OCR was a mid-step; it's going away.

infecto|1 year ago

It's not obvious at all—it depends on the use case.

You also didn’t really counter the paper. Sure, the OCR models are old, but what should they have tested instead? Are there better open-source OCR models available that would have made for a fairer comparison?

cvz|1 year ago

This is what's so terrifying about uses of "AI". People's idea of accuracy being "tell me what I think is there", not "tell me what's there". The can in this image probably says "coconut milk", but the image certainly doesn't.

speerer|1 year ago

I think it's useful to add the context that CNBC is correct and does appear at the top right of that picture. CNBC is not a mis-transcribing of MILK, and the letters M, I, L and K are not actually visible in the picture.

yorwba|1 year ago

What would you say is currently the most accurate OCR solution if you're not concerned about speed and memory usage?

_stillmind|1 year ago

The paper says, "GPT-4o achieves the highest overall accuracy, while Gemini-1.5 Pro demonstrates the lowest word error rate." Saying Gemini "beats everyone" in this benchmark is misleading.
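(For readers unfamiliar with why the two rankings can disagree: overall accuracy is typically exact-match, while WER is an edit-distance measure over word tokens, so partial matches are penalized differently. A minimal sketch of a standard WER computation; the function name is mine, not the paper's:)

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over word tokens, via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

A model can produce more exact-match misses yet still stay closer word-by-word, which is how one model tops accuracy while another tops WER.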

silveraxe93|1 year ago

Posted 4 days ago:

> Three state of the art VLMs - Claude-3, Gemini-1.5, and GPT-4o

Literally none of those are state of the art. Academia is completely unprepared to deal with the speed at which AI develops. This is extremely common in research papers.

That's literally in the abstract. If I can see a completely wrong sentence 5 seconds into reading the paper, why should I read the rest?

michaelt|1 year ago

What models would you recommend instead, for sophisticated OCR applications?

Honestly I thought Claude-3 and GPT-4o were some of the newest major models with vision support, and that models like o1 and deepseek were more reasoning-oriented than OCR-oriented.

lisnake|1 year ago

They may have been SotA at the time of writing.

eptcyka|1 year ago

The speed of publishing is just too slow. If you want to apply any kind of scientific rigor and have your peers check what you're doing (not even doing a full peer review), things take more time than just posting on blogs and iterating.

hubraumhugo|1 year ago

As someone building in this space, we've found that raw OCR accuracy is just one piece (and it's becoming a commodity).

The real challenge is building reliable and accurate ETL pipelines (document ingestion from web, OCR, classification, validation, etc.) that work at scale in production.

The best products will be defined by everything "non-AI", like UX, performance, and human-in-the-loop feedback loops for non-techies.

Avoiding over-reliance on specific models also helps. With good internal eval data and benchmarks, you can easily switch or fine-tune models.

mtrovo|1 year ago

That’s the point of using AI in the first place. If your product is just a polished interface on top of a prompt, then your moat isn’t that strong, and chances are your product will be commoditized soon.

By building a good UX and integrating it with other processes that require traditional collaboration, you increase the chances that replicating your secret sauce is either infeasible or too difficult for newcomers to bother.

HannesWes|1 year ago

This looks very interesting. I conducted some explorations of whether LLMs can be used to extract information from hand-written forms [0][1]. Such a system could allow users to snap pictures of forms and other legal documents, automatically extract structured information, and use this information to e.g. automatically fill out new forms or determine whether the user has the right to a government benefit.

The initial results were quite promising, as GPT-4o could reliably identify the correct place in the form for the information, and moderately reliably extract the values, even if the image was blurry or the text was sloppily written. Excited to see how Gemini 2.0 would do on this task!

[0] https://arxiv.org/abs/2412.15260

[1] https://github.com/hwestermann/AI4A2J_analyzing_images_of_le... (code and data)

nolok|1 year ago

I have lots of customer files, and I've looked around at all these AI tools for something, paid or self-hosted or whatever, where I point it at a folder of xlsx and pdf files and then I can query "What's the end date of M Smith's contract" or "How much does M Smith still owe", and I've been very disappointed: it's either very complicated, or they break down with non-text-based PDFs, or...

It feels to me that if you need to provide a schema and preprocess the data and this and that, then in the end all the AI provides is a way to do some SQL in natural language. Yes, that's better, but it doesn't remove the actual pain point if you're a tech user.

Then again maybe I'm wrong, didn't find the right tool or didn't understand it.

Is what I'm looking for something that actually exists (and works, not just on simple cases)?

fhd2|1 year ago

I worked on this a bit 1-2 years ago. Back then, LLMs weren't really up to the task, but I found them OK for suggestions that a human double-checks. That brings us to the Ironies of Automation, though (human oversight of automation with a review process doesn't really work; it's a paper worth reading).

We tried several dedicated services for extracting structured data and factoids like that from documents: First Google Document AI, then a dedicated provider focusing solely on our niche. Back then, that gave the best results.

There wasn't enough budget to go deeper into this and we just reverted to doing it manually. But I think a really cool way to do this would be to make a user friendly UI where they can see suggestions and the text snippets they were extracted from as they skim through the document, with a simple way to modify and accept these. I think that'd work to scale the process quite a bit. Focusing the attention of the human at the relevant parts of the document basically.

Haven't worked on this space since then, but I'm pretty bearish on fully automated fact extraction. Getting stuff in contracts and invoices wrong is typically not acceptable. I think a solid human in the loop approach is probably still the way to go.

tpm|1 year ago

I'm not completely up to date, but a few months ago Qwen2-VL (runnable locally) was able to perfectly read text from images. So I'd say you would still need to preprocess that folder into text to get any reasonable speed for queries, but after that, if you feed the data to an LLM with a long enough context, it should just work. If, on the other hand, it's too much data and the LLM is required to use tools, then it is indeed still too soon. But it is coming.

malanj|1 year ago

If you're wondering how they prompt the models:

"Perform OCR on this image. Return only the text found in the image as a single continuous string without any newlines, additional text, or commentary. Separate words with single spaces. For any truncated, partially visible, or occluded text, include only the visible portions without attempting to complete or guess the full text. If no text is present, return empty double quotes."

Found in: https://github.com/video-db/ocr-benchmark/blob/main/prompts....
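(For context, here's a minimal sketch of how that prompt could be packaged into an OpenAI-style chat request with the image inlined as a data URL. The model name, the temperature pin, and the helper function are my assumptions for illustration, not something the benchmark repo confirms:)

```python
import base64

# Prompt text as quoted from the benchmark repo above.
OCR_PROMPT = (
    "Perform OCR on this image. Return only the text found in the image as a "
    "single continuous string without any newlines, additional text, or "
    "commentary. Separate words with single spaces. For any truncated, "
    "partially visible, or occluded text, include only the visible portions "
    "without attempting to complete or guess the full text. If no text is "
    "present, return empty double quotes."
)

def build_ocr_request(image_bytes: bytes, model: str = "gpt-4o") -> dict:
    """Assemble an OpenAI-style chat request body for one video frame."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "temperature": 0,  # pin sampling so repeated runs agree
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": OCR_PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }
```

Pinning temperature to 0 would also address the reproducibility concern raised elsewhere in this thread; the repo's prompt file doesn't say what setting was actually used.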

Terretta|1 year ago

TL;DR: For original object truth rather than image truth, this paper shows VLMs are superior, even though the prompt shows the authors are "holding it wrong".

Yet another paper where the authors don't address what tokens are. It's like publishing Rolling pin fails at math or Calculator fails to turn dough ball into round pizza.

While I can understand where they're coming from in a desire to avoid hallucination when doing some letter for letter transcription from an image, certainly most times you reach for OCR you want the original copy, despite damage to its representation (paper tears, coffee stains, hands in front of it). Turns out token conjunction probability conjectures come in handy here!

Whether the image of an object, or the object itself, is "Ground Truth" is an exercise left to the user's goal. Almost all use cases would want what was originally written on the object, not its present occluded representation.

retskrad|1 year ago

People say CPU benchmarks are meaningless (what does even 10-15% better mean in practice?), but LLM benchmarks are even more of a mystery. The same LLM will produce a novel output every time you give it the exact same prompt.

deivid|1 year ago

Are there any "good" OCR models that run in restricted/small environments? I'm thinking about local models for phone-sized CPUs

Obviously these models would have lower accuracy, but running at all would be nice.

echelon|1 year ago

Are there any benchmarks (speed, accuracy, etc.) for non-OCR use cases? I want to label images and videos, but don't really care about text.

croes|1 year ago

Does everyone also need huge data centers and lots of energy?

belter|1 year ago

Gemini is so bad I gladly cancelled my paid account. But hey, maybe AI and 50B dollars is what was needed to get a better OCR...

bobjordan|1 year ago

Really? This surprises me, because I use OpenAI Pro for $200 per month and I still fall back to my $20-per-month Gemini account a lot these days. I like the new 2.0 experimental speed and how it defaults to diving into producing usable code immediately, whereas my OpenAI pro mode will spend a few minutes giving an initial answer that beats around the bush at a much higher level.

So my workflow has evolved to using Gemini to iterate my initial thinking and frame out requirements and first-draft code. Then, when I get about 2,000-3,000 lines for a detailed initial pro-mode prompt, I send that to OpenAI pro mode, and that's where it shines. But I really like starting with the Gemini 2.0 model first.

The main thing I dislike about Gemini is that I often need to tell it "please continue" when it reaches its output limit. But it nearly always picks up right where it left off and continues its output. This is critical in using Gemini.

casey2|1 year ago

It's not surprising that google has such a huge mote with their highly illegal and unethical activity of scanning and digitizing billions of pages of copyrighted work to train their models. Oh wait, google books search was fair use. I got it confused with LLMs.

Terretta|1 year ago

> It's not surprising that google has such a huge mote with their highly illegal and unethical activity of scanning and digitizing billions of pages of copyrighted work to train their models.

Excellent Freudian slip (proverb allusion suggesting Google has a blind spot, while discussing OCR).