A new benchmark study evaluates Vision-Language Models (Claude-3, Gemini-1.5, GPT-4o) against traditional OCR tools (EasyOCR, RapidOCR) for extracting text from videos. The findings show VLMs outperforming OCR in many cases, but also highlight challenges such as hallucinated text and difficulty with occluded or stylized fonts.
The dataset (1,477 manually annotated frames) and benchmarking framework are publicly available to encourage further research.
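The post doesn't spell out the evaluation metrics, but a standard way to score OCR output against manually annotated ground truth is character error rate (CER): Levenshtein edit distance normalized by the reference length. A minimal sketch (the function name and example strings are illustrative, not from the benchmark):

```python
def cer(predicted: str, ground_truth: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    m, n = len(predicted), len(ground_truth)
    # Single-row dynamic-programming table for edit distance.
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i  # prev holds the diagonal (dp[i-1][j-1])
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,        # deletion
                dp[j - 1] + 1,    # insertion
                prev + (predicted[i - 1] != ground_truth[j - 1]),  # substitution
            )
            prev = cur
    return dp[n] / max(n, 1)

# Lower is better; a hallucinated or garbled prediction scores higher.
print(cer("SALE ENDS FRIDAY", "SALE ENDS FRIDAY"))  # → 0.0
```

Frame-level CER (or its word-level analogue, WER) makes OCR tools and VLMs directly comparable on the same annotated frames.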
ashu_trv|1 year ago
Paper: https://arxiv.org/abs/2502.06445
Dataset & Repo: https://github.com/video-db/ocr-benchmark
Would love to hear thoughts from the community on the future of VLMs in OCR.