
forresti | 1 year ago

VideoLLM from Meta! LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Paper: https://huggingface.co/papers/2410.17434
Code: https://github.com/Vision-CAIR/LongVU
Project (Demo): https://vision-cair.github.io/LongVU

We propose LongVU, a video LLM with a spatiotemporal adaptive compression mechanism designed for real-world hour-long video understanding. LongVU adaptively reduces the number of video tokens by leveraging (1) DINOv2 feature similarity across frames, (2) cross-modal text-frame similarity, and (3) temporal frame similarity.
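To give a feel for the idea, here is a minimal sketch of temporal frame pruning by feature similarity. This is not the LongVU implementation (which uses DINOv2 and cross-modal features); the function name, the threshold value, and the toy 2-D feature vectors are all illustrative assumptions.

```python
import numpy as np

def prune_similar_frames(features: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Hypothetical sketch: keep a frame only if its cosine similarity to the
    last kept frame is below `threshold`; near-duplicates are dropped.
    In LongVU the per-frame features would come from DINOv2, not toy vectors."""
    kept = [0]  # always keep the first frame
    for i in range(1, len(features)):
        a, b = features[kept[-1]], features[i]
        cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        if cos < threshold:
            kept.append(i)
    return kept

# Toy example: frames 1 and 3 are near-duplicates of their predecessors.
frames = np.array([
    [1.0, 0.0],
    [0.999, 0.01],   # near-duplicate of frame 0 -> pruned
    [0.0, 1.0],      # distinct -> kept
    [0.01, 0.999],   # near-duplicate of frame 2 -> pruned
])
print(prune_similar_frames(frames))  # [0, 2]
```

Dropping redundant frames this way is what lets the token budget stretch to hour-long videos: only frames that add new visual information consume tokens.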

1. High quality on video-based QA: 67.6% on EgoSchema, 66.9% on MVBench, 65.4% on MLVU, and 59.5% on VideoMME (long).
2. +5% average accuracy boost across various video understanding benchmarks compared to LLaVA-OneVision and VideoChat2.
3. Our edge model, LongVU-3B, also outperforms 4B counterparts such as VideoChat2 (Phi-3) and Phi-3.5-vision-instruct by a large margin.
