Great article. This may be my all-time favorite deep dive post on RAG strategies.
It's super interesting to me how the process of fully making audio/video searchable requires so much processing. Like, extracting the audio and video, transcribing the audio, chunking the video into 15-sec scenes and describing them visually, etc.
I wonder if as a test you could use the video descriptions, run them as a prompt through something like Veo, then stitch them together into something close to the original. Wild.
In this blog we detail the API design and technical decisions we made when adding audio/video support to Ragie's RAG service. We explore some of the approaches we tried and the rationale behind what we landed on. Worth a read if you're building similar systems.
Here's a TL;DR:
- Built a full pipeline that processes audio/video → transcription + vision descriptions → chunking → indexing
- Audio: faster-whisper with large-v3-turbo (4x faster than vanilla Whisper)
- Video: chose Vision LLM descriptions over native multimodal embeddings (2x faster, 6x cheaper, better results)
- 15-second video chunks hit the sweet spot for detail vs. context
- Source attribution with direct links to exact timestamps
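For anyone curious how the last two bullets might fit together, here's a minimal sketch of splitting a media timeline into 15-second chunks that each carry a timestamp deep link. The `t=<seconds>` URL parameter and the example URL are assumptions for illustration, not Ragie's actual schema:

```python
from dataclasses import dataclass

CHUNK_SECONDS = 15  # the 15-second window described above


@dataclass
class VideoChunk:
    start: float      # chunk start time, in seconds
    end: float        # chunk end time, in seconds
    source_url: str   # deep link back to the exact timestamp


def chunk_timeline(duration: float, base_url: str) -> list[VideoChunk]:
    """Split a media timeline into 15-second chunks, each with a
    deep link to its start timestamp for source attribution."""
    chunks = []
    start = 0.0
    while start < duration:
        end = min(start + CHUNK_SECONDS, duration)
        chunks.append(VideoChunk(start, end, f"{base_url}?t={int(start)}"))
        start = end
    return chunks


# A 40-second clip yields chunks covering 0-15s, 15-30s, and 30-40s.
chunks = chunk_timeline(40.0, "https://example.com/media/abc123")
```

Each chunk's description and transcript segment would then be indexed alongside its `source_url`, so retrieval results can point straight at the moment in the video they came from.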
Happy to answer any further questions folks might have!
rdegges|7 months ago
mkauffman23|7 months ago
bobremeika|7 months ago