Great article. This may be my all-time favorite deep dive post on RAG strategies.
It's super interesting to me how the process of fully making audio/video searchable requires so much processing. Like, extracting the audio and video, transcribing the audio, chunking the video into 15-sec scenes and describing them visually, etc.
I wonder if as a test you could use the video descriptions, run them as a prompt through something like Veo, then stitch them together into something close to the original. Wild.
In this blog we detail the API design and technical decisions we made when adding audio/video support to Ragie's RAG service. We explore some of the approaches we tried and the rationale behind what we landed on. Worth a read if you're building similar systems.
Here's a TL;DR:
- Built a full pipeline that processes audio/video → transcription + vision descriptions → chunking → indexing
- Audio: faster-whisper with large-v3-turbo (4x faster than vanilla Whisper)
- Video: chose Vision LLM descriptions over native multimodal embeddings (2x faster, 6x cheaper, better results)
- 15-second video chunks hit the sweet spot for detail vs. context
- Source attribution with direct links to exact timestamps
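For anyone curious how the last two bullets might fit together, here's a minimal sketch of splitting a media timeline into 15-second chunks that each carry a timestamp deep link. The `t=<seconds>` URL parameter and the example URL are assumptions for illustration, not Ragie's actual schema:

```python
from dataclasses import dataclass

CHUNK_SECONDS = 15  # the 15-second window described above


@dataclass
class VideoChunk:
    start: float      # chunk start time, in seconds
    end: float        # chunk end time, in seconds
    source_url: str   # deep link back to the exact timestamp


def chunk_timeline(duration: float, base_url: str) -> list[VideoChunk]:
    """Split a media timeline into 15-second chunks, each with a
    deep link to its start timestamp for source attribution."""
    chunks = []
    start = 0.0
    while start < duration:
        end = min(start + CHUNK_SECONDS, duration)
        chunks.append(VideoChunk(start, end, f"{base_url}?t={int(start)}"))
        start = end
    return chunks


# A 40-second clip yields chunks covering 0-15s, 15-30s, and 30-40s.
chunks = chunk_timeline(40.0, "https://example.com/media/abc123")
```

Each chunk's description and transcript segment would then be indexed alongside its `source_url`, so retrieval results can point straight at the moment in the video they came from.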
Happy to answer any further questions folks might have!
rdegges|7 months ago
mkauffman23|7 months ago
bobremeika|7 months ago