

In this blog post we detail the API design and technical decisions we made when adding audio and video support to Ragie's RAG service. We explore some of the approaches we tried and the rationale behind what we landed on. Worth a read if you're building similar systems.

Here's a TLDR:
- Built a full pipeline that processes audio/video → transcription + vision descriptions → chunking → indexing
- Audio: faster-whisper with large-v3-turbo (4x faster than vanilla Whisper)
- Video: Chose Vision LLM descriptions over native multimodal embeddings (2x faster, 6x cheaper, better results)
- 15-second video chunks hit the sweet spot for detail vs context
- Source attribution with direct links to exact timestamps
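To make the chunking and timestamp-attribution steps concrete, here is a minimal sketch, not our production code. It assumes the transcription step has already produced timestamped segments (start, end, text). The names `Segment`, `Chunk`, and `chunk_segments` are hypothetical, and the `#t=` suffix is borrowed from the W3C Media Fragments convention as a stand-in for whatever link format a real service would use.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Segment:
    """One timestamped piece of transcript or description (hypothetical shape)."""
    start: float  # seconds from the beginning of the media
    end: float
    text: str


@dataclass
class Chunk:
    """A fixed-length window of text plus a link back to where it starts."""
    start: float
    end: float
    text: str
    link: str


def chunk_segments(
    segments: Iterable[Segment],
    source_url: str,
    window: float = 15.0,  # 15-second windows, per the TLDR above
) -> List[Chunk]:
    # Group segments by the window their start time falls into.
    buckets: Dict[int, List[str]] = {}
    for seg in segments:
        idx = int(seg.start // window)
        buckets.setdefault(idx, []).append(seg.text.strip())

    chunks: List[Chunk] = []
    for idx in sorted(buckets):
        start = idx * window
        chunks.append(
            Chunk(
                start=start,
                end=start + window,
                text=" ".join(buckets[idx]),
                # Assumed link format: media fragment pointing at the window start.
                link=f"{source_url}#t={int(start)}",
            )
        )
    return chunks


if __name__ == "__main__":
    demo = [
        Segment(0.0, 4.0, "Hello"),
        Segment(4.0, 9.0, "world"),
        Segment(16.0, 20.0, "next"),
    ]
    for chunk in chunk_segments(demo, "https://example.com/video.mp4"):
        print(chunk.link, chunk.text)
```

Each chunk carries its own start time, so a retrieval hit can be turned into a link that jumps to the relevant moment. Windows with no segments are simply skipped in this sketch.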

Happy to answer any further questions folks might have!


bobremeika|7 months ago

Source attribution with direct links to exact timestamps is truly unique when it comes to A/V RAG solutions.