eevmanu | 4 months ago
Is there anything you can share about the architecture or pipeline you used for it? A high-level overview would be enough.
I’m guessing you’re doing video-to-image, image-to-text, and then text-to-docs, right? Since not all of the models you mentioned are multimodal.
alexattt | 4 months ago
More or less. I have a Python worker that does the video processing job: it splits the video into frames, deduplicates the frames, runs LLM analysis on each remaining frame, and then generates docs from that information. Audio narration analysis will be added soon too!
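For readers wondering what the dedup-then-analyze stages might look like, here is a minimal sketch. It is not the actual worker: the average-hash comparison, the `max_distance` threshold, and the names `average_hash`, `dedupe_frames`, and `build_docs` are all assumptions made for illustration. Frames are assumed to be already decoded and downscaled to small grayscale thumbnails, and `describe` is a hypothetical stand-in for the per-frame LLM call.

```python
from typing import Callable, Iterable, List, Sequence

# Hypothetical in-memory frame: rows of grayscale pixel values (0-255),
# assumed to be already downscaled to a small thumbnail.
Frame = Sequence[Sequence[int]]


def average_hash(frame: Frame) -> int:
    """Pack one bit per pixel: 1 if the pixel is brighter than the frame mean."""
    pixels = [p for row in frame for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions where the two hashes differ."""
    return bin(a ^ b).count("1")


def dedupe_frames(frames: Iterable[Frame], max_distance: int = 2) -> List[Frame]:
    """Keep a frame only if its hash differs enough from the last kept frame."""
    kept: List[Frame] = []
    last_hash = None
    for frame in frames:
        h = average_hash(frame)
        if last_hash is None or hamming(h, last_hash) > max_distance:
            kept.append(frame)
            last_hash = h
    return kept


def build_docs(frames: Iterable[Frame], describe: Callable[[Frame], str]) -> str:
    """Describe each unique frame (stand-in for the LLM step) and number the notes."""
    notes = [describe(f) for f in dedupe_frames(frames)]
    return "\n".join(f"{i}. {note}" for i, note in enumerate(notes, start=1))
```

Comparing each frame only against the last kept one keeps the pass linear and suits screen recordings, where duplicates tend to be consecutive. Comparing against every kept frame would also catch a screen that reappears later, at a higher cost.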