eevmanu | 4 months ago

Looks awesome!

Is there anything you can share about the architecture or pipeline you used for it? A high-level overview would be enough.

I’m guessing you’re doing video-to-image, image-to-text, and then text-to-docs, right? Since not all of the models you mentioned are multimodal.

alexattt | 4 months ago

Thanks! :)

More or less. I have a Python worker that does the video processing job: it splits the video into frames, deduplicates the frames, runs LLM analysis on each frame, and then generates docs from that information. Audio narration analysis will be added soon too!
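
For anyone curious what that flow could look like, here is a minimal sketch, not the actual worker. Everything in it is an assumption made for illustration: frames are taken to be already decoded into flat lists of grayscale pixel values, deduplication uses a simple average hash compared against the last kept frame, and `describe` is a hypothetical stand-in for whatever LLM call analyzes a frame.

```python
from typing import Callable, Iterable, List, Sequence

# Hypothetical frame type: a flattened grayscale image, pixel values 0-255.
Frame = Sequence[int]


def average_hash(frame: Frame) -> int:
    """Pack one bit per pixel: 1 if the pixel is brighter than the frame mean."""
    mean = sum(frame) / len(frame)
    bits = 0
    for pixel in frame:
        bits = (bits << 1) | (1 if pixel > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def dedupe_frames(frames: Iterable[Frame], max_distance: int = 2) -> List[Frame]:
    """Keep a frame only if its hash differs enough from the last kept frame."""
    kept: List[Frame] = []
    last_hash = None
    for frame in frames:
        h = average_hash(frame)
        if last_hash is None or hamming(h, last_hash) > max_distance:
            kept.append(frame)
            last_hash = h
    return kept


def generate_docs(frames: Iterable[Frame], describe: Callable[[Frame], str]) -> str:
    """Dedupe frames, describe each unique one, and join into numbered steps."""
    unique = dedupe_frames(frames)
    steps = [describe(frame) for frame in unique]  # assumed: one LLM call per frame
    return "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))


if __name__ == "__main__":
    a = [0, 0, 255, 255] * 4   # 16-pixel toy frame
    b = a[:-1] + [250]         # near-duplicate of a
    c = [255, 255, 0, 0] * 4   # visually different frame
    # Stub in place of a real model call.
    print(generate_docs([a, b, c], lambda f: f"screen with {len(f)} pixels"))
```

Comparing only against the last kept frame keeps the pass linear and suits screen recordings, where duplicates tend to be consecutive. A real worker would presumably decode frames with a video library and send images, not pixel lists, to the model.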