For what it is worth, we work on a tool[0] to index all local videos and images and later allowing query just using natural language. It is based on CLIP which has been trained on image-text pairs, but seems to work great for videos after applying some naive heuristics.
aabbcc1241|2 years ago
[1] https://youglish.com/
warangal|2 years ago
[0] https://github.com/ramanlabs-in/hachi
jimmySixDOF|2 years ago
https://clipbase.xyz/