top | item 44092195

carpo | 9 months ago

I've almost finished the first version of a desktop video library app I've been writing for myself. I had the idea last year, but the cost of sending images to an LLM made it too expensive to run over my roughly 1500 videos. Now it's fairly reasonable.

In the app you pick a folder with videos in it, and it stores the path and metadata, extracts frames as images, uses a local Whisper model to transcribe the audio into subtitles, then sends a selection of the snapshots plus the subtitles to an LLM to be summarised. The LLM sends back an XML document with a bunch of details about the video, including a title, a detailed summary, and information on objects, text, people, animals, locations, distinct moments, etc. Some of these are also timestamped, and most have relationships (i.e. this object belongs to this location, this text was on this object, etc.). I store all that in a local SQLite database, then do another LLM call with the summary asking for categories and tags, and store those in the DB against each video. The app UI is essentially tags you can click to narrow down the returned videos.
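The tag-based narrowing described above could be sketched with a simple many-to-many schema. This is a hypothetical illustration, assuming Python and SQLite; the table and column names are my assumptions, not the author's actual schema:

```python
import sqlite3

# Assumed schema: videos, tags, and a join table linking them.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE videos (id INTEGER PRIMARY KEY, path TEXT, title TEXT, summary TEXT);
CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE video_tags (video_id INTEGER, tag_id INTEGER,
                         PRIMARY KEY (video_id, tag_id));
""")

def tag_video(video_id, tag_name):
    # Create the tag if it doesn't exist yet, then link it to the video.
    conn.execute("INSERT OR IGNORE INTO tags (name) VALUES (?)", (tag_name,))
    (tag_id,) = conn.execute("SELECT id FROM tags WHERE name = ?",
                             (tag_name,)).fetchone()
    conn.execute("INSERT OR IGNORE INTO video_tags VALUES (?, ?)",
                 (video_id, tag_id))

def videos_with_tags(tag_names):
    """Narrow results to videos carrying every tag in tag_names."""
    placeholders = ",".join("?" * len(tag_names))
    rows = conn.execute(f"""
        SELECT v.title FROM videos v
        JOIN video_tags vt ON vt.video_id = v.id
        JOIN tags t ON t.id = vt.tag_id
        WHERE t.name IN ({placeholders})
        GROUP BY v.id HAVING COUNT(DISTINCT t.name) = ?
    """, (*tag_names, len(tag_names))).fetchall()
    return [r[0] for r in rows]

conn.execute("INSERT INTO videos VALUES (1, '/videos/a.mp4', 'Beach trip', '...')")
tag_video(1, "beach")
tag_video(1, "family")
print(videos_with_tags(["beach", "family"]))  # ['Beach trip']
```

Clicking a tag in the UI would just add another name to the list passed to a query like this; the `HAVING COUNT` clause is what makes multiple selected tags narrow rather than widen the results.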

I plan on adding a natural language search (Maybe RAG -- need to look into the latest best way), have half added Projects so I can group videos after finding the ones I want, and have a bunch of other ideas for this too. I've been programming this with some early help from Aider and Claude Sonnet. It's getting a bit complex now, so I do the majority of code changes, though the AI has done a fair bit. It's been heaps of fun, and I'm using it now in "production" (haha - on my PC)

willlma|9 months ago

This isn't entirely on-topic, but I've been trying to understand why AI video editing isn't more common, and thought you might know. I've had an idea for a while to make tennis match highlight videos that show every single point of the match. Tennis has a lot of downtime between points (and even more between games and sets). I just want to tell an LLM: here's a two-hour-long video of a tennis match; strip out all the gaps between points. I'm guessing this would require a very expensive frame-by-frame analysis of the video right now, and that's why it's not done. Is that right, or are there other reasons?

carpo|9 months ago

Yeah, it's probably too expensive and complicated to send all the frames to an LLM. I only send about 30 images at 576x324, which is around 15,000 tokens a video and comes to about $0.045 per video. First I save one frame every second, then loop through them comparing each to the last to find the differences, and send just the screenshots that have changed significantly, up to a max of 30. Claude only allows 100 images per API call, so it would be a bit fiddly and costly to handle 7000 frames.
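The frame-selection loop above could look something like this. A minimal sketch, assuming grayscale pixel lists and a mean-absolute-difference threshold (the threshold and comparison method are my assumptions):

```python
def select_changed_frames(frames, threshold=30.0, max_frames=30):
    """frames: equal-length grayscale pixel lists, one per second of video.
    Returns indices of frames that differ noticeably from the last kept frame."""
    if not frames:
        return []
    kept = [0]          # always keep the first frame as the baseline
    last = frames[0]
    for i, frame in enumerate(frames[1:], start=1):
        if len(kept) >= max_frames:
            break       # cap the selection, as described above
        # mean absolute pixel difference against the last kept frame
        diff = sum(abs(a - b) for a, b in zip(frame, last)) / len(frame)
        if diff >= threshold:
            kept.append(i)
            last = frame
    return kept

static = [0] * 16    # toy 4x4 "frames"
bright = [255] * 16
print(select_changed_frames([static, static, bright, bright, static]))  # [0, 2, 4]
```

Comparing against the last *kept* frame (rather than the immediately previous one) means a slow fade still eventually registers as a change.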

Though, now that I'm thinking about it, you could probably do this locally: just look at the part of the image that has the current score, do some local OCR on it each frame to check whether the score has changed, and if it has, store the timestamp, then use ffmpeg to extract the correct parts. Probably wouldn't need an LLM at all.
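That no-LLM approach could be sketched like this. Here I assume some local OCR has already read the scoreboard region once per second; `ocr_scores` is a stand-in for that output, and the ffmpeg flags shown are just the standard trim options:

```python
def change_points(ocr_scores):
    """ocr_scores: one OCR'd score string per second of video.
    Returns the second-offsets at which the on-screen score changed."""
    return [t for t in range(1, len(ocr_scores))
            if ocr_scores[t] != ocr_scores[t - 1]]

def ffmpeg_cut_cmd(src, start, end, out):
    # -ss/-to trims the clip; -c copy avoids re-encoding
    # (cuts land on keyframes, which is usually fine for highlights).
    return ["ffmpeg", "-ss", str(start), "-to", str(end),
            "-i", src, "-c", "copy", out]

scores = ["0-0", "0-0", "0-0", "15-0", "15-0", "30-0"]
print(change_points(scores))  # [3, 5]
print(" ".join(ffmpeg_cut_cmd("match.mp4", 0, 3, "point1.mp4")))
```

Each score change marks the end of a point, so the segments worth keeping are the stretches leading up to each change point.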

As for editing, one thing I do in my videos is audio keywords so my app can do specific things. For example, I can say "AI, mark what I just said as important." Then when it transcribes the audio and the LLM processes it, it will mark that part as a Distinct Moment with a start and end timestamp, a title and description that will show in my app as a clickable link to that part of the video. I'm thinking of adding more commands for more complex editing too.
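Spotting those spoken commands in the transcript could work roughly like this. A hypothetical sketch: the trigger phrase, segment shape, and lookback window are all my assumptions, not the app's actual format:

```python
TRIGGER = "ai, mark what i just said as important"

def find_marked_moments(segments, lookback=10.0):
    """segments: list of (start_sec, end_sec, text) tuples from the transcriber.
    Returns (start, end) windows covering the speech just before each trigger."""
    moments = []
    for start, end, text in segments:
        if TRIGGER in text.lower():
            # Mark the stretch of audio leading up to the command itself.
            moments.append((max(0.0, start - lookback), start))
    return moments

segments = [
    (0.0, 4.0, "Here is the part worth keeping."),
    (4.5, 7.0, "AI, mark what I just said as important."),
]
print(find_marked_moments(segments))  # [(0.0, 4.5)]
```

In practice the LLM pass described above can do this more flexibly (it can infer where "what I just said" actually started), but a plain string match is a cheap first pass.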

DANmode|9 months ago

I would do it based on the sound of the ball hitting in the audio track during service. Way cheaper ;)
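A minimal sketch of that audio-only idea: scan the audio track for sharp transients (the thwack of the serve) by looking for a loud sample right after quiet ones. The threshold and sample rate here are toy assumptions; real audio would need RMS windows and a much higher rate:

```python
def find_hits(samples, rate=10, threshold=0.5):
    """samples: mono amplitudes in [-1, 1]; rate: samples per second.
    Returns second-offsets where a loud transient follows quiet audio."""
    hits = []
    for i in range(1, len(samples)):
        # A hit is a jump from below the threshold to above it.
        if abs(samples[i]) >= threshold and abs(samples[i - 1]) < threshold:
            hits.append(i / rate)
    return hits

quiet, thwack = [0.05] * 10, [0.9]
audio = quiet + thwack + quiet + thwack
print(find_hits(audio))  # [1.0, 2.1]
```

Each detected hit during service would mark the start of a point, and the gap before the next cluster of hits marks the downtime to cut.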

TeamDman|9 months ago

Neat! Have a repo?

carpo|9 months ago

Thanks! I have been thinking about opening it up, but I'm not sure, as I've never done any open source stuff and don't know how useful others would find it - there are some clunky bits!

I also have the whole Aider/Claude prompt history in the repo, as I started this on a platform and framework I'd never used and relied on the AI to scaffold much of the app at the beginning. I thought it might be useful to go back and see what worked best when programming with AI.