While that approach may seem simpler, this project uses a faster, more optimized model, which improves efficiency and performance.
I was surprised to see there were no ML-related dependencies (neither models nor libraries), so I had a look at the code: the models are downloaded from Hugging Face, and the repo comes with a precompiled whisper.cpp binary to execute them.
I have a question: I have 200-300 hours of audio recordings of interviews. I am using Otter.ai to automate transcription, and for each recording I export a ".vtt" file of the transcript.
What I'd like to do is create a type of ebook of all these transcripts, where if I click on a word, then the corresponding audio will start playing from roughly the same point in time within the interview.
Otter can do this already (if I'm online and logged in to their website), but I don't want to be tied to their website forever. I'd like to have a local copy that can perform similarly. Amazon ebooks can do this as well, I believe, where there is a corresponding verbatim audiobook. However, this project of mine is purely personal. I won't be selling my audio recordings or transcripts.
Any advice? Could software discussed here be helpful in what I'm trying to accomplish?
If you already have a .vtt, this is not a hard exercise to do e.g. entirely in a browser: parse the .vtt (they're simple text), lay out the text as you like with each segment being a clickable element (e.g. a link), and hook that up to seek an `<audio>` element to where you like.
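A minimal sketch of that idea, assuming standard WebVTT timing lines and a local audio file. The function names (`parse_vtt`, `render_html`) and the file name `interview.mp3` are made up for illustration; this is neither something Otter provides nor part of the project discussed here. It turns each cue into a link that seeks an `<audio>` element, so it works per segment rather than per word:

```python
import html
import re

# Matches a cue timing line such as "00:01:02.500 --> 00:01:05.000";
# the hours field is optional in WebVTT.
TIMING = re.compile(
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})\s+-->\s+"
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})"
)


def to_seconds(hours, minutes, seconds, millis):
    """Convert the captured timestamp fields to seconds."""
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_vtt(text):
    """Return a list of (start_seconds, cue_text) tuples."""
    cues = []
    # Blocks (header, notes, cues) are separated by blank lines.
    for block in re.split(r"\n\s*\n", text.replace("\r\n", "\n").strip()):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = TIMING.match(line.strip())
            if match:
                start = to_seconds(*match.groups()[:4])
                body = " ".join(l.strip() for l in lines[i + 1:] if l.strip())
                if body:
                    cues.append((start, body))
                break  # blocks without a timing line are skipped
    return cues


def render_html(cues, audio_src):
    """Build a standalone page: one clickable link per cue, seeking the audio."""
    links = "\n".join(
        f'<a href="#" data-start="{start:.3f}">{html.escape(text)}</a>'
        for start, text in cues
    )
    return f"""<!doctype html>
<meta charset="utf-8">
<audio id="player" controls src="{html.escape(audio_src, quote=True)}"></audio>
<p>
{links}
</p>
<script>
const player = document.getElementById("player");
document.querySelectorAll("a[data-start]").forEach((link) => {{
  link.addEventListener("click", (event) => {{
    event.preventDefault();
    player.currentTime = parseFloat(link.dataset.start);
    player.play();
  }});
}});
</script>
"""
```

Writing the result of `render_html(parse_vtt(vtt_text), "interview.mp3")` to an `.html` file next to the audio gives a page that works offline in a browser. Word-level clicking would need word-level timestamps, which a plain cue-based .vtt may not contain.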
AFAIK Whisper still can't handle multi-language content. If the audio has two languages (different narrators, for example), Whisper transcribes both of them during the first minute or so, and then either entirely skips one of the languages, or translates the foreign language to English, for the rest of the audio.
So, the value proposition of a subtitle-generating wrapper for Whisper would be to have an option to split audio into ~1 minute segments, transcribe them separately, and to somehow accurately join them. And I don’t think this one does such a thing.
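The bookkeeping for that split-and-join idea could look roughly like the sketch below. It only covers the timestamp arithmetic: the actual audio cutting and the per-segment transcription are left out, and the names (`Cue`, `segment_bounds`, `join_segments`) are hypothetical. It also does the naive thing of shifting each segment's cues by the segment's start offset, which does nothing about words cut in half at a boundary:

```python
from dataclasses import dataclass


@dataclass
class Cue:
    start: float  # seconds
    end: float    # seconds
    text: str


def segment_bounds(total_seconds, segment_seconds=60.0):
    """Yield (start, end) pairs covering the audio in consecutive segments."""
    start = 0.0
    while start < total_seconds:
        end = min(start + segment_seconds, total_seconds)
        yield (start, end)
        start = end


def join_segments(segments):
    """Merge per-segment cues into one timeline.

    `segments` is a list of (offset_seconds, cues) pairs, where each cue's
    times are relative to the start of its own segment.
    """
    joined = []
    for offset, cues in sorted(segments, key=lambda item: item[0]):
        for cue in cues:
            joined.append(Cue(cue.start + offset, cue.end + offset, cue.text))
    return joined
```

A more careful version would probably overlap the segments slightly, or cut at silences, and de-duplicate the text around each boundary.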
I could see myself using this. Subtitling things is extremely time-consuming, and there aren't that many tools that will automate it for you. It looks pretty straightforward to use: just two steps to install (if you already have FFmpeg and Python), and then one command to run the script.
Well done!
I wonder how much more a model would learn about subtitles from including audio AND video in training. Sure, the costs would be way bigger (parsing video even deterministically is 1.5 orders of magnitude worse than audio) but it might help with the edge cases where the speech is so unclear even the subtitle scene can't agree.
I'm not a native English speaker and I tend to use the LiveCaption application in Linux when I attend English speaking online meetings. Would love to have the opportunity to have subtitles in my native language (Greek) too while doing so.
I do the same with tech-oriented podcasts. The speech in them is clear, so transcribing them correctly is very easy.
Non-native English speaker here, too.
[+] [-] ipsum2|2 years ago|reply
[0]: https://github.com/openai/whisper/blob/e58f28804528831904c3b...
[+] [-] dicytea|2 years ago|reply
I wonder if the whole thing is just an AI-generated project. The "About Me" section is pretty illuminating (unabridged):
> I'm a Developer i will feel the code then write.
[+] [-] innovatorved|2 years ago|reply
[+] [-] codethief|2 years ago|reply
[+] [-] innovatorved|2 years ago|reply
[+] [-] einpoklum|2 years ago|reply
* What languages are supported? Is there a list?
* What does 'subtitle' do, which 'whisper' doesn't?
* How do I install this system-wide on an apt-based system (in which pip install --system doesn't work)?
[+] [-] socks|2 years ago|reply
[+] [-] vjulian|2 years ago|reply
[+] [-] akx|2 years ago|reply
[+] [-] rainburg|2 years ago|reply
[+] [-] nottorp|2 years ago|reply
[+] [-] extua|2 years ago|reply
[+] [-] innovatorved|2 years ago|reply
[+] [-] callalex|2 years ago|reply
[+] [-] Vaslo|2 years ago|reply
[+] [-] hr2016|2 years ago|reply
[+] [-] btdmaster|2 years ago|reply
[+] [-] innovatorved|2 years ago|reply
[deleted]
[+] [-] benob|2 years ago|reply
It gives pretty good subtitles.
[+] [-] whywhywhywhy|2 years ago|reply
[+] [-] elkos|2 years ago|reply
[+] [-] anthk|2 years ago|reply
[+] [-] einpoklum|2 years ago|reply
https://github.com/innovatorved/subtitle/issues/6
[+] [-] epups|2 years ago|reply
[+] [-] lern_too_spel|2 years ago|reply
There is currently a problem with diarization, but otherwise, it is SOTA.
[+] [-] innovatorved|2 years ago|reply
[+] [-] butz|2 years ago|reply
[+] [-] alberth|2 years ago|reply
I hope Siri does something to improve. Its voice-to-text is still horrible for me.