nvdnadj92 | 5 months ago
For preprocessing, I found it best to convert files to 16kHz WAV for optimal processing. I also apply low-pass and high-pass filters to remove non-speech sounds. To avoid hallucinations, I run Silero VAD on the entire audio file to find the timestamps where there's a speaker. A side note on this: Silero requires careful tuning to prevent audio segments from being chopped up and clipped. I also use a post-processing step to merge adjacent VAD chunks, which helps keep the segments fed to Whisper cohesive.
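The VAD chunk-merging step can be sketched as a small pure function; the gap and padding thresholds below are illustrative, not the exact values I use:

```python
def merge_vad_chunks(chunks, max_gap=0.5, pad=0.25):
    """Merge adjacent VAD speech segments separated by short silences.

    chunks: list of (start, end) times in seconds, sorted by start.
    max_gap: merge two segments if the silence between them is <= this.
    pad: seconds of padding added around each merged segment, so word
         onsets/offsets near the VAD boundary don't get clipped.
    """
    merged = []
    for start, end in chunks:
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous segment: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    # Pad each merged segment, clamping the start at 0.
    return [(max(0.0, s - pad), e + pad) for s, e in merged]
```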
For the Whisper task, I run Whisper on small audio chunks that correspond to the VAD timestamps. Otherwise, it will hallucinate during silences and regurgitate the passed-in prompt. If you're on a Mac, use the whisper-mlx models from Hugging Face to speed up transcription. I ran a performance benchmark, and using a model designed for the Apple Neural Engine made a 22x difference.
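The important detail when transcribing per-chunk is that Whisper returns timestamps relative to the chunk, so they have to be shifted back into global time. A minimal sketch, with `transcribe` standing in for whatever backend you use (openai-whisper, mlx-whisper, etc. — the callable and its return shape here are assumptions for illustration):

```python
def transcribe_chunks(audio, segments, transcribe, sr=16000):
    """Transcribe each VAD segment and shift timestamps to global time.

    audio: 1-D sequence of samples at sample rate `sr`.
    segments: list of (start, end) times in seconds from VAD.
    transcribe: callable taking a sample slice and returning
        [{"start": s, "end": e, "text": t}, ...] relative to that slice.
    """
    results = []
    for seg_start, seg_end in segments:
        clip = audio[int(seg_start * sr):int(seg_end * sr)]
        for piece in transcribe(clip):
            results.append({
                # Offset chunk-relative times by the segment start.
                "start": piece["start"] + seg_start,
                "end": piece["end"] + seg_start,
                "text": piece["text"],
            })
    return results
```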
For post-processing, I've found that running the generated SRT files through ChatGPT to identify and remove hallucinated chunks yields better results.
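Once the LLM has flagged which subtitle blocks look like hallucinations (the flagging mechanism itself isn't shown here — assume you get back a set of block numbers), removing them from the SRT is mechanical. A sketch:

```python
def drop_srt_blocks(srt_text, bad_indices):
    """Remove flagged subtitle blocks from an SRT file and renumber the rest.

    bad_indices: set of block numbers (as they appear in the SRT file)
    flagged as hallucinations, e.g. by an LLM reviewing the transcript.
    """
    # SRT blocks are separated by blank lines.
    blocks = [b for b in srt_text.strip().split("\n\n") if b.strip()]
    kept = []
    for block in blocks:
        lines = block.splitlines()
        if int(lines[0]) in bad_indices:
            continue
        lines[0] = str(len(kept) + 1)  # renumber sequentially
        kept.append("\n".join(lines))
    return "\n\n".join(kept) + "\n"
```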
eevmanu | 5 months ago
I think the latest version of ffmpeg can use Whisper with VAD[1], but I still need to explore it with a simple PoC script.
I'd love to know more about the post-processing prompt; my guess is that it looks like an improved version of the `semantic correction` prompt[2], but I may be wrong ¯\_(ツ)_/¯ .
[1] https://ffmpeg.org/ffmpeg-filters.html#toc-whisper-1
[2] https://gist.github.com/eevmanu/0de2d449144e9cd40a563170b459...
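For reference, an untested sketch of what invoking the filter from [1] might look like. The option names are taken from the filter documentation linked above, and the model paths are placeholders; this needs an ffmpeg build with whisper support:

```shell
# Hypothetical invocation of ffmpeg's whisper audio filter with Silero VAD.
# ggml-base.en.bin and silero-v5.1.2.bin are placeholder local model paths.
ffmpeg -i input.wav -vn -sn -dn \
  -af "whisper=model=ggml-base.en.bin:language=en:\
destination=output.srt:format=srt:\
vad_model=silero-v5.1.2.bin:vad_threshold=0.5" \
  -f null -
```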