top | item 44645506

(no title)

ilyakaminsky | 7 months ago

I wouldn't describe it as "unusable" so much as needing to understand its constraints and how to work around them. I built a business on top of Whisper [1] and one of the early key insights was to implement a good voice activity detection (VAD) model in order to reduce Whisper's hallucinations on silence.

[1] https://speechischeap.com

discuss

order

poly2it|7 months ago

How does this make a profit? Whisper should be $0.006 to $0.010 per minute, but you rate less than $0.001? Do you 10x the audio?

ilyakaminsky|7 months ago

Thanks for noticing. It took a lot of effort to optimize the pipeline every step of the way. VAD, inference server, hardware optimization, etc. But nothing that would compromise on quality. The audio is currently transcribed in its original speed. I'll be sure to publish something if I manage to speed it up without incurring any losses to the WER.