Whisper.api: Open-source, self-hosted speech-to-text with fast transcription

[+] nchudleigh|2 years ago|reply

This is awesome.

For anyone confused about the project, it uses whisper.cpp, a C/C++ port of (and runner for) the open Whisper model from OpenAI. It is built by the team behind GGML and llama.cpp. https://github.com/ggerganov

You can fork this code, run it on your own server, and hit the API. The server itself will use FFmpeg to convert the audio file into the required format and run the C translation of the whisper model against the file.
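As a rough sketch of that two-step pipeline (not the project's actual code): the first command resamples the upload to the 16 kHz, mono, 16-bit WAV that whisper.cpp expects, and the second runs the whisper.cpp CLI on the result. The binary path `./main`, the model path `models/ggml-base.en.bin`, and the function names are assumptions for illustration.

```python
import subprocess


def build_ffmpeg_cmd(src: str, dst: str) -> list[str]:
    """Build an FFmpeg command that converts src to 16 kHz mono 16-bit PCM WAV."""
    return [
        "ffmpeg", "-y",       # overwrite dst if it already exists
        "-i", src,            # input file in any format FFmpeg can decode
        "-ar", "16000",       # resample to 16 kHz
        "-ac", "1",           # downmix to a single channel
        "-c:a", "pcm_s16le",  # 16-bit little-endian PCM
        dst,
    ]


def build_whisper_cmd(
    wav: str,
    model: str = "models/ggml-base.en.bin",  # assumed model location
    binary: str = "./main",                  # assumed whisper.cpp CLI path
) -> list[str]:
    """Build the whisper.cpp CLI invocation for an already-converted WAV file."""
    return [binary, "-m", model, "-f", wav]


def transcribe(src: str) -> str:
    """Convert src with FFmpeg, run whisper.cpp on it, and return the CLI's stdout."""
    wav = f"{src}.16k.wav"
    subprocess.run(build_ffmpeg_cmd(src, wav), check=True)
    result = subprocess.run(
        build_whisper_cmd(wav), check=True, capture_output=True, text=True
    )
    return result.stdout
```

Calling `transcribe("talk.mp3")` would need both `ffmpeg` and a compiled whisper.cpp binary to be present on the machine.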

By doing this you can avoid paying the fee that OpenAI charges for their Whisper service and fully own your transcriptions. The models that the author has supplied here are rather small but should run decently on a CPU. If you want to go to larger model sizes, you would likely need to change the compilation options and use a server with a GPU.

Similar to this project, my product https://superwhisper.com is using these whisper.cpp models to provide really good Dictation on macOS.

It runs really fast on the M series chips. Most of this message was dictated using superwhisper.

Congrats to the author of this project. Seems like a useful implementation of the whisper.cpp project.

I wonder if they would accept it upstream in the examples.

[+] mikeravkine|2 years ago|reply

One caveat here is that whisper.cpp does not offer any CUDA support at all; acceleration is only available for Apple Silicon.

If you have Nvidia hardware, the CTranslate2-based faster-whisper is very, very fast: https://github.com/guillaumekln/faster-whisper

[+] innovatorved|2 years ago|reply

Many of you are asking whether the project is completely self-hosted and independent of third-party services. Yes, it is. The user account exists only for authentication, so that no one can use the service without authenticating.

[+] mkl|2 years ago|reply

Getting an authentication token does rely on a third-party service, if the README instructions are correct. It requires sending an email address to that third party.

[+] awwaiid|2 years ago|reply

Maybe the auth token example is meant to also hit localhost?

[+] Animats|2 years ago|reply

Huh?

"This project provides an API with user level access support to transcribe speech to text using a finetuned and processed Whisper ASR model."

Why is this a service at all? Why not just a library? Or a subprocess?

[+] innovatorved|2 years ago|reply

Whisper API - Speech to Text Transcription

This open source project provides a self-hostable API for speech to text transcription using a finetuned Whisper ASR model. The API allows you to easily convert audio files to text through HTTP requests. Ideal for adding speech recognition capabilities to your applications.

Key features:

- Uses a finetuned Whisper model for accurate speech recognition
- Simple HTTP API for audio file transcription
- User level access with API keys for managing usage
- Self-hostable code for your own speech transcription service
- Quantized model optimization for fast and efficient inference
- Open source implementation for customization and transparency

[+] brianjking|2 years ago|reply

What was the fine tune?

How does this compare to what is possible using https://goodsnooze.gumroad.com/l/macwhisper for example?

Thanks!

[+] rrsp|2 years ago|reply

Are you able to provide more information on the fine tuning? Any improvement in WER and what language it was fine tuned in and the size of the dataset used?

[+] atajwala|2 years ago|reply

Any plans to add phrase timestamps, channel separation and other equivalent ASR features to make this API more approachable?

[+] stavros|2 years ago|reply

This looks great, does recognition use the GPU? What's the speed you get on it?

[+] ChrisArchitect|2 years ago|reply

Not to be confused with

Whisper – open source speech recognition by OpenAI https://news.ycombinator.com/item?id=34985848

[+] innovatorved|2 years ago|reply

https://openai.com/research/whisper

[+] 3abiton|2 years ago|reply

I thought that was the same. I still don't see the difference.

[+] edgarvaldes|2 years ago|reply

Related to whisper: whisperX is a godsend. I can finally watch old or uncommon TV series with subtitles.

[+] jcims|2 years ago|reply

Oh dang, diarization? How well does it work?

[+] pizzafeelsright|2 years ago|reply

This is not fully self-hosted so much as middleware, no?

[+] innovatorved|2 years ago|reply

It is completely self-hosted, but it currently supports only the tiny and base models. You can soon expect support for large models. For any requests, you can create an issue.

[+] unknown|2 years ago|reply

[deleted]

[+] geekodour|2 years ago|reply

Nice! This will be very useful for me. I think I can run this locally and spin up a basic Telegram bot around it for personal use.

One issue I faced with all the whisper-based transcript generators is that there seems to be no good way to edit or correct the generated text while keeping word-level timestamps. I created a small web-based tool[0] for that.

If anyone is looking to edit transcripts generated using whisper, you'd probably find it useful.

[0] https://github.com/geekodour/wscribe-editor

[+] LeoPanthera|2 years ago|reply

So is "real time" translation a thing yet? I've long wanted to be able to watch non-English television and have the audio translated into English subtitles. It's doable for pre-recorded things, but not for live.

An iPhone app that could do this from the microphone would also be amazing. Google Translate and its various competitors from Microsoft/Apple are nearly there, but they all stop listening in between sentences. Something that just listened constantly, printing translated text onto the screen, would be amazing.

[+] innovatorved|2 years ago|reply

Just wait for a couple of weeks. I am working on speech-to-speech translation. Instead of subtitles, you can listen to it directly. I am also working on subtitles.

[+] DarthNebo|2 years ago|reply

For long running stuff https://developer.apple.com/tutorials/app-dev-training/trans... should be straightforward to translate as well using ported on-device BERT models

[+] unknown|2 years ago|reply

[deleted]

[+] videogreg93|2 years ago|reply

I've been using the Microsoft Speech API for an app and so far it's been surprisingly good for realtime speech to text.

[+] distantsounds|2 years ago|reply

how is this open source, or self-hosted, when it requires an API key and a login from a third party?

[+] innovatorved|2 years ago|reply

No, it is not a third party. It is just a PostgreSQL database for logging everything. You can simply visit the /docs endpoint. It is just for authentication so that you can work with different users. Once again, it's completely self-hosted.

[+] v7n|2 years ago|reply

Many live streamers, and platforms, would love to have custom real-time transcription elements. I actually looked into this exact project of yours when I thought about creating such a thing.

Even if it meant delaying the broadcast for a second while transcribing, the accessibility value could be immense.

[+] Dig1t|2 years ago|reply

>Get Your token

If it's completely self-hosted why do I need to get a token? Where does the actual model run?

[+] innovatorved|2 years ago|reply

getToken is just an authentication layer for authenticating your request. If you want to self-host it, just clone the repo and please check the .env.example file.

[+] pdntspa|2 years ago|reply

So Whisper is all the rage with speech-to-text, but what about text-to-speech?

[+] joy_void_joy|2 years ago|reply

[deleted]

[+] grzes|2 years ago|reply

I don't understand the excitement here. It's just an HTTP wrapper for a CLI command. You can build it easily on your own with any decent RAD framework.
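To make "an HTTP wrapper for a CLI command" concrete, here is a minimal sketch using only Python's standard library. It is not how Whisper.api is implemented: the `CLI_CMD` value, the raw-body upload, the port, and the helper names are all assumptions, and it has no authentication or format conversion. It also assumes a POSIX system, where the temp file can be read by the child process while still open.

```python
import subprocess
import tempfile
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical CLI invocation; the uploaded file's path is appended as the last argument.
CLI_CMD = ["./main", "-m", "models/ggml-base.en.bin", "-f"]


def run_cli(audio: bytes, cmd=None, runner=subprocess.run) -> str:
    """Write the upload to a temp file, run the CLI on it, and return its stdout."""
    cmd = CLI_CMD if cmd is None else cmd
    with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
        tmp.write(audio)
        tmp.flush()  # make sure the child process sees the full file
        result = runner([*cmd, tmp.name], check=True, capture_output=True, text=True)
    return result.stdout


class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Treat the raw request body as the audio file.
        length = int(self.headers.get("Content-Length", 0))
        try:
            text, status = run_cli(self.rfile.read(length)), 200
        except (OSError, subprocess.CalledProcessError) as exc:
            text, status = str(exc), 500
        body = text.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host: str = "127.0.0.1", port: int = 8000) -> None:
    """Start the blocking HTTP server."""
    HTTPServer((host, port), Handler).serve_forever()
```

After calling `serve()`, something like `curl --data-binary @audio.wav http://127.0.0.1:8000/` would return whatever the CLI printed. Per-user tokens, logging, and audio conversion are what a project like this adds on top of such a skeleton.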

50 comments