I'm amazed to find out just how close we are to the Star Trek voice computer.
I used to use Dragon Dictation to draft my first novel; I had to learn a 'language' to teach the rudimentary engine how to recognize my speech.
And then I discovered [1] and have been using it for some basic speech recognition, amazed at what a local model can do.
But it can't transcribe any text until I finish recording a file; only then does it start working, so feedback comes in very slow, batch-sized latency cycles.
And now you've posted this cool solution which streams audio chunks to a model in a continuous series of small pieces; amazing, just amazing.
Now if only I can figure out how to contribute streaming support to Handy or something similar, local speech-to-text will be a solved problem for me.
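For what the streaming approach looks like in practice, here's a minimal sketch (my own illustration, not code from the linked project): audio is cut into small fixed-size chunks and each chunk is handed to the model as it arrives, instead of transcribing one big recorded file at the end. The `transcribe_chunk` callback is a hypothetical stand-in for a real streaming model call.

```python
CHUNK_MS = 80          # chunk length in milliseconds
SAMPLE_RATE = 16_000   # 16 kHz, typical for speech models
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_MS // 1000  # 1280 samples per chunk

def stream_chunks(audio_samples, chunk_samples=CHUNK_SAMPLES):
    """Yield successive fixed-size chunks from a flat list of samples."""
    for start in range(0, len(audio_samples), chunk_samples):
        yield audio_samples[start:start + chunk_samples]

def transcribe_stream(audio_samples, transcribe_chunk):
    """Feed chunks to a (stand-in) streaming model, collecting partial text.

    In a real app the chunks would come from a live microphone buffer,
    so partial transcripts appear while you are still speaking.
    """
    partials = []
    for chunk in stream_chunks(audio_samples):
        text = transcribe_chunk(chunk)  # model emits text incrementally
        if text:
            partials.append(text)
    return " ".join(partials)
```

The key difference from the batch workflow is that latency is bounded by the chunk length (here 80 ms) plus model compute per chunk, rather than by the length of the whole recording.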
Thank you for sharing! Does your implementation allow running the Nemotron model on Vulkan, the way whisper.cpp does? I'm curious to try other models, but I don't have Nvidia hardware, so my choices are limited.
I’m curious about this too. On my M1 Max MacBook I use the Handy app on macOS with Parakeet V3 and get near-instant transcription. Accuracy is slightly lower than the slower Whisper models', but that drop is immaterial when talking to CLI coding agents, which is where I find the most use for this.
Yeah, I think the multilingual improvements in V3 caused some kind of regression for English. I've noticed large blocks occasionally dropped as well, so I reverted to V2 for my usage. Specifically, nvidia/parakeet-tdt-0.6b-v2 vs nvidia/parakeet-tdt-0.6b-v3.
I didn’t see that, but I do get a lot of stutters (words or syllables repeated 5+ times); I'm not sure if it’s a model problem or a post-processing issue in the Handy app.
Parakeet is really good imo too, and at just 0.6B it can actually run on edge devices. 4B is massive; I don't see Voxtral running in real time on an Orin or fitting on a Hailo. An Orin Nano probably can't even load it at BF16.
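The "can't even load it" claim checks out with rough arithmetic (my own back-of-the-envelope numbers, not from the thread): weight memory is simply parameter count times bytes per parameter, and a Jetson Orin Nano in its common configuration has 8 GB of shared CPU/GPU memory.

```python
# Back-of-the-envelope weight memory for a 4B-parameter model at BF16.
params = 4_000_000_000   # 4B parameters
bytes_per_param = 2      # BF16 = 2 bytes per weight
weights_gb = params * bytes_per_param / 1e9
print(f"weights alone: {weights_gb:.0f} GB")  # prints "weights alone: 8 GB"
# That is before activations or any KV cache, so it would consume
# essentially all of an 8 GB Orin Nano's shared memory by itself.
```

By contrast, a 0.6B model at the same precision needs about 1.2 GB for weights, which is why it fits comfortably on edge hardware.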
m1el|26 days ago
https://huggingface.co/nvidia/nemotron-speech-streaming-en-0...
https://github.com/m1el/nemotron-asr.cpp
https://huggingface.co/m1el/nemotron-speech-streaming-0.6B-g...
Multicomp|26 days ago
[1] https://github.com/cjpais/Handy