You can do the same thing with Firefox's Reader Mode. On Linux you have to set up speech-dispatcher to use your favorite TTS as a backend. Once it is set up, there will be an option to listen to the page.
Oh this is sweet, thanks for sharing! I've been a huge fan of Kokoro and even set up my own fully-local voice assistant [1]. Will definitely give Pocket TTS a go!
macOS already has some great built-in TTS capability, as the OS seems to include a natural-sounding voice. I recently built a similar tool that just runs the "say" command as a background process. Had to wrap it in a Deno server. It works, but with Tahoe it's difficult to consistently configure it to use that one natural voice and not the subpar voices downloadable in the settings. The good voice seems to be hidden somehow.
It says MIT license, but then the README has a separate section on prohibited use that maybe adds restrictions that make it nonfree? Not sure of the legal implications here.
For reference, the MIT license contains this text: "Permission is hereby granted... to deal in the Software without restriction, including without limitation the rights to use". So the README containing a "Prohibited Use" section definitely creates a conflicting statement.
Tried to use voice cloning, but in order to download the model weights I have to create a Hugging Face account, connect it on the command line, give them my contact information, and agree to their conditions. The open source part is just the client and chunking logic, which is pretty minimal.
From my understanding, the code is MIT, but the model isn't? What constitutes "Software" anyway? Aren't resources like images, sounds and the like exempt from it (hence covered by usual copyright unless separately licensed)? If so, in the same vein, an ML model is not part of the "Software". By the way, the same prohibition is repeated on the Hugging Face model card.
I'm psyched to see so much interest in my post about Kyutai's latest model! I'm working on a related team in Paris that's building on Kyutai's research to provide enterprise-grade voice solutions. If anyone is building in this space, I'd love to chat and share some of our upcoming models and capabilities, which I am told are SOTA. Please don't hesitate to ping me via the address in my profile.
Woah, I'm impressed! The voice cloning also worked much better than expected! Will there be separate models for other languages? I know the National Library in Norway has done a good job curating speech datasets with many different dialects [1][2].
So, on my M1 Mac, I did `uvx pocket-tts serve`. Plugged in

> It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way—in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only

(the beginning of A Tale of Two Cities), but the problem is Javert skips over parts of sentences! E.g., it starts:

> "It was the best of times, it was the worst of times, it was the age of wisdom, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the spring of hope, it was the winter of despair, we had everything before us, ..."

Notice how it skips over "it was the age of foolishness," and "it was the season of Darkness,".

Which... doesn't exactly inspire faith in a TTS system.
All the models I tried have similar problems. When batching a whole audiobook, the only reliable approach is to run the TTS, then run a transcription model over the audio and check that you get the same text back.
Václav from Kyutai here. Thanks for the bug report! A workaround for now is to chunk the text into smaller parts where the model is more reliable. We already do some chunking in the Python package. There is also a fancier way to do this chunking that ensures the stitched-together parts flow well (teacher-forcing), but we haven't implemented that yet.
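For readers curious what such chunking might look like, here is a simplified sketch: split on sentence boundaries with a naive regex, then greedily pack sentences into bounded chunks. This is an illustration only, not the actual logic in the Python package.

```python
import re

def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Split on sentence boundaries, then greedily pack sentences
    into chunks of at most max_chars characters. A single sentence
    longer than max_chars becomes its own (oversized) chunk."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Start a new chunk when appending would exceed the budget.
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

text = "First sentence here. Second sentence is a bit longer. Third one. A fourth sentence to finish."
for chunk in chunk_text(text, max_chars=50):
    print(chunk)
```

Each chunk is then synthesized separately and the audio concatenated; the teacher-forcing idea mentioned above would instead condition each chunk's generation on the tail of the previous one so the prosody doesn't reset at every seam.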
I love that everyone is making their own TTS model, as they are not as expensive to train as many other kinds of models. Also, there are plenty of different architectures.
Is there any TTS engine that doesn't need cloning and has some sort of parameters one can specify?
Like what if I want to graft TTS onto an existing text chat system and give each person a unique, randomly generated voice? Or want to try to get something that's not quite human, like some sort of alien or monster?
You could use an old-school formant synthesizer that lets you tune the parameters, like espeak or DECtalk. espeak apparently has a Klatt mode which might sound better than the default, but I haven't tried it.
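For the unique-voice-per-user idea, one low-tech sketch is to hash each username into synthesizer settings. The espeak flags referenced below (`-p` for pitch 0-99, `-s` for speed in words per minute) are real; the specific ranges and the `variant` field are arbitrary choices for illustration.

```python
import hashlib

def voice_for_user(username: str) -> dict:
    """Hash a username into a stable set of synthesizer parameters,
    so every chat participant keeps the same distinct voice across
    sessions without any stored mapping."""
    digest = hashlib.sha256(username.encode("utf-8")).digest()
    return {
        "pitch": 20 + digest[0] % 60,   # espeak -p takes 0-99
        "speed": 130 + digest[1] % 60,  # espeak -s, words per minute
        "variant": digest[2] % 5,       # index into available voice variants
    }

# Deterministic: the same name always yields the same voice.
print(voice_for_user("alice") == voice_for_user("alice"))  # True
print(voice_for_user("alice"), voice_for_user("bob"))
```

Pushing pitch and speed outside the normal ranges is also a cheap way to get the "not quite human" alien/monster effect.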
Perhaps I haven't been talking to voice models that much, or the ChatGPT voice always felt weird and off because I was thinking it goes to a cloud server and everything. But through Pocket TTS I discovered unmute.sh, which is open source and, I think, from the same company as Pocket TTS, and which I think can use Pocket TTS as well.
I saw some agentic models at 4B or so which can punch above their weight, or even some basic models. I can definitely see them in the context of a home lab without costing too much money.
I think at least unmute.sh is similar to / competes with ChatGPT's voice model. It's crazy how good (and effective) open source models are from top to bottom. There's basically something for almost everyone.
I feel like the only true moat might exist in coding models. Some are pretty good, but it's the only segment where people might pay 10x-20x more for the best (Claude Code vs minimax/z.ai subscription fees).
It will be interesting to see if we get another DeepSeek moment in AI, with something that beats Claude Sonnet or similar. I think DeepSeek has DeepSeek 4, so it will be interesting to see how/if it can beat Sonnet.
I am also quite irritated by the fact that many TTS models fail to state which languages (and probably even dialects) they support. To support a really good workflow for many Europeans (and probably also the rest of the world), one would actually need a multi-language model that also supports the use of foreign words within one's own language. I am using a local notification reader on my smartphone (with SherpaTTS), and the mix of notification languages, as well as languages embedded in each other, makes the experience rather funny at times.
I echo this. For a TTS system to be in any way useful outside the tiny population of the world that speaks exclusively English, it must be multilingual and dynamically switch between languages pretty much per word.
It's impressive, but it's a shame that it's 2026 and, despite remarkably lifelike speech, so many models fall down on common issues like heteronyms ("the couple had a row because they couldn't agree where to row their boat"), realistic number handling, and so on.
The speed of improvement of TTS models reminds me of the early days of Stable Diffusion. Can't wait until I can generate audiobooks without infinite pain. If I were an investor, I'd short Audible.
It's not perfect, but I already have a setup for doing this on my phone: add SherpaTTS and Librera Reader (both available free on F-Droid).
Set up SherpaTTS as the voice model for your phone (I like the en_GB-jenny_dioco-medium voice option, but there are several to choose from). Add an ebook to Librera Reader and open it. There's an icon with a little person wearing headphones, which lets you send the text continuously to your phone's TTS, using just local processing on the phone. I don't have the latest phone, but mine is able to process it faster than the audio is read, so the audio doesn't stop and start.
The voice isn't totally human-sounding, but it's a lot better than the Microsoft Sam days, and once you get used to it, the roboticness fades into the background and you can just listen to the story. You may get better results with Kokoro (I couldn't get it running on my phone) or similar TTS engines and a more powerful phone.
One thing I like about this setup is that if you want to swap back and forth between audio and text, you can. The reader scrolls automatically as it generates the audio, and you can pause it, read in silence for a while, and later set it going again from a new point.
I feel like TTS is one of the areas that has evolved the least. Small TTS models have been around for 5+ years and they've only gotten incrementally better. Giants like ElevenLabs make good-sounding TTS, but it's not quite human yet, and the improvements get smaller with each iteration.
For example: how much disk space is needed? I started the uvx command and it started to download hundreds of megabytes. How much CPU RAM and how much GPU RAM are necessary? Will an integrated Intel GPU work? Some ARM boards have a dedicated AI processor; are any of those supported?
This is amazing. The audio feels very natural and it's fairly good at handling complex text-to-speech tasks.
I've been working on WithAudio (https://with.audio). Currently it only uses Kokoros. I need to test this a bit more, but I might actually add it to the app. It's too good to be ignored.
It's very impressive!
I mean, it's better than other <200M TTS models I've encountered.
In English, it's perfect, and it's so funny in other languages. It sounds exactly like someone who doesn't actually speak the language but goes for it anyway.
I don't know why, but Fantine is just better than the others in other languages. Javert seems to be the worst.
Try Jean in Spanish: « ¡Es lo suficientemente pequeño como para caber en tu bolsillo! » sounds a lot like someone who doesn't understand the language.
Azelma in French, on the other hand, « C'est suffisamment petit pour tenir dans ta poche. », is very good. I mean, half of the words have a Québécois accent and half a French one, but hey, it's correct French.
Gabriel from Kyutai here. We do support outputting wav to stdout. We don't support reading text from stdin, but that should be easy enough. Feel free to drop a pull request!
Just added it to my Codex plugin that reads a summary of what it finished after each turn, and I am spooked! Runs well on my MacBook, much better than Samantha!
Gradium (https://gradium.ai/), a commercial offshoot of Kyutai (the open source lab), is focusing on emotion (both being able to recognize emotion and understanding what emotion to use depending on context). I don't think any of their existing public models does that yet, but they demoed it pretty impressively at the ai-Pulse conference.
Chatterbox does something like that. For example, if the input is
"so and so," he <verb>
and the verb is not just "said" but "chuckled", or "whispered", or "said shakily", the output is modified accordingly; or if there's an indication that it's a woman speaking, it may pitch up during the quotation. It also tries to guess emotive content from the text itself: if a passage reads angry, it may try to make it sound angry. That's more hit-and-miss, but when it hits, it hits really well. A very common failure case: imagine someone trying to psych themselves up, saying internally "come on, Steve, stand up and keep going"; it'll read it in a deeper voice, as if it were being spoken by a WW2 sergeant to a soldier.
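As a toy illustration of the kind of cue being picked up here (not Chatterbox's actual mechanism, which is presumably learned rather than rule-based), a dialogue-tag sniffer might look like this; the style names and verb list are made up for the example:

```python
import re

# Hypothetical mapping from dialogue-tag verbs to delivery styles.
STYLES = {"whispered": "whisper", "chuckled": "amused", "shouted": "angry"}

def delivery_style(passage: str) -> str:
    """Look at the verb following a closing straight quote
    ("...," he <verb>) and pick a delivery style; default to
    neutral narration when no known tag is found."""
    match = re.search(r'"\s*(?:he|she|they)\s+(\w+)', passage)
    if not match:
        return "neutral"
    return STYLES.get(match.group(1), "neutral")

print(delivery_style('"Come closer," he whispered.'))  # whisper
print(delivery_style('"Hello there," she said.'))  # neutral
```

The internal-monologue failure described above is exactly where such surface cues run out: nothing in the text marks "come on, Steve, stand up" as self-talk rather than a barked order.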
But there seems to be a bug, maybe? Just for fun, I asked it to read the Real Slim Shady lyrics. It always seems to add one extra "please stand up" in the chorus. Anyone else see that?
Hello, Gabriel from Kyutai here. Maybe it's related to the way we chunk the text? Can you post an issue on GitHub with the exact text and voice? I'll take a look.
Václav from Kyutai here. Yes, the original naming scheme was from Les Misérables, glad you noticed! We just stuck with Alba because that's the real name of the voice actor who provided the voice sample to us (see https://huggingface.co/kyutai/tts-voices); the other ones are either from pre-existing datasets or given anonymously.
>If you want access to the model with voice cloning, go to https://huggingface.co/kyutai/pocket-tts and accept the terms, then make sure you're logged in locally with `uvx hf auth login`
lol
I’ve tried the voice cloning and it works great. I added a 9s clip and it captured the speaker pretty well.
But don’t make the same mistake I did and use an HF token that doesn’t have access to read from repos! The error message said that I had to request access to the repo, but I had already done that, so I couldn’t figure out what was wrong. Turns out my HF token only had access to inference.
derHackerman|1 month ago
https://github.com/lukasmwerner/pocket-reader
armcat|1 month ago
[1] https://github.com/acatovic/ova
gropo|1 month ago
For voice cloning, pocket tts is walled so I can't tell
lukebechtel|1 month ago
Just made it an MCP server so claude can tell me when it's done with something :)
https://github.com/Marviel/speak_when_done
Buttons840|1 month ago
If a license says "you may use this, you are prohibited from using this", and I use it, did I break the license?
rsolva|1 month ago
[1] https://data.norge.no/en/datasets/220ef03e-70e1-3465-a4af-ed...
[2] https://ai.nb.no/datasets/
mgaudet|1 month ago
(Marius seems better; posted https://github.com/kyutai-labs/pocket-tts/issues/38)
sbarre|1 month ago
- "its noisiest superlative insisted on its being received"
Win10 RTX 5070 Ti
small_scombrus|1 month ago
I wonder what's going wrong in there
GaggiX|1 month ago
Another recent example: https://github.com/supertone-inc/supertonic
andai|1 month ago
https://huggingface.co/spaces/Supertone/supertonic-2
coder543|1 month ago
It seems like it is being trained by one person, and it is surprisingly natural for such a small model.
I remember when TTS always meant the most robotic, barely comprehensible voices.
https://www.reddit.com/r/LocalLLaMA/comments/1qcusnt/soprano...
https://huggingface.co/ekwek/Soprano-1.1-80M
NoSalt|1 month ago
Ok, who knows where I can get those high-quality recordings of Majel Barrett's voice that she made before she died?
jiehong|1 month ago
I think they should have added the fact that it's English only in the title at the very least.
phoronixrly|1 month ago
Cool tech demo though!
akx|1 month ago
All too often, new models' codebases are just a dump of code that installs half the universe in dependencies for no reason, etc.
d4rkp4ttern|1 month ago
claude plugin marketplace add pchalasani/claude-code-tools
claude plugin install voice@cctools-plugins
More here: https://github.com/pchalasani/claude-code-tools?tab=readme-o...
febin|1 month ago
https://github.com/jamesfebin/pocket-tts-candle
The port supports:
- Native compilation with zero Python runtime dependency
- Streaming inference
- Metal acceleration for macOS
- Voice cloning (with the mimi feature)
Note: This was vibecoded (AI-assisted), but features were manually tested.
britannio|1 month ago
https://gist.github.com/britannio/481aca8cb81a70e8fd5b7dfa2f...
_ache_|1 month ago
But it doesn't understand Italian.
agentifysh|1 month ago
https://github.com/agentify-sh/speak/
aki237|1 month ago
I just tried some sample verses, sounds natural.
rhdunn|1 month ago
[1] (2016 https://arxiv.org/abs/1609.03499) WaveNet: A Generative Model for Raw Audio
[2] (2017 https://arxiv.org/abs/1711.10433) Parallel WaveNet: Fast High-Fidelity Speech Synthesis
[3] (2021 https://arxiv.org/abs/2106.07889) UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation
[4] (2022 https://arxiv.org/abs/2203.14941) Neural Vocoder is All You Need for Speech Super-resolution