If you want to try out the voice cloning yourself, you can do that at this Hugging Face demo: https://huggingface.co/spaces/Qwen/Qwen3-TTS - switch to the "Voice Clone" tab, paste in some example text, and use the microphone option to record yourself reading that text - then paste in other text and have it generate a version of that read in your voice.
This is terrifying. With this and z-image-turbo, we've crossed a chasm, and a very deep one. We are currently protected by screens: we can, and should, assume everything behind a screen is fake unless rigorously (and systematically, i.e. cryptographically) proven otherwise. We're sleepwalking into this; not enough people know about it.
The HF demo space was overloaded, but I got the demo working locally easily enough. The voice cloning of the 1.7B model captures the tone of the speaker very well, but I found it failed at reproducing the variation in intonation, so it sounds like a monotonous reading of a boring text.
I presume this is due to using the base model, and not the one tuned for more expressiveness.
edit: Or more likely, the demo not exposing the expressiveness controls.
The 1.7B model was much better at ignoring slight background noise in the reference audio compared to the 0.6B model though. The 0.6B would inject some of that into the generated audio, whereas the 1.7B model would not.
Also, without FlashAttention it was dog slow on my 5090, running at 0.3X realtime with just 30% GPU usage. Though I guess that's to be expected. No significant difference in generation speed between the two models.
Overall though, I'm quite impressed. I haven't checked out all the recent TTS models, but a fair number, and this one is certainly one of the better ones in terms of voice cloning quality I've heard.
Remarkable tech that is now accessible to almost anyone. My cloned voice sounded exactly like me. The uses for this will range from good to bad and everywhere in between: a deceased grandmother reading "Goodnight Moon" to grandkids, scamming people, the ability to create podcasts in your own voice from just prompts.
I got some errors trying to run this on my MBP. Claude was able to one-shot a fix.
```
Loaded speech tokenizer from ~/.cache/huggingface/hub/models--Qwen--Qwen3-TTS-12Hz-1.7B-VoiceDesign/snapshots/0e711a1c0aa5aad30654426e0d11f67716c1211e/speech_tokenizer
Fetching 11 files: 100%|| 11/11 [00:00<00:00, 125033.45it/s]
The tokenizer you are loading from '~/.cache/huggingface/hub/models--Qwen--Qwen3-TTS-12Hz-1.7B-VoiceDesign/snapshots/0e711a1c0aa5aad30654426e0d11f67716c1211e' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instr.... This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
```
I cloned my voice and had it generate audio for a paragraph from something I wrote. It definitely kind of sounds like me, but I like it much better than listening to my real voice. Some kind of uncanny peak.
If i am ever in the same city as you, i'll buy you dinner. I poked around during my free time today trying to figure out how to run these models, and here is the estimable Simon Willison just presenting it on a platter.
hopefully i can make this work on windows (or linux, i guess).
Interesting model. I've managed to get the 0.6B param model running on my old 1080, and I can generate 200-character chunks safely without going OOM, so I thought making an audiobook of the Tao Te Ching would be a good test. Unfortunately each snippet varies drastically in quality: sometimes the speaker is clear and coherent, but other times it bursts out laughing or moaning. In a way it feels a bit like magical roulette, never being quite certain of what you're going to get. It does have a bit of charm; when you chain the various snippets together you really don't know what direction it's gonna go.
Using speaker Ryan seems to be the most consistent, I tried speaker Eric and it sounded like someone putting on a fake exaggerated Chinese accent to mock speakers.
If it wasn't for the unpredictable level of emotions from each chunk, I'd say this is easily the highest quality TTS model I've tried.
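Chunking like that is easy to script. Here's a minimal sketch that splits text into chunks of at most 200 characters at sentence boundaries before chaining the snippets; the `synthesize` call in the trailing comment is a hypothetical placeholder for whatever TTS entry point you're using, not part of the Qwen API:

```python
import re

def chunk_text(text: str, limit: int = 200) -> list[str]:
    """Split text into chunks of at most `limit` characters,
    breaking at sentence boundaries where possible. A single
    sentence longer than `limit` is kept as one oversized chunk."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not current:
            current = sentence
        elif len(current) + 1 + len(sentence) <= limit:
            current += " " + sentence
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks

# Each chunk would then be fed to the model separately, e.g.:
# for i, chunk in enumerate(chunk_text(tao_te_ching)):
#     audio = synthesize(chunk)          # hypothetical TTS call
#     audio.save(f"snippet_{i:04d}.wav")
```

Generating per-chunk files also means a bad roll of the "magical roulette" only costs you one snippet to regenerate, not the whole book.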
Have you tried specifying the emotion? There's an option to do so and if it's left empty it wouldn't surprise me if it defaulted to rng instead of bland.
it isn't often that technology gives me chills, but this did it. I've used "AI" TTS tools since 2018 or so, and i thought the stuff from two years ago was about the best we were going to get. I don't know the size of these, i scrolled to the samples. I am going to get the models set up somewhere and test them out.
Now, maybe the results were cherrypicked. i know everyone else who has released one of these cherrypicks which to publish. However, this is the first time i've considered it plausible to use AI TTS to remaster old radioplays and the like, where a section of audio is unintelligible but can be deduced from context, like a tape glitch where someone says "HEY [...]LAR!" and it's an episode of Yours Truly, Johnny Dollar...
I have dozens of hours of audio of like Bob Bailey and people of that era.
Indeed, I have a future project/goal of "restoring" Have Gun - Will Travel radio episodes to listenable quality using tech like this. There are so many lines where sound effects and tape rot and other "bad recording" issues make it very difficult to understand what was said. It will be amazing, but as with all tech the potential for abuse is very real.
Have you tested alternatives? I grabbed Open Code and a Minimax m2.1 subscription, even just the 10usd/mo one to test with.
Result? We designed a spec for a slight variation of a tool I had previously spec'd with Claude - same problem (a process supervisor tool), built from scratch.
Honestly, it worked great. I have played a little further with generating code (this time golang), and again, I am happy.
With a good harness I am getting similar results with GLM 4.7. I am paying for TWO! max accounts and my agents are running 24/7.
I still have a small Claude account to do some code reviews. Opus 4.5 does good reviews but at this point GLM 4.7 usually can do the same code reviews.
If cost is an issue (for me it is, I pay out of pocket) go with GLM 4.7
The Chinese labs distill the SOTA models to boost the performance of theirs. They are a trailer hooked up (with a 3-6 month long chain) to the trucks pushing the technology forwards. I've yet to see a trailer overtake its truck.
China would need an architectural breakthrough to leapfrog American labs given the huge compute disparity.
I could say the same about grok (although given there are better models for my use cases I don't use it). What part of divisive politics are you talking about here?
In my tests this doesn't come close to the years-old coqui/XTTS-v2. It has great voice cloning capabilities and creates rich, emotional speech with low latency. I tried out several local TTS projects over the years, but I'm somewhat confused that nothing seems to be able to match coqui despite the leaps that we see in other areas of AI. Can somebody with more knowledge in this field explain why that might be? Or am I completely missing something?
Amusingly one of their examples (the final Age Control example) is prompted to have American English as an accent, but sounds like an Australian trying to sound American to my ear haha
Has anyone successfully run this on a Mac? The installation instructions appear to assume an NVIDIA GPU (CUDA, FlashAttention), and I’m not sure whether it works with PyTorch’s Metal/MPS backend.
FWIW you can run the demo without FlashAttention using the --no-flash-attn command-line parameter; I do that since I'm on Windows and haven't gotten FlashAttention2 to work.
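For anyone wondering what a flag like that typically does: in transformers-based model loaders it usually just selects which `attn_implementation` string is passed when loading the model. This sketch is an assumption about the wiring, not the demo's actual code; "flash_attention_2" and "sdpa" are the usual transformers attention-backend names, with SDPA (PyTorch's built-in scaled-dot-product attention) as the fallback:

```python
import argparse

def pick_attn_implementation(argv: list[str]) -> str:
    """Map a --no-flash-attn style flag to a transformers-style
    attn_implementation string (assumed wiring, for illustration)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--no-flash-attn", action="store_true")
    args = parser.parse_args(argv)
    # "sdpa" is the usual fallback when flash-attn isn't installed,
    # e.g. on Windows or Macs without CUDA.
    return "sdpa" if args.no_flash_attn else "flash_attention_2"
```

The returned string would then be passed as `attn_implementation=...` to the model-loading call; SDPA is slower than FlashAttention but works everywhere PyTorch does.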
I can't quite figure this out: Can you save a generated voice for reuse later? The mlx-audio I looked at seems to take the text itself in every interface and doesn't expose it as a separate object. (I can dive deeper, but wanted to check if anyone's done it already)
Curious how it compares to last week’s release of Kyutai’s Pocket-TTS [1] which is just 100M params, and excellent in both speed and quality (English only). I use it in my voice plugin [2] for quick voice updates in Claude Code.
Is there any way to take a cloned voice model and plug into Android TTS and/or Windows?
I have a friend with a paralysed larynx who is often using his phone or a small laptop to type in order to communicate. I know he would love it if it was possible to take old recordings of him speaking and use that to give him back "his" voice, at least in some small measure.
Haha something that I want to try out. I have started using voice input more and more instead of typing and now I am on my second app and second TTS model, namely Handy and Parakeet V3.
Parakeet is pretty good, but there are times it struggles. Would be interesting to see how Qwen compares once Handy has it in.
I see a lot of references to `device_map="cuda:0"` but no cuda in the github repo, is the complete stack flash attention plus this python plus the weights file, or does one need vLLM running as well?
I suspect they might be using voice lines from Chinese gacha games in addition to what clearly sound like VTubers, YouTubers, and Chinese TV documentary narrations. Those games all come with clean monaural CN/JP/EN files consistent in content across languages for all regions, for an obvious[1] reason.
Well, if you look at the prompts, they are basically told to sound like that.
And if you ask me, I think these models were trained on tween fiction podcasts. (My kids listen to a lot of these and dramatic over-acting seems to be the industry standard.)
Also, their middle-aged adult with an "American English" accent doesn't sound like any American I've ever met. More like a bad Sean Connery impersonator.
The real value I see is being able to clone a voice and change timbre and characteristics of the voice to be able to quickly generate voice overs, narrations, voice acting, etc. It's superb!
Can anyone please provide directions/links to tools that can be run locally, and that take an audio recording of a voice as an input, and produce an output with the same voice saying the same thing with the same intonations, but with a fixed/changed accent?
This is needed for processing an indie game's voice recordings, where the voice actors weren't native speakers and had some accent.
Honestly, this seems like it could be pretty cool for video games. I always liked Oblivion's 'Radiant AI', this could be a natural progression, give characters motivations, relations with the player and other NPCs and have an LLM spit out background dialogue, then have another model generate the audio.
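That pipeline is basically two model calls glued together by a prompt built from game state. A minimal sketch of the data flow, where the `llm` and `tts` calls are hypothetical placeholders for whichever models you'd plug in:

```python
from dataclasses import dataclass, field

@dataclass
class NPC:
    name: str
    motivation: str
    relations: dict[str, str] = field(default_factory=dict)

def dialogue_prompt(npc: NPC, topic: str) -> str:
    """Build the text prompt an LLM would expand into a spoken line."""
    relations = "; ".join(f"{who}: {how}" for who, how in npc.relations.items())
    return (
        f"You are {npc.name}. Motivation: {npc.motivation}. "
        f"Relations: {relations or 'none'}. "
        f"Say one line of background dialogue about {topic}."
    )

# The full loop would be roughly:
#   line = llm(dialogue_prompt(npc, topic))   # hypothetical LLM call
#   audio = tts(line, voice=npc.name)         # hypothetical TTS call
```

Since background chatter isn't latency-critical, lines could even be generated and synthesized ahead of time and cached per NPC.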
simonw|1 month ago
I shared a recording of audio I generated with that here: https://simonwillison.net/2026/Jan/22/qwen3-tts/
mohsen1|1 month ago
What am I doing wrong?
simonw|1 month ago
Here's the script I'm using: https://github.com/simonw/tools/blob/main/python/q3_tts.py
You can try it with uv (downloads a 4.5GB model on first run) like this:
genewitch|1 month ago
thanks so much.
throwaw12|1 month ago
Although I like the model, I don't like the leadership of that company and how close it is, how divisive they are in terms of politics.
pseudony|1 month ago
Beyond that, Glm4.7 should also be great.
See https://dev.to/kilocode/open-weight-models-are-getting-serio...
It is a recent case study of vibe coding a smaller tool with Kilo Code, comparing output from Minimax m2.1 and Glm4.7.
Honestly, just give it a whirl - no need to send money to companies/nations you disagree with.
TylerLives|1 month ago
What do you mean by this?
d4rkp4ttern|1 month ago
[1] https://github.com/kyutai-labs/pocket-tts
[2] https://github.com/pchalasani/claude-code-tools?tab=readme-o...
daliusd|1 month ago
There are some samples. If you have a GPU you might want to fork and improve this; otherwise it's slow, but usable on CPU as well.
numpad0|1 month ago
1: https://old.reddit.com/r/ZenlessZoneZero/comments/1gqmtl1/th...
rapind|1 month ago
100% I was thinking the same thing.
salzig|1 month ago
Edit: "Cross-lingual Voice Clone" https://qwen.ai/blog?id=qwen3tts-0115#voice-clone