Personally I prefer StyleTTS 2, which also has a better license. XTTSv2 does have a streaming mode with pretty low latency, which is nice, but I ran into hallucination issues: it fairly frequently produces nonsense words or inserts extra syllables into words.
As others mentioned, Coqui shut down, so there won't be any updates to XTTS.
They just shared the paper for XTTS, which was accepted to Interspeech and may be the reason this is being posted now: https://arxiv.org/abs/2406.04904
Somewhat unrelated, but given that anyone can vote anonymously, how is the TTS-Arena protecting itself against bots or even rings of humans gaming the system?
NB: Coqui is no longer actively maintained, and I'm not sure what the team is up to now. The open market definitely needs an upgraded TTS offering; ElevenLabs is far ahead at the moment.
Not surprising. When I was researching options for a client, I tried a few companies including ElevenLabs and Play.ht, and each seemed happy to talk to us... except Coqui. I think I even went as far as reporting bugs to them, only to have them aggressively ignore me. I guess they were more of a research team than a business?
Coqui is great, but another fantastic TTS tool I recommend checking out is Piper. The voice quality is great, it's extremely lightweight, and it's fast enough to generate speech in real time:
https://github.com/rhasspy/piper
I don't know anything about the startup/VC world, but does anyone have insight on why this failed? It seemed to be one of the highest profile TTS projects and I thought money was just pouring into AI startups.
I have a pet ML project that I am doing for fun: building a custom transcription and diarization model for a friend's podcast[1]. My initial solution was a straightforward implementation using Whisper medium for transcription and NeMo for diarization, based on [2]. The results are not bad generally, but since my application involves a fixed set of five known speakers, I thought surely I could fine-tune the NeMo (or pyannote) diarizer model on their voices to improve accuracy.
Audio samples are easily obtained from their podcast, but manual data labeling is painful for a hobby activity. Further, from what I understand, the real difficulty in performant diarizer models is not speaker recognition generally, but specifically speaker recognition while there is overlapping speech between multiple speakers. I am not even sure how to best implement a labeling procedure for segments with overlapping speech.
I started to wonder whether I might bootstrap a decent sample by leveraging TTS voice cloning models to simulate the five speakers in dialogues with overlapping speech segments. So I ask HN: is this hopelessly naive, or a potentially useful technique? Also, any other advice?
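A rough sketch of the bootstrapping idea: overlay two synthetic (TTS-cloned) utterances with a controlled amount of overlap and keep the ground-truth speaker segments as labels. All names here are illustrative, and the random arrays stand in for actual cloned audio clips:

```python
import numpy as np

def mix_with_overlap(a, b, sr, overlap_s):
    """Overlay clip b onto the tail of clip a so they overlap by overlap_s
    seconds; return the mixed signal and ground-truth speaker segments."""
    overlap = int(overlap_s * sr)
    start_b = len(a) - overlap          # sample index where speaker B begins
    total = start_b + len(b)
    mixed = np.zeros(total, dtype=np.float32)
    mixed[:len(a)] += a
    mixed[start_b:] += b
    # Segments in seconds: (speaker, start, end) -- the labels you would
    # feed to diarizer fine-tuning (e.g. written out as an RTTM file).
    segments = [("spk_a", 0.0, len(a) / sr),
                ("spk_b", start_b / sr, total / sr)]
    return mixed, segments

sr = 16000
a = np.random.randn(sr * 3).astype(np.float32)   # stand-in for a cloned clip
b = np.random.randn(sr * 2).astype(np.float32)
mixed, segs = mix_with_overlap(a, b, sr, overlap_s=0.5)
```

Sweeping `overlap_s` from zero up to full overlap would give a curriculum of increasingly hard examples, with labels that are exact by construction.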
It's unclear from the docs: does your solution support inferring the number of speakers from the audio? I found it a bit frustrating that this wasn't automatic in the diarization algorithms I tried last year.
We've just open-sourced MARS5 and are bullish about its ability to capture very hard prosody -- hopefully you can validate the results and grow alongside its community.
We tend to agree: the time when just one company could seriously do speech is over. It needs to be more diverse, and it needs to be open source.
https://github.com/Camb-ai/MARS5-TTS
I absolutely love how good the voices are in the VCTK VITS model (109 of them!). I found it easy to install Coqui on WSL, and it uses CUDA and the GPU quite effectively. p236 (male) and p237 (female) are my choices, but holy cow, 109 quality voices still blows my mind. Crazy how you had to pay for good TTS just a year ago; now it's commoditized. Hope you find this useful:
CUDA_VISIBLE_DEVICES="0" python TTS/server/server.py --model_name tts_models/en/vctk/vits --use_cuda True
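Once the server is running, a client just needs to build a GET request against it. A minimal sketch of constructing the request URL (the `/api/tts` endpoint and `speaker_id` parameter are assumptions based on the Coqui TTS demo server; double-check against your version):

```python
import urllib.parse

def tts_request_url(text, speaker_id="p236", host="http://localhost:5002"):
    # The demo server exposes GET /api/tts; "speaker_id" selects one of the
    # 109 VCTK voices. Endpoint and parameter names are assumptions here.
    query = urllib.parse.urlencode({"text": text, "speaker_id": speaker_id})
    return f"{host}/api/tts?{query}"

url = tts_request_url("Hello there!", speaker_id="p237")
```

Fetching that URL (e.g. with `requests.get(url)`) should return WAV bytes in the response body, which is what the `play_sound` snippet below consumes.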
import threading
import winsound

# Learning: you have to use a semaphore to serialize calls to
# winsound.PlaySound(), which freaks out with "Failed to play sound"
# if you try to play 2 clips at once.
semaphore = threading.Semaphore(1)

def play_sound(response):
    semaphore.acquire()
    try:
        winsound.PlaySound(response.content, winsound.SND_MEMORY | winsound.SND_NOSTOP)
    finally:
        # Always release the permit, even if PlaySound raises an exception.
        semaphore.release()
While the other commenters provided several voice cloning projects, I would like to point out that I haven't been able to find one that works well for South American Spanish.
One of my favorite typos. ;) Also, the coquí is a frog from Puerto Rico (which wound up in Hawaii by sneaking into someone's luggage, or something to that effect); when you hear them at night, what you're hearing is their mating call, if I remember correctly.
[1] https://www.3d6downtheline.com/
[2] https://github.com/MahmoudAshraf97/whisper-diarization/
I've been using Dimio's Speech for a decade, but it seems silly now that much better voices exist.