
StyleTTS2 – open-source Eleven-Labs-quality Text To Speech

725 points | sandslides | 2 years ago | github.com

234 comments


modeless|2 years ago

I made a 100% local voice chatbot using StyleTTS2 and other open source pieces (Whisper and OpenHermes2-Mistral-7B). It responds so much faster than ChatGPT. You can have a real conversation with it instead of the stilted Siri-style interaction you have with other voice assistants. Fun to play with!

Anyone who has a Windows gaming PC with a 12 GB Nvidia GPU (tested on 3060 12GB) can install and converse with StyleTTS2 with one click, no fiddling with Python or CUDA needed: https://apps.microsoft.com/detail/9NC624PBFGB7

The demo is janky in various ways (requires headphones, runs as a console app, etc), but it's a sneak peek at what will soon be possible to run on a normal gaming PC just by putting together open source pieces. The models are improving rapidly, there are already several improved models I haven't yet incorporated.
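The loop such a chatbot runs is conceptually simple. A minimal sketch, with all four components as stand-in callables rather than the actual Whisper/OpenHermes/StyleTTS2 APIs:

```python
def voice_chat_loop(transcribe, generate_reply, synthesize, play):
    # One conversational turn at a time: mic -> STT -> LLM -> TTS -> speakers.
    # All four callables are hypothetical stand-ins for the real components
    # (e.g. Whisper, OpenHermes2-Mistral-7B, StyleTTS2).
    while True:
        user_text = transcribe()           # blocks until an utterance ends
        if user_text.strip().lower() == "goodbye":
            break
        reply = generate_reply(user_text)  # LLM turn
        play(synthesize(reply))            # TTS + audio out
```

The responsiveness the comment describes comes from each stage being local and fast, not from any cleverness in the loop itself.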

lucubratory|2 years ago

How hard on your end does the task of making the chatbot converse naturally look? Specifically I'm thinking about interruptions, if it's talking too long I would like to be able to start talking and interrupt it like in a normal conversation, or if I'm saying something it could quickly interject something. Once you've got the extremely high speed, theoretically faster than real time, you can start doing that stuff right?

There is another thing remaining after that for fully natural conversation, which is making the AI context aware like a human would be. Basically giving it eyes so it can see your face and judge body language to know if it's talking too long and needs to be more brief, the same way a human talks.

eigenvalue|2 years ago

Tried it, but it seems it only works with CUDA 11 and I have 12 installed. Not really willing to potentially screw up my CUDA environment to try it.

shon|2 years ago

Cool work! I tested it and got some mixed results:

1) It throws an error if it's installed to any drive other than C:\. I moved it to C:\ and it works fine.

2) I'm seeing huge latency on an EVGA 3080 Ti with 12GB. I'm also seeing it repeat the parsed input: even though I only spoke once, it appears to process the same input many times, sometimes with slightly different predictions. Here are some logs:

  Latency to LLM response: 4.59  latency to speaking: 5.31
  speaking 4: Hi Jim!
  user spoke: Hi Jim.
  user spoke recently, prompting LLM.  last word time: 77.81  time: 78.11742429999867  latency to prompting: 0.31

  Latency to LLM response: 2.09  latency to speaking: 3.83
  speaking 5: So what have you been up to lately?
  user spoke: So what have you been up to lately?
  user spoke recently, prompting LLM.  last word time: 83.9  time: 84.09415280001122  latency to prompting: 0.19
  user spoke: So what have you been up to lately? No, I'm watching.
  user spoke a while ago, ignoring.  last word time: 86.9  time: 88.92142140000942
  user spoke: So what have you been up to lately? No, just watching TV.
  user spoke a while ago, ignoring.  last word time: 87.9  time: 90.76665070001036
  user spoke: So what have you been up to lately? No, I'm just watching TV.
  user spoke a while ago, ignoring.  last word time: 87.9  time: 94.16581820001011
  user spoke: So what have you been up to lately? No, I'm just watching TV.
  user spoke a while ago, ignoring.  last word time: 88.9  time: 97.85854300000938
  user spoke: So what have you been up to lately? No, I'm just watching TV.
  user spoke a while ago, ignoring.  last word time: 87.9  time: 101.54986060000374
  user spoke: No, I just bought you a TV.
  user spoke a while ago, ignoring.  last word time: 87.8  time: 104.51332219998585
  user spoke: No, I'll just watch you TV.
  user spoke a while ago, ignoring.  last word time: 87.41  time: 106.60086529998807
  Latency to LLM response: 46.09  latency to speaking: 50.49

Thanks for posting it!

Edit:

3) It's hearing itself and responding to itself...
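A common software-side mitigation for that last issue, beyond wearing headphones, is to gate microphone input while the bot is playing its own audio. A minimal sketch (class name, timings, and the 0.5s tail are made up for illustration, not taken from the app):

```python
import time

class MicGate:
    # Drop microphone input while (and briefly after) the bot is speaking,
    # a crude software fix for the bot hearing and answering itself.
    def __init__(self, tail=0.5):
        self.tail = tail            # keep gating a bit after playback ends
        self.speaking_until = 0.0

    def bot_started_speaking(self, duration_seconds):
        self.speaking_until = time.monotonic() + duration_seconds

    def accept_mic_audio(self):
        # True only when the bot is not (recently) speaking.
        return time.monotonic() > self.speaking_until + self.tail
```

Proper acoustic echo cancellation is the robust fix, but a gate like this is often enough for a demo.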

funtech|2 years ago

Is 12GB the minimum? I got an out-of-memory error with 8GB.

samsepi0l121|2 years ago

But Whisper doesn't support input streaming, so don't you have to wait for the whole utterance before triggering the transcription?
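One common workaround is pseudo-streaming: re-transcribe the growing audio buffer each time a new chunk arrives and keep the latest hypothesis. A rough sketch; `transcribe` is a stand-in callable, not Whisper's actual API:

```python
def pseudo_stream_transcripts(audio_chunks, transcribe):
    # Whisper has no streaming mode, so fake it: each time a new chunk of
    # mic audio arrives, re-transcribe everything captured so far and emit
    # the latest (progressively more complete) hypothesis. Wasteful but simple.
    buffer = b""
    hypotheses = []
    for chunk in audio_chunks:
        buffer += chunk
        hypotheses.append(transcribe(buffer))
    return hypotheses
```

This is plausibly why shon's logs above show the same sentence transcribed repeatedly with slightly different wording.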

zestyping|2 years ago

Wow! For those of us who don't have the necessary GPU hardware, can you post a video?

tomp|2 years ago

How do you get Whisper to be fast?

Isn't it quite non-realtime?

aik|2 years ago

Hey modeless. Love it. Is your project open source by any chance? Would love to see it.

xena|2 years ago

It threw a Python exception for me and didn't generate speech.

lhl|2 years ago

I tested StyleTTS2 last month; here are my step-by-step notes, which might be useful for people doing a local setup (it's not too hard): https://llm-tracker.info/books/howto-guides/page/styletts-2

Also, I did a little speed/quality shoot-out with the LJSpeech model (vs VITS and XTTS). StyleTTS2 was pretty good and very fast: https://fediverse.randomfoo.net/notice/AaOgprU715gcT5GrZ2

kelseyfrog|2 years ago

> inferences at up to 15-95X (!) RT on my 4090

That's incredible!

Are infill and outpainting equivalents possible? Super-RT TTS at this level of quality opens up a diverse array of uses esp for indie/experimental gamedev that I'm excited for.

rahimnathwani|2 years ago

Thanks. Following the instructions now. BTW mamba is no longer recommended (for those like me who aren't already using it), and the #mambaforge anchor in the link didn't work.

eigenvalue|2 years ago

Was somewhat annoying to get everything to work as the documentation is a bit spotty, but after ~20 minutes it's all working well for me on WSL Ubuntu 22.04. Sound quality is very good, much better than other open source TTS projects I've seen. It's also SUPER fast (at least using a 4090 GPU).

Not sure it's quite up to Eleven Labs quality. But to me, what makes Eleven so cool is that they have a large library of high quality voices that are easy to choose from. I don't yet see any way with this library to get a different voice from the default female voice.

Also, the real special sauce for Eleven is the near instant voice cloning with just a single 5 minute sample, which works shockingly (even spookily) well. Can't wait to have that all available in a fully open source project! The services that provide this as an API are just too expensive for many use cases. Even the OpenAI one which is on the cheaper side costs ~10 cents for a couple thousand word generation.

eigenvalue|2 years ago

To save people some time, this is tested on Ubuntu 22.04 (google is being annoying about the download link, saying too many people have downloaded it in the past 24 hours, but if you wait a bit it should work again):

  git clone https://github.com/yl4579/StyleTTS2.git
  cd StyleTTS2
  python3 -m venv venv
  source venv/bin/activate
  python3 -m pip install --upgrade pip
  python3 -m pip install wheel
  pip install -r requirements.txt
  pip install phonemizer
  sudo apt-get install -y espeak-ng  # backend needed by phonemizer
  pip install gdown                  # to fetch the pre-trained models from Google Drive
  gdown https://drive.google.com/uc?id=1K3jt1JEbtohBLUA0X75KLw36TW7U1yxq
  7z x Models.zip
  rm Models.zip
  gdown https://drive.google.com/uc?id=1jK_VV3TnGM9dkrIMsdQ_upov8FrIymr7
  7z x Models.zip
  rm Models.zip
  pip install ipykernel pickleshare nltk SoundFile
  python -c "import nltk; nltk.download('punkt')"  # sentence tokenizer used by the demo notebooks
  pip install --upgrade jupyter ipywidgets librosa
  python -m ipykernel install --user --name=venv --display-name="Python (venv)"
  jupyter notebook
  
Then navigate to /Demo and open either `Inference_LJSpeech.ipynb` or `Inference_LibriTTS.ipynb` and they should work.

wczekalski|2 years ago

One thing I've seen done for style cloning is a high quality fine tuned TTS -> RVC pipeline to "enhance" the output. TTS for intonation + pronunciation, RVC for voice texture. With StyleTTS and this pipeline you should get close to ElevenLabs.
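Structurally that pipeline is just two stages chained together. A trivial sketch, with both models as hypothetical stand-in callables:

```python
def tts_then_rvc(tts_synthesize, rvc_convert, text, target_voice):
    # Two-stage "enhance" pipeline: the TTS model (e.g. StyleTTS2) supplies
    # intonation and pronunciation, then an RVC voice-conversion model
    # re-renders the audio in the target voice's texture.
    base_audio = tts_synthesize(text)
    return rvc_convert(base_audio, target_voice)
```

The interesting work is in training/choosing the two models, not in the glue.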

sandslides|2 years ago

The LibriTTS demo clones unseen speakers from a five-second or so clip.

wczekalski|2 years ago

Have you tested longer utterances with both ElevenLabs and StyleTTS? Short audio synthesis is a ~solved problem in the TTS world, but things start falling apart once you want to do something like create an audiobook with text to speech.

satvikpendem|2 years ago

Funnily enough, the TTS2 examples sound better than the ground truth [0]. For example, the "Then leaving the corpse within the house [...]" example has the ground truth pronounce "house" weirdly, with some change in the tonality that sounds higher, but the TTS2 version sounds more natural.

I'm excited to use this for all my ePub files, many of which don't have corresponding audiobooks, such as a lot of Japanese light novels. I am currently using Moon+ Reader on Android which has TTS but it is very robotic.

[0] https://styletts2.github.io/

qingcharles|2 years ago

My first wife is a professional voice-over actor. I saw someone left her a bad review saying "Clearly an AI."

2023. There is no way to win.

KolmogorovComp|2 years ago

The pace is better, but imho there is still a very noticeable "metallic" tone which makes it inferior to the real thing.

Impressive results nonetheless, and superior to all other TTS.

risho|2 years ago

how are you planning on using this with epubs? i'm in a similar boat. would really like to leverage something like this for ebooks.

gjm11|2 years ago

HN title at present is "StyleTTS2 – open-source Eleven Labs quality Text To Speech". Actual title at the far end doesn't name any particular other product; arXiv paper linked from there doesn't mention Eleven Labs either. I thought this sort of editorializing was frowned on.

stevenhuang|2 years ago

Eleven Labs is the gold standard for voice synthesis. There is nothing better out there.

So it is extremely notable for an open source system to be able to approach this level of quality, which is why I'd imagine most would appreciate the comparison. I know it caught my attention.

modeless|2 years ago

It is editorializing and it is an exaggeration. However I've been using StyleTTS2 myself and IMO it is the best open source TTS by far and definitely deserves a spot on the top of HN for a while.

GaggiX|2 years ago

Yes, it's against the guidelines. In fact, when I read the title, I didn't think it was a new research paper but a random GitHub project.

jasonjmcghee|2 years ago

Out of curiosity - to folks that have had success with this...

This voice cloning is... nothing like XTTSv2, let alone ElevenLabs.

It doesn't seem to care about accents at all. It does pretty well with pitch and cadence, and that's about it.

I've tried all kinds of different values for alpha, beta, embedding scale, diffusion steps.

Anyone else have better luck?

Sure it's fast and the sound quality is pretty good, but I can't get the voice cloning to work at all.

jsjmch|2 years ago

See my previous comment about this point. ElevenLabs is based on Tortoise-TTS, which was pre-trained on millions of hours of data, but this one was trained only on LibriTTS, which is 500 hours at best. XTTS was also trained on probably millions of speakers in more than 20 languages.

If you have seen millions of voices, there are definitely gonna be some of them that sound like you. It is just a matter of training data, but it is very difficult to have someone collect these large amounts of data and train on it.

dsrtslnd23|2 years ago

See the conclusion remarks in the paper - they acknowledge that voice cloning is not that good (yet).

carbocation|2 years ago

I had the same experience as what you described (with a lot of experimentation with alpha and beta, as well as uploading different audio clips).

wg0|2 years ago

The quality is really, really INSANE and pretty much unimaginable in the early 2000s.

Could have interesting prospects for games where you have an LLM assuming a character and such TTS giving those NPCs a voice.

beachy|2 years ago

This is a big thing for one area I'm interested in - golf simulation.

Currently, playing in a golf simulator has a bit of a post-apocalyptic vibe. The birds are cheeping, the grass is rustling, the gameplay is realistic, but there's not a human to be seen. Just so different from the smacktalking of a real round, or the crowd noise at a big game.

It's begging for some LLM-fuelled banter to be added.

sandslides|2 years ago

Just tried the Colab notebooks. Seems to be very good quality. It also supports voice cloning.

fullstackchris|2 years ago

Great stuff. I took a look through the README, but... what are the minimum hardware requirements to run this? Is this gonna blow up my CPU / hard drive?

thot_experiment|2 years ago

I skimmed the github but didn't see any info on this, how long does it take to finetune to a particular voice?

stevenhuang|2 years ago

I really want to try this but making the venv to install all the torch dependencies is starting to get old lol.

How are other people dealing with this? Is there an easy way to get multiple venvs to share like a common torch venv? I can do this manually but I'm wondering if there's a tool out there that does this.
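One way to approximate a shared torch install with plain venvs is a `.pth` file that points each project venv at a "base" venv's site-packages. A sketch; the venv names and paths are made up, and this only shares pure library lookup, not pinned versions:

```shell
# One heavy "base" venv holds torch; project venvs borrow its packages
# via a .pth file (any path listed in a .pth file under site-packages
# gets appended to sys.path at interpreter startup).
VENVS=$(mktemp -d)   # stand-in for wherever you keep your venvs
python3 -m venv "$VENVS/torch-base"
# "$VENVS/torch-base/bin/pip" install torch torchaudio   # big install, done once

python3 -m venv "$VENVS/styletts2"   # per-project venv, gets project-only deps
BASE_SITE=$("$VENVS/torch-base/bin/python" -c "import sysconfig; print(sysconfig.get_paths()['purelib'])")
PROJ_SITE=$("$VENVS/styletts2/bin/python" -c "import sysconfig; print(sysconfig.get_paths()['purelib'])")
echo "$BASE_SITE" > "$PROJ_SITE/torch-base.pth"
```

The obvious caveat: every project now sees the same torch version, which is exactly the thing requirements.txt files tend to disagree about.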

wczekalski|2 years ago

I use nix to set up the Python env (Python version + poetry + sometimes Python packages that are difficult to install with poetry) and use poetry for the rest.

The workflow is:

  > nix flake init -t github:dialohq/flake-templates#python
  > nix develop -c $SHELL
  > # I'm in the shell with poetry env, I have a shell hook in the nix devenv that does poetry install and poetry activate.

stavros|2 years ago

I generally try to use Docker for this stuff, but yeah, it's the main reason why I pass on these, even though I've been looking for something like this. It's just too hard to figure out the dependencies.

lukasga|2 years ago

Can relate to this problem a lot. I have considered starting using a Docker dev container and making a base image for shared dependencies which I then can customize in a dockerfile for each new project, not sure if there's a better alternative though.

eurekin|2 years ago

Same here. I'm using conda and eyeing simply installing PyTorch into the base conda env.

amelius|2 years ago

> is starting to get old lol.

If it's starting to get old, then this means that an LLM like Copilot should be able to do it for you, no?

carbocation|2 years ago

Having now tried it (the linked repo links to pre-built colab notebooks):

1) It does a fantastic job of text-to-speech.

2) I have had no success in getting any meaningful zero-shot voice cloning working. It technically runs and produces a voice, but it sounds nothing like the target voice. (This includes trying their microphone-based self-voice-cloning option.)

Presumably fine-tuning is needed - but I am curious if anyone had better luck with the zero-shot approach.

Evidlo|2 years ago

What's a ballpark estimate for inference time on a modern CPU?

beltsazar|2 years ago

If AI will render some jobs obsolete, I suppose the first one will be audio book narrators and voice actors.

riquito|2 years ago

I can see a future where the label "100% narrated by a human" (and similar in other industries) will be a thing

washadjeffmad|2 years ago

Hardly. Imagine licensing your voice to Amazon so that any customer could stream any book narrated in your likeness without you having to commit the time to record. You could still work as a custom voice artist, all with a "no clone" clause if you chose. You could profit from your performance and craft in a fraction of the time, focusing as your own agent on the management of your assets. Or, you could just keep and commit to your day job.

Just imagine hearing the final novel of ASoIaF narrated by Roy Dotrice and knowing that a royalty went to his family and estate, or if David Attenborough willed the digital likeness of his voice and its performance to the BBC for use in nature documentaries after his death.

The advent of recorded audio didn't put artists out of business, it expanded the industries that relied on them by allowing more of them to work. Film and tape didn't put artists out of business, it expanded the industries that relied on them by allowing more of them to work. Audio digitization and the internet didn't put artists out of business; it expanded the industries that relied on them by allowing more of them to work.

And TTS won't put artists out of business, but it will create yet another new market with another niche that people will have to figure out how to monetize, even though 98% of the revenues will still somehow end up with the distributors.

jasonjmcghee|2 years ago

I've been playing with XTTSv2 on my 3080ti, and it's slightly faster than the length of the final audio. It's also good quality, but these samples sound better.

Excited to try it out!

exizt88|2 years ago

The weights aren’t MIT-licensed, so this is not usable in commercial applications, right?

acheong08|2 years ago

It is usable in commercial applications, provided you disclose the use of AI. This applies only to the pre-trained models; you can train your own from scratch without these restrictions.

You can fine tune it on your own voice and also not be required to disclose the use of AI.

causality0|2 years ago

What are the chances this gets packaged into something a little more streamlined to use? I have a lot of ebooks I'd love to generate audio versions of.

victorbjorklund|2 years ago

This only works for English voices right?

e12e|2 years ago

No? From the readme:

In Utils folder, there are three pre-trained models:

    ASR folder: It contains the pre-trained text aligner, which was pre-trained on English (LibriTTS), Japanese (JVS), and Chinese (AiShell) corpus. It works well for most other languages without fine-tuning, but you can always train your own text aligner with the code here: yl4579/AuxiliaryASR.

    JDC folder: It contains the pre-trained pitch extractor, which was pre-trained on English (LibriTTS) corpus only. However, it works well for other languages too because F0 is independent of language. If you want to train on singing corpus, it is recommended to train a new pitch extractor with the code here: yl4579/PitchExtractor.

    PLBERT folder: It contains the pre-trained PL-BERT model, which was pre-trained on English (Wikipedia) corpus only. It probably does not work very well on other languages, so you will need to train a different PL-BERT for different languages using the repo here: yl4579/PL-BERT. You can also replace this module with other phoneme BERT models like XPhoneBERT which is pre-trained on more than 100 languages.

acheong08|2 years ago

I am an introvert: I rarely socialize, listen to podcasts at 2x speed, and mostly use subtitles rather than audio for movies, so I probably have a below-average ability to differentiate humans from robots.

I asked someone to play the recordings for me to differentiate. I could not tell which was human (only between StyleTTS2 and the ground truth; the others were obvious).

rsbeare|2 years ago

This is great! Nice work.

I made my own whisper & auto-typer which types what you say (forked whisper-typer).

I added OpenAI Q/A and RAG query feature so I could ask it questions (instead of auto keystroke typing) by voice command. For responses to questions, I used Eleven Labs - but even with latency optimized & streaming, it was slow, so disabled it.

I just swapped from OpenAI to Mistral 7b for Q/A querying. Much more responsive. Stoked to explore StyleTTS2 now!

Really glad that I came across your post. Thank you for sharing!

svapnil|2 years ago

How fast is inference with this model?

For reference, I'm using 11Labs to synthesize short messages - maybe a sentence or something, using voice cloning, and I'm getting it at around 400 - 500ms response times.

Is there any OS solution that gets me to around the same inference time?

wczekalski|2 years ago

It depends on hardware but IIRC on V100s it took 0.01-0.03s for 1s of audio.

api|2 years ago

It should be pretty easy to make training data for TTS. The Whisper STT models are open so just chop up a ton of audio and use Whisper to annotate it, then train the other direction to produce audio from text. So you’re basically inverting Whisper.
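The annotation step api describes could look roughly like this, with `transcribe` standing in for e.g. a call into Whisper (the function name and clip format here are illustrative, not a real Whisper API):

```python
def build_tts_pairs(clips, transcribe):
    # "Inverting Whisper": run an STT model over short audio clips to
    # produce (clip_id, text) training pairs for a TTS model.
    # Clips with empty transcripts are dropped.
    pairs = []
    for clip_id, audio in clips:
        text = transcribe(audio).strip()
        if text:
            pairs.append((clip_id, text))
    return pairs
```

As the reply below notes, the hard part in practice is filtering for clean audio, not the transcription itself.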

eginhard|2 years ago

STT training data includes all kinds of "noisy" speech so that the model learns to recognise speech in any conditions. TTS training data needs to be as clean as possible so that you don't introduce artefacts in the output and this high-quality data is much harder to get. A simple inversion is not really feasible or at least requires filtering out much of the data.

wanderingmind|2 years ago

As a tangent away from LLMs, is there an integration available to use this as a TTS engine on Android? The TTS voice that I have now (RHVoice) for OSMAnd is really driving me crazy and almost makes me want to go back to Google Maps.

GaggiX|2 years ago

They really should have uploaded the models to Hugging Face rather than Google Drive.

wahnfrieden|2 years ago

Is there a way to port this to iOS? Apple doesn't provide an API for their version of this.

visarga|2 years ago

Yes, please integrate it with Mistral and Whisper. This has got to get into the LLM frontends.

Monicjames|2 years ago

So, we've got this open-source TTS wizardry going on, which is kinda like if Siri had a caffeine overdose - faster, snappier, and way more fun at parties. This thing is running on gaming rigs with beefy GPUs, and it's apparently so user-friendly, even your grandma could set it up without accidentally summoning a digital demon.

But here's the real kicker - it's got the manners of a Victorian gentleman. You can rudely interrupt it mid-sentence, and it'll just stop and listen. Politeness level 100. The reverse, though - getting Mr. Bot to interrupt you - is still in the 'that's too much brain for my silicon' phase. Like, how do you teach a bunch of 1s and 0s to know when you're just taking a dramatic pause or actually done with your TED talk?

And get this - they're talking about making this bot read body language. Imagine your laptop judging you for your slouchy posture or that 'I haven't slept properly in days' look. Creepy? Maybe a bit. Cool? Absolutely.

In conclusion, StyleTTS2 is shaping up to be the cool new kid on the block, but it's still learning the ropes of human conversation. It's like that super smart friend who knows everything about quantum physics but can't tell when you're sarcastically saying 'Yeah, sure, let's invade Mars tomorrow.'

deknos|2 years ago

Is this really opensource and/or free software? like code, data(set/s) and models?

I am quite tired of seeing "open-source" advertisements where half or more of it is not really free.

general psa: please be honest in your announcements :|

acheong08|2 years ago

MIT licensed. Models, code, and everything is available right there when you click the link.

Maybe actually check it out before complaining.

swyx|2 years ago

silicon valley is very leaky, eleven labs is widely rumored to have raised a huge round recently. great timing because with OpenAI's TTS and now this thing the options in the market have just expanded greatly.

ideasman42|2 years ago

Once this is working, is there a simple way to switch voices with the default downloaded models? Or does this require downloading other models or generating them?

ideasman42|2 years ago

When trying to input a larger amount of text I get the error:

The expanded size of the tensor (4293) must match the existing size (512)

Any way to fix this from the IPython notebook examples?
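That error suggests the model has a fixed 512-token window, so the usual workaround is to split long text into chunks and synthesize each chunk separately (the demo notebooks do something similar with nltk sentence tokenization). A rough sketch; the 512 limit and the naive whitespace tokenizer here are assumptions, not taken from the repo:

```python
def chunk_text(text, max_tokens=512, tokenize=str.split):
    # Split long input at sentence-ish boundaries so each chunk fits the
    # model's fixed window; a single sentence longer than max_tokens is
    # kept whole (this sketch does not split mid-sentence).
    chunks, current = [], []
    for sentence in text.replace("?", ".").replace("!", ".").split("."):
        words = tokenize(sentence)
        if not words:
            continue
        if current and len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

You would then run inference per chunk and concatenate the audio.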

kats|2 years ago

This is really harmful and unethical work. It will be used to hurt millions of elderly people with scams; that's the real application, and it will happen 100x more than anything else. It's unethical and harmful to release tools that will be overwhelmingly used to hurt elderly people. What they should do instead: stop releasing models and only offer a service, so scammers can't use it; release only watermarked audio, so apps can tell that a phone call might be a scam; and when sharing models with researchers, use previous best practices, like a Google Form to request access.

slow_numbnut|2 years ago

Just imagine if this line of thinking was used elsewhere.

This tech is already out of the bag and I thank the author(s) for the contribution to humanity. The correct solution here is not to shove your head in the sand and ignore reality, but to get your government to penalize any country or company that facilitates this crime. If they can force severe penalties for other financial crimes and funding terrorism, they can do the same here.

mx20|2 years ago

Scammers scamming old people is already very widespread, so should we maybe outlaw telephones as well? Or maybe mandate anti-scamming filters that disconnect if something is discussed that could be a scam? If I think about it, that actually would make more sense, but it would still be problematic.

127|2 years ago

Cars actually kill over a million people per year. Not saying this is good, just that all technology has its tradeoffs.

flarg|2 years ago

Millions of elderly people are already getting scammed by overseas call centers so unless we do something more significant this tech will not make one iota of a difference.

lfmunoz4|2 years ago

Been looking for a speech to text that can work in real time and run locally, anyone know which are the best options available?

Havoc|2 years ago

Those sound incredibly good.

Though would def like to clone a pleasant voice on it before using. Those sound good but not my cup of tea

tomcam|2 years ago

Very impressive. It would take me a long time to even guess that some of these are text to speech.

readyplayernull|2 years ago

Someone please create a TTS with marked-down emotions/intonations.

ddmma|2 years ago

Well done, been waiting for a moment like this. Will give it a try!

lxe|2 years ago

Wow this thing is wicked fast!

progbits|2 years ago

> MIT license

> Before using these models, you agree to [...]

No, this is not MIT. If you don't like MIT license then feel free to use something else, but you can't pretend this is open source and then attempt to slap on additional restrictions on how the code can be used.

gpm|2 years ago

As I understand it the source code is licensed MIT, the weights are licensed "weird proprietary license that doesn't explicitly grant you any rights and implicitly probably grants you some usage rights so long as you tell the listeners or have permission from the voice you cloned".

Which, if you think the weights are copyright-able in the first place, makes them practically unusable for anything commercial/that you might get sued over because relying on a vague implicit license is definitely not a good idea.

weego|2 years ago

I think you mis-parsed the disclaimer. It's just warning people that cloned voices come with a different set of rights than the software (because the person whose voice is cloned has rights to their voice).

IshKebab|2 years ago

I think that's referring to the pre-trained models, not the source code.

ericra|2 years ago

This bothered me as well. I opened an issue on the repo asking them to consider updating the license file to reflect these additional requirements.

The wording they currently use suggests that this additional license requirement applies to more than just their pre-trained models.

pdntspa|2 years ago

As if anyone outside of corporate legal actually cares

sandslides|2 years ago

Yes, I noticed that. Doesn't seem right, does it?

mlsu|2 years ago

We're now at "free, local, AI friend that you can have conversations with on consumer hardware" territory.

- synthesize an avatar using stablediffusion

- synthesize conversation with llama

- synthesize the voice with this text thing

soon

- VR

- Video

wild times!

trafficante|2 years ago

Seems like a fun afternoon project to get this hooked into one of the Skyrim TTS mods. I previously messed around with elevenlabs, but it had too much latency and would be somewhat expensive long term so I’m excited to try local and free.

I’m sure I have a lot of reading up to do first, but is it a safe assumption that I’d be better served running this on an M2 MBP rather than taxing my desktop’s poor 3070 running it on top of Skyrim VR?

cloudking|2 years ago

Would be great to have a local home assistant voice interface with this + llama + whisper.

imiric|2 years ago

I'm looking forward to this tech being used in video games, as well as generative models in general. Interacting with smart NPCs will make everyone's experience different. The avatars themselves could be dynamically generated, and entire environments for that matter. Truly game changing technology for interactive entertainment.

jpeter|2 years ago

Which consumer gpu runs llama 70B?

Hamcha|2 years ago

Yup, and you can already mix and match both local and cloud AIs with stuff like SillyTavern/RealmPlay if you wanna try what the experience is like, people have been using it to roleplay for a while.

mazoza|2 years ago

meh this is not that good. Sounds quite boring.

ChildOfChaos|2 years ago

Agreed, this isn't Eleven Labs quality at all.

godelski|2 years ago

Why name it Style<anything> if it isn't a StyleGAN? Looks like the first one wasn't either. Interesting to see moves away from flows, especially when none of the flows were modern.

Also, is no one clicking on the audio links? There are some... questionable ones... and I'm pretty sure lots of mistakes.

lhl|2 years ago

It's not called a GAN TTS, right? StyleGAN is called what it is because of a "style-based" approach, and StyleTTS/2 seems to be doing the same (applying style transfer) through a different method (and disentangling style from the rest of the voice synthesis).

(Actually, I looked at the original StyleTTS paper, and it even partially uses AdaIN in the decoder, which is the same way StyleGAN injected style information. Still, I think this is beside the point for the naming.)

gwern|2 years ago

> Looks like the first one wasn't either.

The first one says it uses AdaIN layers to help control style? https://arxiv.org/pdf/2205.15439.pdf#page=2 Seems as justifiable as the original StyleGAN calling itself StyleX...