One thing I've seen done for style cloning is a high quality fine tuned TTS -> RVC pipeline to "enhance" the output. TTS for intonation + pronunciation, RVC for voice texture. With StyleTTS and this pipeline you should get close to ElevenLabs.
I suspect they are doing many more things to make it sounds better. I certainly hope open source solutions can approach that level of quality, but so far I've been very disappointed.
eigenvalue|2 years ago
KolmogorovComp|2 years ago
a2128|2 years ago
stavros|2 years ago