top | item 38834168


z991 | 2 years ago

I commend the authors on making this easy to try! However, it doesn't work very well for me for general voice cloning. I read the first paragraph of the Wikipedia page on books and had it generate the next sentence. It's obviously computer generated to my ear.

Audio sample: https://storage.googleapis.com/dalle-party/sample.mp3

Cloned voice (converted to mp3): https://storage.googleapis.com/dalle-party/output_en_default...

All I did was install the packages with pip and then run "demo_part1.ipynb" with my audio sample plugged in. It ran almost instantly on my laptop's 3070 Ti (8 GB). (Also, I admit I didn't read the paper; I just ran the code.)
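For anyone who wants to reproduce this, here is a rough sketch of the steps described above. The notebook name comes from the comment itself; the repository URL and the editable pip install are my assumptions about the project's layout, so check the project's README for the actual setup.

```shell
# Hedged sketch of the reproduction steps, not the project's official docs.
# Assumes the repo URL below and a standard setup.py / pyproject layout.
git clone https://github.com/myshell-ai/OpenVoice.git
cd OpenVoice
pip install -e .                     # install the declared dependencies
jupyter notebook demo_part1.ipynb    # swap in your own reference audio clip
```

On a consumer GPU like the 3070 Ti mentioned above, inference for a single short clip should complete quickly; the slow part is the one-time dependency install.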


dijksterhuis|2 years ago

> It's obviously computer generated to my ear.

From the README

    Disclaimer

    This is an open-source implementation that approximates the performance of the internal voice clone technology of myshell.ai. The online version in myshell.ai has better 1) audio quality, 2) voice cloning similarity, 3) speech naturalness and 4) computational efficiency.

uoaei|2 years ago

So this paper is a thinly veiled ad of myshell.ai's services?

3abiton|2 years ago

Not totally unexpected unfortunately. Any other OSS players on the market?

fbdab103|2 years ago

Thanks for the real example. Sounded quite generated to my ear as well. Wonder if it can do any better with more source material.

pclmulqdq|2 years ago

Looking at the website and the examples, it's pretty clearly set up to make stylized anime voices.

japanman185|2 years ago

This is the driver for a lot of things. Anime. x264 was developed to enable better compression of weeb videos. This tech will allow fan dubs to better match the original voices in anime.

thorum|2 years ago

My experience with other tools like xtts is that you really need a studio-quality voice sample to get the best results.

amluto|2 years ago

The most obvious problem to my ears is the syllable timing and inflection of the generated speech, and, intuitively, this doesn’t seem like a recording quality issue. It’s as if it did a mostly credible job of emulating the speaker trying to talk like a robot.

nxobject|2 years ago

That might be the next big contribution: perceptually capturing a speaker's features from a not-so-good recording, for example one made with a webcam-style microphone.