msclrhd | 5 years ago
The different systems (Klatt, MBROLA, etc.) build on these basic techniques. The *OLA systems are more space-intensive, as they need to store more audio data to synthesize the voice, but they tend to produce more natural voices.
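The core overlap-add idea can be sketched in a few lines. This is a simplified illustration, not PSOLA or MBROLA's actual algorithm: real systems place the joins pitch-synchronously, while this just crossfades a fixed-length overlap between two recorded units.

```python
import numpy as np

def ola_join(a, b, overlap=64):
    """Crossfade the tail of `a` into the head of `b` (simple overlap-add)."""
    fade = np.linspace(0.0, 1.0, overlap)
    mixed = a[-overlap:] * (1.0 - fade) + b[:overlap] * fade
    return np.concatenate([a[:-overlap], mixed, b[overlap:]])

# Joining two illustrative "units": the result is shorter than the sum,
# because `overlap` samples are shared at the seam.
a = np.ones(200)
b = np.ones(300) * 0.5
out = ola_join(a, b)
print(out.shape)  # (436,) == 200 + 300 - 64
```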
The other complexity is how the individual phonemes (the building blocks of speech) are stored. Different languages and accents have different sets of phonemes, so more complex languages require more data. Systems like MBROLA store the data as diphones (from the mid-point of one phoneme to the mid-point of the next), which makes the joins between phonemes easier. There are more diphones than phonemes, although not every combination occurs in a given language and accent.
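The mapping from a phoneme sequence to the diphone units a system would look up can be sketched as follows. The unit names and the phoneme sequence for "hello" are illustrative, not MBROLA's actual inventory or file format; silence is written as `_`.

```python
def to_diphones(phonemes):
    """Pair each phoneme with its successor; silence (_) pads both edges."""
    padded = ["_"] + phonemes + ["_"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

# "hello" as a rough phoneme sequence
print(to_diphones(["h", "@", "l", "@U"]))
# ['_-h', 'h-@', '@-l', 'l-@U', '@U-_']
```

Note that a sequence of n phonemes needs n+1 diphone units, which is why the diphone inventory is larger than the phoneme inventory even before accounting for impossible combinations.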
This data is controlled by different prosodic parameters (e.g. pitch and duration). These parameters can be driven by neural nets, or by techniques like decision trees and probabilistic models.
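A minimal sketch of what those per-phoneme prosody targets look like, with a toy hand-written rule standing in for the decision trees or probabilistic models mentioned above. All values and the declination rule are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class ProsodyTarget:
    phoneme: str
    duration_ms: int
    pitch_hz: float

def predict_prosody(phonemes, base_pitch=120.0):
    targets = []
    for i, p in enumerate(phonemes):
        # Toy rule: vowels get longer durations, and pitch declines
        # linearly over the phrase (a crude declination model).
        is_vowel = p[0].lower() in "aeiou@"
        dur = 120 if is_vowel else 70
        pitch = base_pitch * (1.0 - 0.05 * i / max(len(phonemes) - 1, 1))
        targets.append(ProsodyTarget(p, dur, round(pitch, 1)))
    return targets

for t in predict_prosody(["h", "@", "l", "@U"]):
    print(t)
```

A concatenative synthesizer would then stretch and pitch-shift each stored unit to hit these targets.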
IIUC, there are two approaches to using neural nets for audio generation: 1) generate the LPC/formant parameters; 2) generate the audio directly. Either can operate on phonemes, or on the text directly. Operating on the text directly ties the voice to a particular language and accent, but potentially allows the neural net to infer pronunciation rules itself.
sdenton4 | 5 years ago
Hmmmm.... In my experience, the dominant option for (neural) TTS is to have one model generate a melspectrogram from the text (handling prosody) and a second model synthesize samples from the melspectrogram. (Tacotron, Lyrebird, and this Facebook group are all doing this.) There are certainly research projects on going directly from text to samples, but it's not the currently winning strategy... Maybe eventually, though. The Text->Mel portion isolates the prosody and pronunciation problem, and provides a nice place to add extra conditioning.
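The two-stage pipeline described above can be made concrete with a schematic. The classes below are placeholders, not Tacotron's actual API; the frames-per-character and hop-length figures are illustrative, chosen only to show how the shapes relate.

```python
import numpy as np

class TextToMel:
    """Stands in for an acoustic model like Tacotron (text -> melspectrogram)."""
    def __call__(self, text, n_mels=80):
        n_frames = 5 * len(text)  # pretend ~5 mel frames per character
        return np.zeros((n_frames, n_mels))

class Vocoder:
    """Stands in for a neural vocoder (melspectrogram -> samples)."""
    def __call__(self, mel, hop_length=256):
        # One hop of audio per mel frame
        return np.zeros(mel.shape[0] * hop_length)

mel = TextToMel()("hello")
audio = Vocoder()(mel)
print(mel.shape, audio.shape)  # (25, 80) (6400,)
```

The melspectrogram is the interface between the two models, which is what makes it a convenient place to inject extra conditioning (speaker identity, style, etc.).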
On the vocoder side: LPC, F0, etc. can all be estimated from a reasonably sized melspectrogram; for the most part, these neural models just let one big vocoder model handle all of the things that were traditionally (fragile!) subtasks. The question is which "classical" parts are both cheap and reliable: those you can compute on the side to lighten the neural network's burden. LPC is great for this.
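To show how cheap that classical subtask is, here is a sketch of autocorrelation-method LPC (the idea behind computing LPC "on the side", as in LPCNet-style vocoders). A real implementation would use Levinson-Durbin recursion and windowed frames; this solves the normal equations directly, and the order and test signal are illustrative.

```python
import numpy as np

def lpc(frame, order=16):
    """LPC coefficients a[1..order]: predict x[n] from the previous samples."""
    # Autocorrelation sequence r[0..len-1]
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Toeplitz normal equations R a = r[1:order+1]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

# A sinusoid is (almost) perfectly predictable with order 2
x = np.sin(2 * np.pi * 0.05 * np.arange(400))
a = lpc(x, order=2)
pred = a[0] * x[1:-1] + a[1] * x[:-2]
residual = x[2:] - pred
print(np.sum(residual**2) / np.sum(x**2))  # tiny: the predictor captures the tone
```

The residual is much easier for a small neural model to generate than the raw waveform, which is the burden-lightening the comment refers to.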