mazoza | 1 year ago
hertz-vae: a 1.8 billion parameter transformer decoder which acts as a learned prior for the audio VAE. The model uses a context of 8192 sampled latent representations (17 minutes) and predicts the next encoded audio frame as a mixture of Gaussians. 15 bits of quantized information from the next token act as semantic scaffolding to steer the generation in a streamable manner.
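"Mixture of Gaussians" here means the decoder's output head parameterizes a small set of Gaussian components (weights, means, spreads) per latent dimension, and sampling picks a component then draws from it. A minimal stdlib sketch, with a made-up 3-component head (the real hertz-vae's component count and parameterization are not given in this thread):

```python
import math
import random

def sample_mixture(logits, means, stds, rng):
    """Draw one sample from a 1-D mixture of Gaussians."""
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]   # softmax over components
    total = sum(weights)
    weights = [w / total for w in weights]
    k = rng.choices(range(len(weights)), weights=weights)[0]  # pick a component
    return rng.gauss(means[k], stds[k])           # sample from that Gaussian

rng = random.Random(0)
logits = [2.0, 0.0, -1.0]   # hypothetical 3-component head
means = [-0.5, 0.0, 0.5]
stds = [0.1, 0.1, 0.1]
x = sample_mixture(logits, means, stds, rng)      # one sampled latent value
```

In the real model these parameters would come out of the transformer at each step, one mixture per latent dimension.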
programjames | 1 year ago
1. `codec`: First, compress 16 kHz sample rate audio into 8 samples per second with convolutions. Then, vector quantize to 128 bits (probably 8 floats) to get a codec. This is not nearly enough bits to actually represent the audio; it's more to represent phonemes.
2. `vae`: This looks like a VAE-based diffusion model that uses the codec as its prompt.
3. `dev`: This is a next-codec-prediction model.
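The back-of-envelope arithmetic behind the codec claim: 16,000 samples/s down to 8 latents/s is a 2000x temporal compression, and 8 latents/s at 128 bits each is about 1 kbit/s, far below what a full-audio codec needs, which supports the "phonemes, not waveforms" reading:

```python
# Rate arithmetic for the codec described above (numbers from the comment).
sample_rate = 16_000      # input audio samples per second
codec_rate = 8            # quantized latents per second
bits_per_latent = 128     # vector-quantized code size

downsampling = sample_rate // codec_rate   # temporal compression factor
bitrate = codec_rate * bits_per_latent     # bits per second of audio

print(downsampling)  # 2000
print(bitrate)       # 1024
```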
Put together, it probably runs like so:
1. Turn your prompt into tokens with the `codec`.
2. If you want `s` more seconds of audio, use `dev` to predict `8 * s` more tokens.
3. Turn it back into audio with the `vae` diffusion model.
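The three steps above can be sketched as a loop. Everything below is a hypothetical stand-in (trivial stubs so the control flow runs); the real `codec`, `dev`, and `vae` models are large networks, but the shape of the pipeline is the same:

```python
# Hedged end-to-end sketch of the three-stage pipeline described above.
CODEC_RATE = 8  # tokens per second of audio, per the codec comment

def codec_encode(audio):
    # stand-in for the codec: one "token" per 2000-sample window (16 kHz / 8)
    return [hash(tuple(audio[i:i + 2000])) % 2**128
            for i in range(0, len(audio), 2000)]

def dev_predict_next(tokens):
    # stand-in for the autoregressive next-codec model
    return (tokens[-1] + 1) % 2**128

def vae_decode(tokens):
    # stand-in for the diffusion decoder: 2000 samples back per token
    return [0.0] * (len(tokens) * 2000)

def generate(prompt_audio, extra_seconds):
    tokens = codec_encode(prompt_audio)             # step 1: prompt -> tokens
    for _ in range(CODEC_RATE * extra_seconds):     # step 2: predict 8*s tokens
        tokens.append(dev_predict_next(tokens))
    return vae_decode(tokens)                       # step 3: tokens -> audio

audio = generate([0.0] * 16_000, extra_seconds=2)   # 1 s prompt + 2 s generated
```

At 16 kHz the result is 3 s of audio (48,000 samples): 8 prompt tokens plus 16 generated ones, each decoding to 2000 samples.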