muxator | 1 year ago

I suppose the focus was on voice synthesis here. I won't add anything about it since other commenters have already said significant things about this wonderful feat.

Musically, however, I can't help but notice that these models are still very far from being able to generate something interesting: from harmony, to tempo, to musical structure, to dynamics, everything is muddled and without structure. I guess there is still very much to work on, and I am not sure that purely generative models can attain higher levels. Maybe a mixed rule-based and generative approach would do?

Progress in this field is really fast, so I really do not know.

BriggyDwiggs42|1 year ago

I think historically, every time someone says that the solution to an AI problem is more structure, the truth turns out to be mostly an issue of data and scale.

muxator|1 year ago

That's probably true. Maybe it makes sense to trade computational/energy efficiency for actually attaining a result. Let's see how this unfolds.

notjulianjaynes|1 year ago

To my knowledge, the model being used for this is "chirp" which is 'based on' bark[1], an AI text to speech model.

The GitHub page for bark links to a page about chirp, which returns a 404 for me [2]. My guess is that the model used for suno.ai's song generator isn't much different from the text to speech model.

I also have a hunch that it was more coincidence than intention that the bark model was capable of producing music, and that this was spun off into the product.

Unfortunately, there still seem to be issues with bark when generating long (like book length) spoken audio. That is too bad: as someone who's worked jobs that require lots of driving, I think it would be awesome to be able to have any text read to me in a natural sounding voice.

[1] https://github.com/suno-ai/bark

[2] https://www.suno.ai/examples/chirp-v1

Almondsetat|1 year ago

What structure and tempo can you realistically give to the MIT license?

muxator|1 year ago

I'll try to give a serious answer, even if I suppose yours was a nice joke :)

Music is a language, even if it has no semantics. It has conventions, dialects, a syntax, a grammar. There are multiple dimensions a musician uses to convey what he wants/feels: just as an actor has to control his voice, posture, and interplay with other actors all at the same time, so a good musician is aware of the structure of the piece he is composing/executing, the relations between the various subparts, and how the musical discourse progresses in time, besides agogics, dynamics, and sound color.

All of those aspects are continually compared against the conventions of the genre, mixed, evolved, strictly followed or blatantly negated.

This is something that normally a professional musician takes decades to master (apart from musical geniuses).

A listener takes less time to educate himself to appreciate those nuances (but not too little: let's say ~years). Once you develop a taste, it becomes very easy to tell where a piece falls on the spectrum that goes from bad quality tunes to musical artistry.

I see nothing musically interesting in this (wonderful) PoC of speech synthesis.

Just to be clear: I did not see anything particularly stunning even in Google's Bach Doodle from some years ago https://doodles.google/doodle/celebrating-johann-sebastian-b...

anileated|1 year ago

As always with art, the answer is: it depends on what you think, feel like and/or are trying to convey.

kevinmhickey|1 year ago

Reminds me a little bit of Catholic mass when the priest "sings" some of the sections. There is no consistency, no cadence, but their voice goes up and down. It's high-effort talking.

I wonder if these models would do something better if the text were poetic or punctuated differently.

wildzzz|1 year ago

All the AI-generated music just sounds like someone jamming without any hint of a real melody, original or a cover. It's very strange to listen to. It sounds exactly the way an AI-generated photo of a person looks: kinda real until you look/listen closer.