
MusicLM: Generating music from text

291 points | georgehill | 3 years ago | arxiv.org

107 comments

[+] modeless|3 years ago|reply
I love that there's a whole section for accordion examples. I think the accordion rap has a bad word in it. Accordion techno works surprisingly well.

They really need to bump up to 48 kHz so all the music doesn't sound like it's being played over a telephone. A factor of two in the cost shouldn't be prohibitive. So much of the audio generation stuff I've seen has fatal flaws like this baked into the dataset and/or training process that ensure the output can't sound good even in theory; it's kinda frustrating.

It's also frustrating that nobody AFAIK has trained a music model on an actually large dataset. We're training large language models on a significant fraction of the text of the whole internet! Where are the audio models trained on a significant fraction of all recorded music? This one was trained (in part) on a dataset of 280k hours of music, if I read the paper correctly. I don't think that comes anywhere close to being a significant fraction of all recorded music.
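For scale, a rough back-of-envelope on the comment's numbers (the 280k hours and the model's 24 kHz rate come from the thread; the text-corpus comparison is a loose assumption for illustration):

```python
# Back-of-envelope: how much raw audio is 280k hours at 24 kHz?
hours = 280_000
sample_rate = 24_000  # Hz, the rate MusicLM reportedly operates at
seconds = hours * 3600
samples = seconds * sample_rate
print(f"{seconds:.3e} seconds, {samples:.3e} samples")
# ~1e9 seconds, ~2.4e13 samples. Large in absolute terms, but arguably
# a much smaller slice of "all recorded music" than a ~1e12-token text
# corpus is of the public internet.
```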

[+] epistemer|3 years ago|reply
Epic soundtrack using orchestral instruments is somewhat interesting as electronic music but to me everything else is pretty bad.

There are a million industrial techno tracks with repetitive, hypnotic IDM rhythms that absolutely no one listens to anymore. Most of the founders of the whole genre have moved on from lack of interest.

We have been able to do really good generative music in Reaktor for the last 20 years. Much better than anything on this page but no one cares.

People in general want to hear the same slight variation in music over and over as background noise.

The people that are saying how great all this is will be bored with it in 2 weeks or less. Pure technological kitsch.

[+] isoprophlex|3 years ago|reply
Holy hell those are insanely good. Best generated music I've heard so far.

I was born too soon to explore the stars, born too late to be a 17th century pirate... but born just in time to explore these incredible neural networks come to life.

[+] winrid|3 years ago|reply
The vocals (on second page of table) are really interesting. If you walked into a club, you might not notice they are nonsense, but it's just good enough BS to be convincing.
[+] beefman|3 years ago|reply
Totally fails on "swing" and "relaxing jazz".
[+] zan2434|3 years ago|reply
This should replace the OP link
[+] whoomp12342|3 years ago|reply
dope, can it queue up Weird Al? I am curious how generative models would do at coming up with parody
[+] TedDoesntTalk|3 years ago|reply
I’m writing my second video game (old school arcade style side-shooter) and hope I can use this to generate background chip tune music!
[+] snowram|3 years ago|reply
May I recommend to learn how to use a tracker software? It is very fun to play with!
[+] frinnylee|3 years ago|reply
We have an AI platform that finds music for video/images/game screen recordings, and supports musicians at the same time. Check it out: https://avmapping.co/
[+] throwaway290|3 years ago|reply
I hope you also consider getting some music from an actual producer!
[+] dyno12345|3 years ago|reply
I want to do the reverse: pass in some music and have it describe what I'm hearing
[+] brookst|3 years ago|reply
Yes! And not just notes or chords or time signatures, but real analysis of how the piece works.
[+] dimatura|3 years ago|reply
I believe parts of this model could be adapted to accomplish this. The "mulan" network generates embeddings from music and text so that they "match" each other, i.e. audio segments and text captions that go well together would be mapped to the same embedding. At run-time, they use embeddings from the text to generate the music. So conceivably the opposite could be done, using the "mulan" embeddings from audio only to generate text, by replacing the "soundstream" decoder with a text LLM. That said, there are probably simpler methods that could generate decent results, and I'd be surprised if there isn't already something out there. It's worth noting that along with this work they released a dataset of music with captions that could also help train new models for this task.
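The reverse direction described here (audio in, description out) can be sketched as nearest-neighbor retrieval in a shared embedding space. Everything below is a toy: the captions and vectors are made up, assuming only that a MuLan-style model maps audio and text into the same space:

```python
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Hypothetical caption embeddings from a MuLan-style text tower
# (3-d vectors invented for illustration; real ones are much larger).
captions = {
    "relaxing jazz with brushed drums": [0.9, 0.1, 0.0],
    "aggressive industrial techno":     [0.0, 0.2, 0.95],
    "solo accordion waltz":             [0.1, 0.9, 0.1],
}

def describe(audio_embedding):
    # Nearest caption in the shared space = crude "music to text".
    return max(captions, key=lambda c: cosine(audio_embedding, captions[c]))

query = [0.05, 0.25, 0.9]  # pretend this came from the audio tower
print(describe(query))     # -> "aggressive industrial techno"
```

A real system would generate free-form text conditioned on the audio embedding rather than retrieve from a fixed caption list, but the matching principle is the same.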
[+] 52-6F-62|3 years ago|reply
I think you’re missing the point. Music is an experience, not a checkbox or a textbook that needs summarizing.
[+] spyder|3 years ago|reply
Awesome, this is probably the best one to date, even compared to the recent Riffusion, Jukebox and the older MIDI-generating MuseNet.

Especially the conditioning on the humming and whistling examples is cool, but too bad they use very common melodies for that; it makes the job easier for the model and harder for us to judge how well it would work on less common melodies.

[+] LunarAurora|3 years ago|reply
> especially the conditioning on humming and whistling ... but too bad they use very common melodies

100%. Now if you add the detailed "Painting Caption Conditioning" to the mix, you can create a melody in the style of… an image, which, in a way, is kind of a controlled artificial synaesthesia [1].

[1] https://en.wikipedia.org/wiki/Synesthesia

[+] williamcotton|3 years ago|reply
As a musician I don’t see these tools as competing with what I do but enabling me to do so much more. It takes a lot of time and money to create a professional recording for my songs and drum machines and synths just don’t work for Americana. These tools offer the possibility of backing tracks that sound like Willie Nelson’s band from the 70s but at a fraction of the time and effort.

I can’t wait until they get to the point where they’re more composable or auto-accompany given an acoustic guitar and vocal input.

[+] dariosalvi78|3 years ago|reply
AI music is the next thing coming, wait for copyright lawsuits to fall like bombs.

I find it fascinating that MidJourney can make a 3D model of my face from a low quality image, rotate it in space, apply it to someone else's body, add coherent shadows and backgrounds, with a very credible result, and yet an AI cannot generate a decent song, which is 1-dimensional and probably has much less internal modelling to care about?

One reason I can think of is that eyes are "integrators" and ears are "derivators". That is, the human ear is very sensitive to small differences, whereas vision cares more about the ensemble. I don't know, but I think that AI music will come one day. It may not be as great as human music, but it will suffice for, say, putting background music in your startup's cheap marketing ad.

[+] infinitifall|3 years ago|reply
Possibly just anecdotal evidence, but it seems like the pool of good music composers is much much smaller than the pool of good visual artists. Maybe that is an indicator of how "hard" each field is.
[+] 52-6F-62|3 years ago|reply
> which is 1-dimensional and has probably much less internal modelling to care about

There is an ocean between “I like that sound” and a final, produced piece of recorded music. Much of that ocean being ineffable.

Try to reduce it to a set of parameters around the final waveform and you’ve missed the entire point.

> but it will suffice for, say, putting a music background for your startup cheap marketing ad.

We need less of that—not more. It’s like a climate disaster of the soul.

[+] jzombie|3 years ago|reply
In my opinion, good music is multidimensional.

Just like a 3D model is really only 2D on a screen, music can encompass so much more than what is heard on the surface level with no imagination.

[+] jtode|3 years ago|reply
It will be as good as the music it's trained on, no better and probably no worse.
[+] woolion|3 years ago|reply
I don't really understand why this approach is pushed for music. You can overpaint an image, but you can't do that with a song. Cutting an image to reintroduce coherence is easy too. For a song you need MIDI, or another symbolic representation. That was the approach of pop2piano (unfortunately it is limited to covers, not generating from scratch). And even if a generated song is OK, listening to half an hour full of AI mistakes is really tiring. With a symbolic representation you could at least fix the mistakes if there is one good output.
[+] dimatura|3 years ago|reply
I understand what you're saying, although it could be argued that at least for some types of image tasks one would prefer something like an SVG output with layers to make it easier to edit.

For music, I think it's partly an academic question of "can we do it" rather than trying to maximize immediate practical usefulness. There's already quite a bit of work on symbolic music generation (mostly MIDI), a lot of it quite competent, especially in more constrained domains like NES chiptunes or classical piano, so a full text-to-audio pipeline probably seemed a more interesting research problem.

And for a lot of use cases, where people might truly not care too much about tweaking the output to their liking, the generated audio might be good enough; the examples were pretty plausible to my ear, if somewhat lo-fi sounding (probably because it's operating at 24 kHz, compared to the more standard 44.1-48 kHz).

In the future a more hybrid approach probably makes sense for at least some applications, where MIDI is generated along with some way of specifying the timbre for each instrument (hopefully something better than General MIDI, though even that would be fun; not sure if it's been done). I'm sure that in the near future we'll see a lot more work in the DAW and plugin space to have these kinds of things built in, but in a way that they can be edited by the user.

[+] snowram|3 years ago|reply
You can absolutely "overpaint" sound, as sound can be represented as an image and extended in frequency or in time. In fact this is exactly what Riffusion is doing.
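The "sound as image" idea here can be sketched concretely: a short-time Fourier transform turns audio into a 2D magnitude array (frequency bins by time frames), which image-style operations then apply to. This is a minimal sketch, not Riffusion's actual pipeline (which uses mel spectrograms and a diffusion model); the "extension" below just naively repeats the last columns:

```python
import numpy as np

sr = 8_000                            # assumed sample rate for this sketch
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440 * t)   # one second of a 440 Hz tone

# Short-time Fourier transform: windowed frames -> magnitude "image".
frame, hop = 256, 128
frames = [audio[i:i + frame] * np.hanning(frame)
          for i in range(0, len(audio) - frame, hop)]
spec = np.abs(np.fft.rfft(frames, axis=1)).T   # (freq_bins, time_frames)

# "Overpainting" in time: extend the image by repeating its last few
# columns -- a crude stand-in for what a diffusion model would infill.
extended = np.concatenate([spec, spec[:, -8:]], axis=1)
print(spec.shape, extended.shape)     # -> (129, 61) (129, 69)
```

Going back from the edited spectrogram to a waveform is the lossy part, since magnitude-only spectrograms discard phase (Riffusion reconstructs it with Griffin-Lim-style methods).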
[+] jameshart|3 years ago|reply
“For a song you need midi”

Pretty sure the Beatles never handed George Martin any midi files. What’s the symbolic representation that captures the tone of every bend in a Hendrix solo? Did Daft Punk go back and grab the raw master stems of the old vinyl recordings they used to assemble their tracks?

Music producers have been astonishingly creative given inputs in a vast range of formats. Sheet music and midi are one tool, but ultimately it’s about combining sounds in the mix isn’t it?

[+] TaupeRanger|3 years ago|reply
Yet another cherry picked attempt at music gen with a "demo" page that only contains the outputs that happen to not sound like incoherent noise. There's a reason ChatGPT and Midjourney are so popular, but no music-gen tools have even come close: you can actually create stuff with them that is useful and/or enjoyable. Good music gen is much harder, and the reasons for this (still unclear) are pretty important to the future of AI, imo.
[+] mxgr|3 years ago|reply
That's quite a harsh critique that doesn't seem warranted. There are quite a lot of different examples with varying quality. Granted, it's likely that these are to some extent cherry picked, but it's unlikely that these are rare and that most output sounds like incoherent noise, as you seem to insinuate.
[+] not-chatgpt|3 years ago|reply
Their current demos seem worse than the riffusion results I get on average. Music gen is hard because music is inherently composed of many, many different instruments, each with a unique sound and function. Simply training end-to-end will almost always end up badly.
[+] papruapap|3 years ago|reply
Well... not all releases have to be toys for your average Twitter user to play with. Most of them are just academic work to show team progress.
[+] xp84|3 years ago|reply
They said they're not going to allow people to use it based on fear of plagiarism accusations / music industry lawsuits. Ugh, typical.

I want someone to train it on public domain music. Kind of like the YouTube Audio Library but I assume that's not exactly the right license for this. But with sufficient effort someone could make a lot of recordings of public domain music for this purpose and build something that the RIAA thugs couldn't actually touch.

[+] gschoeni|3 years ago|reply
Any plans on releasing the actual audio data and not just a CSV with links to YouTube IDs?

Also maybe a silly question, but what are the legal ramifications of downloading these YouTube videos and training on them yourself? Google must have some rights, but what about people outside of Google?

[+] duckington|3 years ago|reply
I wonder if this technology will eventually revolutionize music the same way synthesizers did. Or at least lead to music and effects/filters that are simply not possible with current DAWs and plugins.

Custom generation of samples from text alone seems revolutionary.

[+] threevox|3 years ago|reply
Google: kings of releasing papers but never shipping anything
[+] andjelam990|3 years ago|reply
Interesting concept! I wonder how the hurdles around copyright will be solved.
[+] guyisra|3 years ago|reply
Yet Another ClosedAI research project with no intention of releasing..

what a joke

[+] akie|3 years ago|reply
I hate how good this is.