Other author here! This got posted a little earlier than we intended, so we didn't have our GPUs scaled up yet. Please hang on and try throughout the day!
This has been our hobby project for the past few months. Seeing the incredible results of stable diffusion, we were curious if we could fine tune the model to output spectrograms and then convert to audio clips. The answer to that was a resounding yes, and we became addicted to generating music from text prompts. There are existing works for generating audio or MIDI from text, but none as simple or general as fine tuning the image-based model. Taking it a step further, we made an interactive experience for generating looping audio from text prompts in real time. To do this we built a web app where you type in prompts like a jukebox, and audio clips are generated on the fly. To make the audio loop and transition smoothly, we implemented a pipeline that does img2img conditioning combined with latent space interpolation.
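For readers curious what "latent space interpolation" looks like concretely, here is a minimal sketch of spherical interpolation (slerp), the interpolation commonly used between diffusion latents; the function and variable names are illustrative, not taken from the Riffusion code:

```python
import numpy as np

def slerp(v0, v1, t):
    """Spherical interpolation between two latent vectors at fraction t in [0, 1]."""
    u0, u1 = v0 / np.linalg.norm(v0), v1 / np.linalg.norm(v1)
    omega = np.arccos(np.clip(np.dot(u0, u1), -1.0, 1.0))
    if omega < 1e-6:  # nearly parallel vectors: fall back to plain lerp
        return (1.0 - t) * v0 + t * v1
    return (np.sin((1.0 - t) * omega) * v0 + np.sin(t * omega) * v1) / np.sin(omega)
```

Walking t from 0 to 1 across successive generations is what lets one clip melt smoothly into the next instead of hard-cutting between prompts.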
Wow, I am blown away. Some of these clips are really good! I love the Arabic Gospel one. John and George would have loved this so much. And the fact that you can make things that sound good by going through visual space feels to me like the discovery of a Deep Truth, one that goes beyond even the Fourier transform because it somehow connects the aesthetics of the two domains.
As one of the meatsacks whose job you're about to kill... eh, I got nothin, it's damn impressive. It's gonna hit electronic music like a nuclear bomb, I'd wager.
I've compiled/run a dozen different image to sound programs and none of them produce an acceptable sound. This bit of your code alone would be a great application by itself.
It'd be really cool if you could implement an MS paint style spectrum painting or image upload into the web app for more "manual" sound generation.
Hi Hayk, I see that the inference code and the final model are open source. I am not expecting it, but are the training code, the dataset you used for fine-tuning, and the process used to generate the dataset open source as well?
"fine-tuned on images of spectrograms paired with text"
How many paired training images / text and what was the source of your training data? Just curious to know how much fine tuning was needed to get the results and what the breadth / scope of the images were in terms of original sources to train on to get sufficient musical diversity.
The audio sounds a bit lossy. Would it be possible to create high quality spectrograms from music, downsample them, and use that as training data for a spectrogram upscaler?
It might be the last step this AI needs to bring some extra clarity to the output.
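The training pairs for such an upscaler could be produced by degrading high-quality spectrograms; a toy sketch of the idea using block-averaging (purely illustrative, not anyone's actual pipeline):

```python
import numpy as np

def make_sr_pair(hi_spec, factor=2):
    """Build a (low-res, high-res) training pair for a spectrogram upscaler
    by block-averaging the high-quality spectrogram along both axes."""
    f, t = hi_spec.shape
    hi = hi_spec[:f - f % factor, :t - t % factor]  # crop to a multiple of factor
    lo = hi.reshape(hi.shape[0] // factor, factor,
                    hi.shape[1] // factor, factor).mean(axis=(1, 3))
    return lo, hi
```

A super-resolution model trained on many such pairs could then be run on the diffusion model's output before audio reconstruction.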
This is amazing! This is a fantastic concept generator. The verisimilitude with specific composers and techniques is more than a little uncanny. A few thoughts after exploring today…
- My strongest suggestion is finding some strategy for smoothing over the sometimes harsh-sounding edge of the sample window
- Perhaps it could be filling in/passing over segments of what is sounded to user as a larger loop? Both giving it a larger window to articulate things but maybe also showcasing the interpolation more clearly…
- Tone control may seem challenging but I do wonder if you couldn’t “tune” the output of the model as a whole somehow (given the spectrogram format it could be a translation/scale knob potentially?)
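The first suggestion (smoothing the harsh edge of the sample window) could be prototyped with an equal-power crossfade that blends the clip's tail into its head so the loop point stops clicking; a sketch with illustrative names, not the app's actual code:

```python
import numpy as np

def crossfade_loop(clip, fade_len):
    """Blend the last fade_len samples into the first fade_len samples with an
    equal-power crossfade, so the shortened clip loops without a hard edge."""
    t = np.linspace(0.0, 1.0, fade_len)
    head = clip[:fade_len] * np.sqrt(t) + clip[-fade_len:] * np.sqrt(1.0 - t)
    return np.concatenate([head, clip[fade_len:-fade_len]])
```

The result is fade_len samples shorter than the input, and its last sample now leads continuously back into its first.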
When you say fine tuned do you mean fine tuned on an existing stable diffusion checkpoint? If so which?
It would be very interesting to see what the stable diffusion community that is using automatic1111 version would do with this if it were made into an extension.
Super clever idea of course. But leaving aside how it was produced, I’ll be one of those who is underwhelmed by the musicality of this. I am judging this in terms of classical music. I repeatedly tried to get it to just play pure piano music without any other add-ons (cymbals etc). It kept mixing the piano with other stuff.
Also the key question is - would something like this ever produce something as hauntingly beautiful and unique as classical music pieces?
Hayk! How smart are you! I loved your work on SymForce and Skydio - totally wasn't expecting you to be co-author on this!
On a serious note, I'd really love some advice from you on time management and how you get so much done? I love Skydio and the problems you are solving, especially on the autonomy front, are HARD. You are the VP of Autonomy there and yet also managed to get this done! You are clearly doing something right. Teach us, senpai!
Hello - this is awesome work. Like other commenters, I think the idea that if you are able to transfer a concept into a visual domain (in this case via fft) it becomes viable to model with diffusion is super exciting but maybe an oversimplification. With that in mind, do you think this type of approach might work with panels of time series data?
How much data is used for fine tuning? Since spectrograms are (surely?) very out of distribution for the pre-training dataset, how much value does the pre-training really bring?
This is groundbreaking! All other attempts at AI generated music have IMO, fallen flat... These results are actually listenable, and enjoyable! This is almost frightening how powerful this can be
Obviously this needs a little more polish, but I've wanted this for so long I'm willing to pay for it now if it helps push the tech forward. Can I give you money?
What sort of setup do you need to be able to fine tune Stable Diffusion models? Are there good tutorials out there for fine tuning with cloud or non-cloud GPUs?
This really is unreasonably effective. Spectrograms are a lot less forgiving of minor errors than a painting. Move a brush stroke up or down a few pixels, you probably won't notice. Move a spectral element up or down a bit and you have a completely different sound. I don't understand how this can possibly be precise enough to generate anything close to a cohesive output.
Author here: We were blown away too. This project started with a question in our minds about whether it was even possible for the stable diffusion model architecture to output something with the level of fidelity needed for the resulting audio to sound reasonable.
Wasn't this Fraunhofer's big insight that led to the development of MP3? Human perception actually is pretty forgiving of perturbations in the Fourier domain.
This is a genius idea. Using an already-existing and well-performing image model, and just encoding input/output as a spectrogram... It's elegant, it's obvious in retrospect, it's just pure genius.
I can't wait to hear some serious AI music-making a few years from now.
Some of this is really cool! The 20 step interpolations are very special, because they're concepts that are distinct and novel.
It absolutely sucks at cymbals, though. Everything sounds like realaudio :) composition's lacking, too. It's loop-y.
Set this up to make AI dubtechno or trip-hop. It likes bass and indistinctness and hypnotic repetitiveness. Might also be good at weird atonal stuff, because it doesn't inherently have any notion of what a key or mode is?
As a human musician and producer I'm super interested in the kinds of clarity and sonority we used to get out of classic albums (which the industry has drifted away from for decades), so the way for this to take over for ME would involve a hell of a lot more resolution in the FFT imagery, especially in the highs, plus a further layer of AI-ification that decides what different parts the song has and controls abrupt switches of prompt.
It could probably do bad modern production fairly well even now :) exaggeration, but not much, when stuff is really overproduced it starts to get way more indistinct, and this can do indistinct. It's realaudio grade, it needs to be more like 128kbps mp3 grade.
This shows me that Stable Diffusion can create anything with the following conditions:
1. Can be represented as a static item in two dimensions (their weaving together notwithstanding, it is still piece-by-piece statically built)
2. Acceptable with a certain amount of lossiness on the encoding/decoding
3. Can be presented through a medium that at some point in creation is digitally encoded somewhere.
This presents a lot of very interesting changes for the near term. ID.me and similar security approaches are basically dead. Chain of custody proof will become more and more important.
Can stable diffusion work across more than two dimensions?
I think there has to be a better way to make long songs...
For example, you could take half the previous spectrogram, shift it to the left, and then use the inpainting algorithm to make the next bit... Do that repeatedly, while smoothly adjusting the prompt, and I think you'd get pretty good results.
And you could improve on this even more by having a non-linear time scale in the spectrograms. Have 75% of the image be linear, but the remaining 25% represent an exponentially downsampled version of history. That way, the model has access to what was happening seconds, minutes, and hours ago (although less detail for longer time periods ago).
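The shift-and-inpaint idea above is mostly bookkeeping around the model call; a sketch where `inpaint` stands in for any image inpainting function (e.g. a Stable Diffusion inpainting pipeline), with illustrative names:

```python
import numpy as np

def extend_spectrogram(spec, inpaint):
    """Shift the spectrogram left by half its width and ask inpaint(canvas, mask)
    to fill the newly exposed right half, continuing the audio in time."""
    half = spec.shape[1] // 2
    canvas = np.zeros_like(spec)
    canvas[:, :half] = spec[:, -half:]   # keep the most recent audio as context
    mask = np.zeros(spec.shape, dtype=bool)
    mask[:, half:] = True                # region the model must generate
    return inpaint(canvas, mask)
```

Calling this repeatedly, while nudging the prompt each step, would yield the sliding-window song generation described above.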
I bet a cool riff on this would be to simply sample an ambient microphone in the workplace and use that to generate and slowly introduce matching background music that fits the current tenor of the environment. Done slowly and subtly enough, I'd bet the listener may not even be entirely aware it's happening.
If we could measure certain kinds of productivity it might even be useful as a way to "extend" certain highly productive ambient environments a la "music for coding".
Producing images of spectrograms is a genius idea. Great implementation!
A couple of ideas that come to mind:
- I wonder if you could separate the audio tracks of each instrument, generate separately, and then combine them. This could give more control over the generation. Alignment might be tough, though.
- If you could at least separate vocals and instrumentals, you could train a separate model for vocals (LLM for text, then text to speech, maybe). The current implementation doesn't seem to handle vocals as well as TTS models.
This opens up ideas. One thing people have tried to do with stable diffusion is create animations. Of course, they all come out pretty janky and gross; you can't get the animation smooth.
But what if a model was trained not on single images, but on animated sequential frames, in sets, laid out on a single visual plane? A panel might show a short sequence of a Disney princess expressing a particular emotion as 16 individual frames collected into a single image. One might then be able to generate a clean animated sequence of a previously unimagined Disney princess expressing any emotion the model has been trained on. Of course, with big enough models one could (if they can get it working) produce text-prompted animations across a wide variety of subjects and styles.
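Laying the 16 frames out as a single training image is straightforward; a toy sketch (names illustrative):

```python
import numpy as np

def frames_to_grid(frames, rows, cols):
    """Tile a list of equally sized frames into a single rows x cols image,
    row-major, so a short animation becomes one training sample."""
    h, w = frames[0].shape[:2]
    grid = np.zeros((rows * h, cols * w) + frames[0].shape[2:], dtype=frames[0].dtype)
    for i, frame in enumerate(frames):
        r, c = divmod(i, cols)
        grid[r * h:(r + 1) * h, c * w:(c + 1) * w] = frame
    return grid
```

The inverse slicing at generation time would recover the frame sequence for playback.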
The vocals in these tracks are so interesting. They sound like vocals, with the right tone, phonemes, and structure for the different styles and languages, but no meaning.
This looks great and the idea is amazing. I tried the prompts "speed metal" and "speed metal with guitar riffs" and got some smooth rock-ballad type music. I guess there was no heavy metal in the learning samples haha.
Fun! I tried something similar with DCGAN when it first came out, but that didn't exactly make nice noises. The conversion to and from Mel spectrograms was lossy (to put it mildly), and DCGAN, while impressive in its day, is nothing like the stuff we have today.
Interesting that it gets so good results with just fine tuning the regular SD model. I assume most of the images it's trained on are useless for learning how to generate Mel spectrograms from text, so a model trained from scratch could potentially do even better.
There's still the issue of reconstructing sound from the spectrograms. I bet it's responsible for the somewhat tinny sound we get from this otherwise very cool demo.
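The tinny sound is largely a phase problem: the spectrogram image stores only magnitudes, so the phase must be re-estimated before audio can come out. The standard fix is Griffin-Lim iteration (libraries like torchaudio ship an implementation); below is a minimal NumPy sketch with illustrative parameter choices, not the project's actual code:

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Hann-windowed short-time Fourier transform, (freq_bins, n_frames)."""
    win = np.hanning(n_fft)
    frames = np.array([win * x[i:i + n_fft]
                       for i in range(0, len(x) - n_fft + 1, hop)])
    return np.fft.rfft(frames, axis=1).T

def istft(spec, n_fft=512, hop=128):
    """Inverse STFT via windowed overlap-add with squared-window normalization."""
    win = np.hanning(n_fft)
    frames = np.fft.irfft(spec.T, n=n_fft, axis=1)
    out = np.zeros(hop * (len(frames) - 1) + n_fft)
    norm = np.zeros_like(out)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + n_fft] += win * f
        norm[i * hop:i * hop + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

def griffin_lim(mag, n_iter=32, n_fft=512, hop=128):
    """Recover a waveform from a magnitude-only spectrogram by iteratively
    re-estimating the missing phase (Griffin & Lim, 1984)."""
    rng = np.random.default_rng(0)
    angles = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        audio = istft(mag * angles, n_fft, hop)
        rebuilt = stft(audio, n_fft, hop)
        angles = rebuilt / np.maximum(np.abs(rebuilt), 1e-8)
    return istft(mag * angles, n_fft, hop)
```

Griffin-Lim only approximates the true phase, which is consistent with the slightly metallic artifacts people are hearing; a neural vocoder would likely sound cleaner.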
Interesting. I experimented a bit with the approach of using diffusion on whole audio files, but I ultimately discarded it in favor of generating various elements of music separately. I'm happy with the results of my project of composing melodies (https://www.youtube.com/playlist?list=PLoCzMRqh5SkFPG0-RIAR8...) and I still think this is the way to go, but that was before Stable Diffusion came out. These are interesting results though; maybe it can lead to something more.
It may be clearer to those of you who are smarter than me, but I guess I've only recently begun to appreciate what these experiments show--that AI graphical art, literature, music and the like will not succeed in lowering the barriers to humans making things via machines but in training humans to respond to art that is generated by machines. Art will not be challenging but designed by the algorithm to get us to like it. Since such art can be generated for essentially no cost, it will follow a simple popularity model, and will soon suck like your Netflix feed.
I’d been wondering (naively) if we’d reached the point where we won’t see any new kinds of music, now that electronic synthesis allows us to make any possible sound. Changes in musical styles throughout history tend to have been brought about by people embracing new instruments or technology.
This is the most exciting thing I’ve seen in ages as it shows we may be on the verge of the next wave of new technology in music that will allow all sorts of weird and wonderful new styles to emerge. I can’t wait to see what these tools can do in the hands of artists as they become more mainstream.
haykmartiros | 3 years ago:
Meanwhile, please read our about page: http://riffusion.com/about
It's all open source and the code lives at https://github.com/hmartiro/riffusion-app. If you have a GPU you can run it yourself.
haykmartiros | 3 years ago:
https://colab.research.google.com/drive/1FhH3HlN8Ps_Pr9OR6Qc...
nico | 3 years ago:
Example prompt: “deep radio host voice saying ‘hello there’”
Kind of like a more expressive TTS?
hanselot | 3 years ago:
Have you already explored doing the same with voice cloning?
blaaaaa99a | 3 years ago:
2 amazing AI projects. Huge respect :)
valdiorn | 3 years ago:
Absolutely blows my mind.
quux | 3 years ago:
Reminds me of the soundtrack to Nier Automata which did a similar thing: https://youtu.be/8jpJM6nc6fE
michpoch | 3 years ago:
Who else will AI leave looking for a new job?
xtracto | 3 years ago:
Great work!
talhof8 | 3 years ago:
Might be a traffic thing?
Edit: Works now. A bit laggy but it works. Brilliant!