
Show HN: Sonauto – A more controllable AI music creator

454 points | zaptrem | 1 year ago | sonauto.ai

Hey HN,

My cofounder and I trained an AI music generation model and after a month of testing we're launching 1.0 today. Ours is interesting because it's a latent diffusion model instead of a language model, which makes it more controllable: https://sonauto.ai/

Others do music generation by training a Vector Quantized Variational Autoencoder like Descript Audio Codec (https://github.com/descriptinc/descript-audio-codec) to turn music into tokens, then training an LLM on those tokens. Instead, we ripped the tokenization part off and replaced it with a normal variational autoencoder bottleneck (along with some other important changes to enable insane compression ratios). This gave us a nice, normally distributed latent space on which to train a diffusion transformer (like Sora). Our diffusion model is also particularly interesting because it is the first audio diffusion model to generate coherent lyrics!
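The two pipelines differ mainly in the bottleneck: VQ snaps latents to discrete codebook tokens (for an LLM), while a plain VAE samples continuous, roughly Gaussian latents (for a diffusion model). A toy numpy sketch of that difference (shapes, codebook size, and names are illustrative, not Sonauto's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def vq_bottleneck(z, codebook):
    # VQ-VAE route: snap each latent vector to its nearest codebook entry,
    # yielding discrete token ids you could train an LLM on.
    dists = np.linalg.norm(z[:, None, :] - codebook[None, :, :], axis=-1)
    tokens = dists.argmin(axis=1)
    return tokens, codebook[tokens]

def vae_bottleneck(mu, logvar):
    # Plain VAE route (what the post describes): sample a continuous latent
    # via the reparameterization trick; its distribution is ~Gaussian, which
    # is a convenient space for a diffusion transformer.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

# Toy encoder outputs: 4 latent timesteps, 8 dims each
mu = rng.standard_normal((4, 8))
logvar = np.full((4, 8), -2.0)
codebook = rng.standard_normal((16, 8))

tokens, quantized = vq_bottleneck(mu, codebook)  # discrete ids -> LLM
z = vae_bottleneck(mu, logvar)                   # continuous latents -> diffusion

print(tokens.shape, z.shape)
```

The quantization step destroys gradient information and caps fidelity at the codebook's resolution; the continuous bottleneck avoids both, at the cost of needing a diffusion-style generator instead of next-token prediction.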

We like diffusion models for music generation because they have some interesting properties that make controlling them easier (so you can make your own music instead of just taking what the machine gives you). For example, we have a rhythm control mode where you can upload your own percussion line or set a BPM. Very soon you'll also be able to generate proper variations of an uploaded or previously generated song (e.g., you could even sing into Voice Memos for a minute and upload that!). @Musicians of HN, try uploading your songs and using Rhythm Control/let us know what you think! Our goal is to enable more of you, not replace you.

For example, we turned this drum line (https://sonauto.ai/songs/uoTKycBghUBv7wA2YfNz) into this full song (https://sonauto.ai/songs/KSK7WM1PJuz1euhq6lS7 - skip to 1:05 if impatient) or this other song I like better (https://sonauto.ai/songs/qkn3KYv0ICT9kjWTmins - we accidentally compressed it with AAC instead of Opus, which hurt quality, though).

We also like diffusion models because while they're expensive to train, they're cheap to serve. We built our own efficient inference infrastructure instead of using the expensive inference-as-a-service startups that are all the rage. That's why we're making generations on our site free and unlimited for as long as possible.

We'd love to answer your questions. Let us know what you think of our first model! https://sonauto.ai/

235 comments

[+] adrianh|1 year ago|reply
I'm interested to hear more about your statement of "Our goal is to enable more of you, not replace you."

Speaking as a musician who plays real instruments (as opposed to electronic production): how does this help me? And how does this enable more of me?

I am asking with an open mind, with no cynicism intended.

[+] zaptrem|1 year ago|reply
If the future of music were truly just typing some text into a box and taking or leaving what the machine gives you, that would be kinda depressing.

We want you to be able to upload recordings of your real instruments and do all sorts of cool things with them (e.g., transform them, generate vocals for your guitar riff, turn the melody into a jazz song, or just get some inspiration for what to add next).

IMO AI alone will never be able to touch hearts like real people do, but people using AI will be able to like never before.

[+] LZ_Khan|1 year ago|reply
Inspiration? You can generate hundreds of ideas in a day. The tracks will not be perfect, but that's where actual musicians can take the ideas/themes from the tracks and perfect them.

In this way it is a tool only useful to expert musicians.

[+] 93po|1 year ago|reply
When Suno came out I spent literally hours/days playing around with it to generate music, and came out with some tracks that are really close to good, good enough that I've gone back to listen to a few. I'd love the tooling to take a premise and be able to tweak it to my liking without spending 1000 hours learning specific software and without thousands of hours learning to play an instrument or learning to sing.
[+] suyash|1 year ago|reply
That is just "marketing speak" to keep you as their customers; they need to make money from users who will be using their service to make music.
[+] whoomp12341|1 year ago|reply
Same thing with AI code writing.

It's a good muse, but I wouldn't trust what it makes out of the gate.

[+] cush|1 year ago|reply
There are a lot of negative comments here, but these are the earliest days, and generating entire songs is kind of the "hello world" of this tech.

There's always going to be a balance between high-level tools like this with no dials and low-level tools with finer control, and while this touts itself as being "more controllable", it's clearly not there yet. But, the same way Adobe has integrated outpainting and generative fill into Photoshop, it's only a matter of time before products like this are built into Ableton and VSTs, where a creator can highlight a bar or two and ask the AI to make the snippet more ethereal, create a bridge between the verse and the sax solo, or help with an outro.

That said, similar to generating basic copy for a marketing site, these tools will be great for generating cheap background music but not much else. Any musician, marketing agency, or film-maker worth their salt is going to need very specifically branded music, and they're likely willing to pay for a real licence to something audiences will recognize, using generative AI tools to remix the content to their specific needs.

[+] boringg|1 year ago|reply
I want to say two things. One, congrats: I am sure your team has been working exceptionally hard to develop this, and the songs sound reasonably good for AI! Two, I am so completely unenthusiastic about AI music infiltrating the music world; all of it sounds like fingernails on a chalkboard. Just mainstream, overproduced, low-quality radio music. I know it's a stepping stone, but it kills me to listen to it right now.
[+] _DeadFred_|1 year ago|reply
80% of music is familiarity, 20% novelty, yet the majority of people's time goes into getting the 80% down so that they can add their 20%.

Look at current music production and compare it to past. Older music seems so much simpler. It was so much easier to come up with that 20% 'novel' when pop/recorded music was new. Ironically I think AI freeing people to focus on that 20% is going to add a lot of creativity to music, not reduce it.

I say this as someone who hates the concept of AI music. I'm actually really excited to see what it enables/creates (but I don't want to use it, even though I really could use it for vocals that I currently pay others to do for me).

I'll be here making my bad knockoffs of bad synth pop bands having fun and taking weeks to do 5% of what kids these days will start off as their entry point, with my 20% creativity ignored because my music sounds 'off' when I can't get the 80% familiar down.

People thought synthesizers were the end of music, yet Switched on Bach begot Jean Michel Jarre begot Kate Bush and on and on.

[+] fennecbutt|1 year ago|reply
I really feel like the popularity of diffusion has made it far too shallow.

Why diffuse an entire track? We should be building these models to create music the same way that humans do, by diffusing samples, then having the model build the song using samples in a proper sequencer, diffuse vocals etc.

The problem with Suno etc. is that, as other people have mentioned, you can't iterate or adjust anything. Saying "make the drums a little punchier and faster paced right after the chorus" is a really tough query to process if you've diffused the whole track rather than built it up.

Same thing with LLM story writing: the writing needs a good foundation, first generating information about the world and its history and then generating a story that takes that stuff into account, vs. a simple "write me a story about x".

[+] zaptrem|1 year ago|reply
I completely agree on the editing aspect. However, if you want to generate five stem tracks, then all five tracks must have the full bandwidth of your autoencoder. Accordingly, each inference or training step would take much more compute for the same result. That's why we'd prefer to generate it all together and split afterwards.
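A toy back-of-envelope for why per-stem generation costs more, assuming transformer self-attention dominates and the latent sequence grows linearly with the number of stems (all numbers below are hypothetical, not Sonauto's):

```python
def attention_flops(seq_len, dim):
    # Self-attention cost scales roughly with seq_len^2 * dim.
    return seq_len ** 2 * dim

latent_len_per_track = 1024  # hypothetical latent sequence length for one track
dim = 512                    # hypothetical model width
stems = 5

mixed = attention_flops(latent_len_per_track, dim)
# Modeling 5 stems jointly means ~5x the latent sequence length...
joint_stems = attention_flops(stems * latent_len_per_track, dim)

# ...so attention cost grows quadratically with stem count.
print(joint_stems / mixed)
```

Under these assumptions, five full-bandwidth stems cost ~25x the attention compute of one mixed track, which is why generating the mix and source-separating afterwards can be the cheaper route.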
[+] saaaaaam|1 year ago|reply
How worried are you about being sued? Seems like your training data probably includes quite a bit of copyright protected stuff. Just listened to the “blue scoobie doo” example and the influences are fairly obvious. With record companies getting super litigious about this, is that a concern? Or did you licence your training data?
[+] garyrob|1 year ago|reply
My hobby is songwriting. (Example: https://www.youtube.com/watch?v=Kjng3UoKkGk)

I play guitar, but I'm not much of a guitarist or singer. I really like songwriting, not trying to be polished as a performer. So I intermittently look into the AI world to see whether it has tools I could use to generate a higher-quality song demo than I could do on my own.

I've been looking for something that could take a chord progression and style instructions and create a decent backing track for a singer to sing over.

But your saying "Very soon you'll also be able to generate proper variations of an uploaded or previously generated song (e.g., you could even sing into Voice Memos for a minute and upload that!)" is very intriguing. I mean, I can sing and play, it just isn't very professional. But if I could then have an AI take what I did and just... make it better... that would be kind of awesome.

In fact, I believe you could have a very big market among songwriters if you could do that. What I would love to see is this:

My guitar parts are typically not just strummed, but involve picking, sometimes fairly intricate. I'm just not that good at it. It would be fantastic to have an AI that would just take what I played and fix it so that it's more perfect.

And then to have a tool where I could say, "OK, now add a bass part," and "OK, now add drums" would be awesome.

[+] LastTrain|1 year ago|reply
That song is quite nice, and so is the performance. It would, IMO, be less good if it were "fixed" to be more perfect.
[+] zaptrem|1 year ago|reply
Awesome to hear this resonates with you! If you join our Discord server I'll ping @everyone when improvements are ready.
[+] dwallin|1 year ago|reply
I think the problem here is the same one as the other current music generation services. Iteration is so important to creativity and right now you can't really properly iterate. In order to get the right song you just spray and pray and keep generating until one that is sufficient arrives or you give up. I know you hint at this being a future direction of development but in my opinion it's a key feature to take these services beyond toys.

I think it's better to think of the process of finding the right song as a search algorithm through the space of all possible songs. The current approach is just "pick a random point in a general area". Once we find something that is roughly correct, we need something that lets us iteratively tweak the aspects that are not quite right, decreasing the search space and allowing us to take smaller and smaller steps in defined directions.
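Diffusion models do admit one mechanism like this: SDEdit-style partial re-noising, where you blend some fresh noise into an existing latent and re-denoise from there, so the noise strength acts as a step-size dial. A toy numpy sketch of just the re-noising half (illustrative only; a real system would run the denoiser afterwards):

```python
import numpy as np

rng = np.random.default_rng(1)

def tweak(latent, strength):
    # Blend fresh noise into the latent; a real diffusion model would then
    # denoise from this point. Smaller strength = smaller step away from
    # the current song; strength ~1.0 = nearly a fresh sample.
    noise = rng.standard_normal(latent.shape)
    return np.sqrt(1 - strength) * latent + np.sqrt(strength) * noise

song = rng.standard_normal(64)    # stand-in for a song latent
near = tweak(song, strength=0.1)  # small, local variation
far = tweak(song, strength=0.9)   # large jump in song space

print(np.linalg.norm(near - song), np.linalg.norm(far - song))
```

Dialing `strength` down is exactly the "smaller and smaller steps" search move: low strength explores the neighborhood of a song you mostly like, high strength restarts the search.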

[+] rcarmo|1 year ago|reply
Nice, but Google login is a no-go for me (or any form of social login, really).
[+] Recursing|1 year ago|reply
Congratulations on the launch!

I was recently really impressed by the state of AI-generated music, after listening to the April Fools LessWrong album https://www.lesswrong.com/posts/YMo5PuXnZDwRjhHhE/lesswrong-... . They claim it took them ~100 hours to generate 15 songs.

Can't wait for the day I can instantly generate a song based on a random blog post or group chat history. This seems like a step in that direction.

[+] disqard|1 year ago|reply
Perhaps not exactly "instantly generate a song based on a random blog post or group chat history", but more like "instantly generate a song based on an input prompt sentence" is suno.ai -- you should check it out!
[+] echelon|1 year ago|reply
This space is going to get very full, very fast. Udio just launched and improves upon "SOTA" Suno. This will just keep coming.

Focus on product. Give actual music producers something they'll find useful. These fad, meme products will compete on edge model capability for 99% of users and ignore serving actual music producers.

I'd like a product with more control, and it doesn't appear Suno or Udio are interested in this.

[+] internet101010|1 year ago|reply
Exactly. As of now, Suno can be used as a template, but you still need to go into a DAW and make it from scratch. So individual tracks for each instrument/vocals that can be exported and brought into a DAW is what is needed. For me, anyway.
[+] mrnotcrazy|1 year ago|reply
I'm not sure it's that they aren't interested; I think it's just really hard.
[+] jsf01|1 year ago|reply
This is ridiculously fun. Congrats on the launch! I took inspiration from "There I Ruined It" and grabbed lyrics from various popular songs to have the AI sing them in the style of other artists. It sometimes took a few attempts, but it honestly did a great job. It got a chuckle out of my friends and family. Also loved that I didn't have to enter a credit card in order to try it out.
[+] ibdf|1 year ago|reply
I was just trying similar apps last week, and I was so frustrated with the number of options and menus to get through before I could generate anything. Not to mention the fact that half of these services ended up asking me to pay per setting. I have to say this was the least painful service to use thus far. Pretty impressive output for so little input.
[+] zaptrem|1 year ago|reply
Thanks! We have lots of fun dials for people who want them but they're all hidden by default and shouldn't be needed.
[+] CuriouslyC|1 year ago|reply
I don't feel like prompt understanding is very good. I don't think I ever really got close to what I wanted with any of the attempts I made. I imagine learning the model tags and building some intuition might help, but I wouldn't bother with that unless I was tinkering with a local model.

Some things it made sounded OK, but I feel like the average generation quality wasn't fantastic. It did a folk guitar melody and a vocoded thrash metal voice that I thought sounded pretty legit, but mostly the vocals had an ear-grating quality and everything had a bit of a low-bitrate vibe.

To be honest though, I don't think you need to try and outcompete Suno. I think you want to get into DAWs and VSTs and become the tool all the best producers in the world use. Spit out stems, and train your model on less processed sounds because things like matching reverb/delay and pre-squashed dynamics are a pain in the ass to work around.

Suno is trying to battle a large established industry that is actually very creator friendly and accessible. If you choose to instead serve that industry and enable it I think that's the winning play.

[+] zaptrem|1 year ago|reply
The vast majority of our time was spent figuring out the model architecture and large-scale distributed training, and step 2 (starting now) is scaling everything up. Prompt understanding and audio quality will get significantly better once we swap in a larger text embedding model.

Thanks for the feedback re: DAWs, though! That would be really cool. Maybe we can tag tracks based on the effects applied to them to allow this to be more controllable.

[+] cchance|1 year ago|reply
Raises the question: given this is diffusion-based, how much of the IP-Adapter/FaceID/ControlNet tech can be brought over? What would an audio FaceID or audio IP-Adapter look like for something like this?
[+] lta|1 year ago|reply
I've tried to look around a little bit but couldn't find anything, so I'll ask here.

Any plans to release the model(s) under an open license?

[+] zaptrem|1 year ago|reply
This would be so cool, but we need to think more about how we could do it and make enough money in the future to train more models with even cooler features.
[+] echelon|1 year ago|reply
All models for all types of content will eventually have open source equivalents. The game is to build a great product.
[+] jedisct1|1 year ago|reply
Please offer alternatives to Google to sign-in.
[+] digging|1 year ago|reply
> Sign in with Google

Well, maybe I'll try out the next AI music creator posted on HN.

[+] pachico|1 year ago|reply
Good luck! I just tried it and the interface was a bit confusing. It only allowed me to fill in the last input in the form, which is a bit counterintuitive.

I presented this prompt: "Noir detective music from the 60s. Low tempo, trumpet and walking bass" and got back a one-note-only song that had nothing to do with the prompt, except for some lyrics that were a bit ridiculous.

This is just feedback; I'm genuinely hoping something like this will surprise me, but I know it's really hard!

Happy to share the song/project/account, if you tell me how to :)

[+] ionwake|1 year ago|reply
I don't know about the scene, but I thought this was great! I was given 3 tracks. I have to say one had no sort of beat to it, so it was like noise, but the other 2 were fantastic. Great stuff!
[+] zaptrem|1 year ago|reply
Thanks! We have a BPM assist that can enforce rhythm as well, so you could try that, too!