top | item 21431071

Spleeter: Extract voice, piano, drums, etc. from any music track

1460 points | dsr12 | 6 years ago | github.com | reply

175 comments

[+] mwcampbell|6 years ago|reply
For your listening pleasure, here's a full-length demo. I decided to use the Jonathan Coulton classic "Re Your Brains", because I can legally share and modify his music under its Creative Commons license.

First, the original:

https://mwcampbell.us/tmp/spleeter-demo/jonathan-coulton-re-...

Now the derived stems:

Vocals: https://mwcampbell.us/tmp/spleeter-demo/jonathan-coulton-re-...

Accompaniment: https://mwcampbell.us/tmp/spleeter-demo/jonathan-coulton-re-...

Note: I'm not affiliated with this project or Mr. Coulton. I just think this is a cool project and wanted to share.

[+] M4v3R|6 years ago|reply
Wow, I've listened to several attempts at this over the years, but this one is waaay better than anything I've heard. It's almost perfect.
[+] jesuslop|6 years ago|reply
Holy cow! The separation is sort of perfect! Thanks for the demo.
[+] savrajsingh|6 years ago|reply
Did it just work or did you have to supply something beyond the original JC track?
[+] ropable|6 years ago|reply
They need to link these on the project readme as a demo.
[+] codedokode|6 years ago|reply
While it's a great technology, the result sounds somewhat robotic. On the original recording the voice sounds soft, but after separation it sounds synthesized, as if passed through a vocoder; something is missing. The vocal track also contains fragments of the guitar strumming. The guitar sounds "blurred", as if someone cut an object out of a picture and blurred the edges to hide the cut. The clap sound is distorted: on the original recording it sounds the same each time, but after separation it sounds different every time, as if it was filtered or compressed at a low bitrate.

It is amazing how the ear manages to distinguish all the sounds without distortion.

[+] noja|6 years ago|reply
So Jonathan Coulton is now the new Suzanne Vega?
[+] voicedYoda|6 years ago|reply
I gave a talk at PyCon this year about DSP [1], specifically some of the complexities surrounding this. I came across a few other ML projects that claimed to do this as well, and the biggest holdup is getting enough properly tagged training data to let the models train correctly. In the git repo of this project they also explicitly state you need to train on your own data set, though you can use their models if you like. YMMV. I would love to try this out, as it's definitely a complex bit of audio engineering. That said, I loved learning everything I did preparing for my talk and need to finish up some other parts of the project to get the jukebox working... Maybe this will help :)

1. https://m.youtube.com/watch?v=fevxy-s0vo0

[+] lubujackson|6 years ago|reply
Seems like most music (from the 70s on at least) is recorded multi-track, and the data is out there, just not accessible to anybody. If you ever watch Rick Beato videos, he takes classic songs and isolates vocal/drum/etc. tracks all the time; I'm not sure how he has access to them: https://www.youtube.com/playlist?list=PLW0NGgv1qnfzb1klL6Vw9...

But you probably don't need to bother with old recordings since there is SO MUCH music being produced via tracking software right now I feel like it should be possible to get a pretty big dataset - the difference being, of course, professional production that affects how all these things sound in the final mix.

Although... if you have enough songs with separated tracks, couldn't you just recombine tracks and adjust the settings to create a much, much broader base for training? Just a dozen songs could be shuffled around to give you a base of 10,000+ songs easily enough. That might lead to a somewhat brittle result but it would be a decent start.
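The recombination idea can be sketched in a few lines. Everything below (stem arrays, gain ranges) is synthetic toy data standing in for real stems, not any actual dataset:

```python
import itertools
import numpy as np

# Toy sketch of remix augmentation: pair the vocal stem of one song with the
# accompaniment of another, at random gains, to synthesize many
# (mixture, target-stem) training examples from a handful of songs.
rng = np.random.default_rng(42)
n_songs, n_samples = 4, 1000
vocals = rng.standard_normal((n_songs, n_samples))   # stand-in vocal stems
accomp = rng.standard_normal((n_songs, n_samples))   # stand-in accompaniments

pairs = []
for v, a in itertools.product(range(n_songs), repeat=2):
    gain_v, gain_a = rng.uniform(0.5, 1.0, size=2)   # random gain augmentation
    mix = gain_v * vocals[v] + gain_a * accomp[a]
    pairs.append((mix, gain_v * vocals[v]))          # (input mix, target vocal)

# 4 songs already yield 16 mixtures; a dozen songs with varied gains and time
# offsets get into the thousands, though all drawn from the same few timbres.
```

The brittleness worry is visible in the sketch: the combinations multiply, but the underlying timbres don't, so a model trained this way has seen far less variety than the pair count suggests.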

[+] TheRealSteel|6 years ago|reply
I'm sure you've thought of this, but could/have the tracks from the Rock Band games be used for training?

There are thousands of them and they're separated into different instrument tracks. They even had bands re-record songs sometimes when separate masters couldn't be found. If I recall correctly, Third Eye Blind did this for Semi-Charmed Life.

[+] _fbpt|6 years ago|reply
The SNES is a 1990s game console. Its music is generally synthesized by the SPC700 chip, from individual instruments stored in 64 kilobytes of RAM (so the instruments often sound synthetic and muffled). The advantage is that it's possible to separate out instruments.

Either:

- Programmatically gather a list of all samples used in the song

- Generate many modified .spc files, each of which mutes one sample by editing the BRR data.

Or

- Use a modified SPC700 emulator which you can tell to skip playing a specific sample ID.

Record the original song to .wav. Then, for each sample, record "the song with that one sample muted" and take (original song − one-sample-muted version) to isolate that sample. If the result is not silent, you have isolated one instrument from the original song.

The results may not always be perfect, and will need manual labeling of instruments, or manually merging together multiple piano instruments. But I think this process will work.
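The subtraction step amounts to a few lines. This sketch assumes a perfectly linear mix (the SPC700's echo and gain effects make real renders only approximately linear), with random noise standing in for three samples:

```python
import numpy as np

# Stand-ins for three SPC700 instrument channels.
rng = np.random.default_rng(0)
drums = rng.standard_normal(1000)
bass = rng.standard_normal(1000)
lead = rng.standard_normal(1000)

original = drums + bass + lead        # full song render
without_drums = bass + lead           # render with the drum sample muted

isolated = original - without_drums   # recovers the drum channel
assert np.allclose(isolated, drums)
```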

[+] jacquesm|6 years ago|reply
This is very timely. I've been working for about 3 months now on a utility that transforms mp3s into MIDI files. It's a hard problem, and even though I'm making steady progress the end is nowhere in sight. This will give me something to benchmark against, for instance with voice accompanied by piano. Thank you for making/posting this.

For an idea how this project is coming along:

https://jacquesmattheij.com/toccata.mp3

Yes, it's terrible :) This particular file is the result of the following transformations:

midi file -> wav file (fluidsynth)

wav file -> midi file (my utility)

midi file -> wav file (fluidsynth once more)

wav file -> mp3 file (using lame)

Of course it also works for regular midi files (piano only for now). The reason why I use the workflow above is that it gives me a good idea how well the program works by comparing the original midi file with the output one.
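The hard part of that roundtrip is the wav → midi step. As a toy illustration of only the simplest monophonic case (this is not jacquesm's actual method), one can FFT a frame and map the dominant frequency to a MIDI note number:

```python
import numpy as np

def freq_to_midi(f):
    # MIDI note number for frequency f, with A4 = 440 Hz = note 69.
    return int(round(69 + 12 * np.log2(f / 440.0)))

sr = 44100
t = np.arange(sr) / sr
frame = np.sin(2 * np.pi * 440.0 * t)        # one second of a pure A4

spectrum = np.abs(np.fft.rfft(frame))
freqs = np.fft.rfftfreq(frame.size, 1 / sr)
peak = freqs[np.argmax(spectrum)]            # dominant frequency

assert freq_to_midi(peak) == 69              # A4 -> MIDI note 69
```

Real piano audio is polyphonic, with simultaneous notes whose harmonics overlap, which is exactly why the full transcription problem is so much harder than this sketch.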

But I did not yet have a way to deal with piano/voice which is a very common combination so this might really help me.

Possible applications: automatic music transcription, tutoring, giving regular pianos a midi 'out' port, using a regular piano as an arranger keyboard, instrument transformation and many others.

Having fun!

Edit: I've done a little write-up: https://jacquesmattheij.com/mp3-to-midi/

[+] IAmGraydon|6 years ago|reply
Just FYI in case you weren't aware - Ableton Live and several other DAWs have this capability built in. It's far from perfect, but great for humming a melody and then quickly turning it into MIDI.
[+] czr|6 years ago|reply
messed around with the 2stem model for a bit and it's reasonably good. I think phonicmind is still a bit better - phonicmind tends to err on the side of keeping too much, while the 2stem model tries to isolate aggressively and often damages the vocal as a result (distorting words by losing some harmonics, or losing quiet words entirely)

example:

https://files.catbox.moe/wjruiv.mp3 (phonicmind)

https://files.catbox.moe/uuzot3.mp3 (spleeter 2stem)

You can hear spleeter does better at actually taking out the bass drums, but phonicmind never loses or distorts any part of the vocal, while 2stem occasionally sounds like the singing is coming through a metal tube (harmonics are missing). Will try to read the instructions more carefully and see if there's some way to fix it.

[+] roryokane|6 years ago|reply
For those who, like me, hadn’t heard of PhonicMind before, it’s an online service at https://phonicmind.com/ that charges between $1.50 and $4 per song to separate out vocals, drums, bass, and the rest of the sounds. You can upload any audio file to that website and get a 30-second preview of the separated parts.
[+] lreichold|6 years ago|reply
An interesting alternative approach for instrument sound separation is to use a fused audio + video model. So, given that you also have video of the instruments being played, you can perform this separation with higher fidelity.

I was fascinated by the work done by “The Sound of Pixels” project at MIT.

http://sound-of-pixels.csail.mit.edu/

[+] renaudg|6 years ago|reply
That’s quite clever but not really practical: the instruments heard in most music produced today aren’t "played" by humans.
[+] ooobo|6 years ago|reply
Gave this a go; it's an easy install with pip, and results come back pretty quickly even on an old MacBook. The 2-stem splits (vocals/accompaniment) on some random songs I chose were actually quite good using the pretrained models provided. Of course, ripping the vocals out of the accompaniment takes out a good chunk of the middle frequencies, so some songs sound a bit wonky. Worth a play if you are interested.
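For anyone who wants to reproduce this, the invocation is short. This matches the README at the time; `some_song.mp3` is a placeholder, and the flags may differ in newer releases:

```shell
pip install spleeter
# 2-stem split (vocals / accompaniment) using the pretrained model;
# the model weights are downloaded on first run.
spleeter separate -i some_song.mp3 -p spleeter:2stems -o output
# stems land in output/some_song/vocals.wav and accompaniment.wav
```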
[+] tomrod|6 years ago|reply
Same thoughts here. I ran Thriller, Alligator by Of Monsters and Men, and In Hell I'll be in Good Company by The Dead South on the 2 / 5 / 4 stems, respectively. Impressive results. Definitely agree that some of the middle frequencies show some error.

It would be really cool to create "music mappers"/life sounds tracks like what you can do with pictures & art styles (e.g. https://medium.com/tensorflow/neural-style-transfer-creating...)

[+] sehugg|6 years ago|reply
It's really good on the 2-stem stuff. On the 4-stem model, it's a bit shy about the bass part, and parts drift in and out. I'd like to try it on a FLAC.
[+] lrobinovitch|6 years ago|reply
Same, Rage Against The Machine - Killing In the Name came out sounding great. Very cool.
[+] iamchrisle|6 years ago|reply
Some non-open-source products that also separate vocals from music, if you need something more "professional":

One-click process: Xtrax Stems 2 (https://audionamix.com/technology/xtrax-stems/)

Professional: ADX Trax Pro 3 (https://audionamix.com/technology/adx-trax-pro/)

Both products use a server with much larger pre-trained models. The professional one has added features such as sibilance handling, a GUI to edit note following as a guide for the models, and an editor tool for extraction using harmonics.

(Note: I don't work for this company. I do pay for / use their products, and I also happen to know someone who works there.)

[+] xamuel|6 years ago|reply
I wonder how it would fare on Pink Floyd's "Sheep", where vocals seamlessly transform into instrumentals and it's impossible to tell where one ends and the other begins. https://www.youtube.com/watch?v=3-oJt_5JvV4 (skip to around 1:40)
[+] Intermernet|6 years ago|reply
I'd love to see how this compares with Celemony Melodyne. As far as I've been able to determine, Melodyne doesn't use ML, but it's hard to find out exactly what it does use.

Either way, an open source competitor to Melodyne is a welcome addition!

[+] dspig|6 years ago|reply
There is a patent for Melodyne that describes looking for harmonics vs. time in FFTs, then heuristics for deciding which harmonics belong to one note and where it starts and ends, then assigning some of the residual energy (e.g. the noisy onset) to each note.
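The first step (peak picking plus harmonic grouping) can be roughed out in numpy. The signal, threshold, and tolerance below are made up for illustration; the heuristics in the actual patent are far more involved:

```python
import numpy as np

# One FFT frame containing a "note" at 200 Hz with harmonics at 400/600 Hz,
# plus an unrelated 330 Hz component that should not be grouped with it.
sr = 8000
t = np.arange(sr) / sr
signal = (np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)
          + 0.3 * np.sin(2 * np.pi * 600 * t) + 0.4 * np.sin(2 * np.pi * 330 * t))

spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(signal.size, 1 / sr)
peaks = freqs[spectrum > 0.1 * spectrum.max()]        # crude peak picking

f0 = peaks.min()                                      # candidate fundamental
harmonics = [f for f in peaks
             if abs(f / f0 - round(f / f0)) < 0.02]   # near-integer multiples
# harmonics -> [200.0, 400.0, 600.0]; the 330 Hz peak is rejected
```

Tracking which peaks persist across consecutive frames (the "vs time" part) is what turns these per-frame groups into notes with start and end times.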
[+] SyneRyder|6 years ago|reply
That's the second time I've seen someone mention Melodyne for separating vocals from a full song source - I don't think that's something it can do? Melodyne is for tuning vocals / instruments & correcting timing on already isolated tracks.
[+] matchagaucho|6 years ago|reply
I’ve always assumed Melodyne uses FFT bins.
[+] bravura|6 years ago|reply
Is the paper, "Spleeter: A Fast And State-of-the Art Music Source Separation Tool With Pre-trained Models", available yet? What is the methodology?
[+] davidy123|6 years ago|reply
I look forward to a day I can click a button to watch videos online without any unnecessary and distracting background music (though it would be better if there were an option and precedent to offer unornamented narrative in video players). The next step after this would be to have live 'music cancelling' headphones for the grocery store (if such a thing still exists).
[+] swagasaurus-rex|6 years ago|reply
Wow. Office background noise mute.

The headphones can filter out speech that isn't above a certain threshold. Coworkers nearby can be heard loud and clear.

Music can play at volume then quiet itself when it detects a person speaking directly to you.

Maybe even a training button to tell it when it has let background noise through as a false positive, or silenced a coworker you would like to hear.

[+] iagooar|6 years ago|reply
Does it work with spoken word as well? My use case: improve podcast quality by extracting the vocals only, and leaving out all background and accidental noise.
[+] ssttoo|6 years ago|reply
Neither free nor open source, but you can try a plugin called iZotope RX for this purpose.
[+] cma|6 years ago|reply
Is there anything like this for images? Meaning essentially trying to decompose back into photoshop layers. Wouldn't be feasible for lots of stuff that is completely opaquely covering something, but I'm thinking for things like recoloring a screen print, etc.
[+] alez|6 years ago|reply
Tried it on “Halleluwah” by CAN, had to hear those drums:

https://soundcloud.com/alezzzz/can-halleluwah-drums-extracte...

Finding drum breaks in music is very time consuming. This is gonna be amazing for music production. Think how 90s jungle would’ve been if they had access to every drum take ever

[+] smrq|6 years ago|reply
Wow, this is the isolated track I never knew I needed to hear.
[+] exogen|6 years ago|reply
The extracted vocals sound great! But the resulting accompaniment tracks I've heard so far (tried on a handful of songs) aren't of usable quality for most purposes where you'd want an instrumental track – they're too sonically mangled.

Since people are often interested in doing this for a handful of specific tracks and not necessarily en masse, I'd be curious about what a human-assisted version of this could look like and whether you really could get near-perfect results...

What if you explicitly selected portions of the track you knew had vocals, so it could (1) know to leave the rest alone and (2) know what the backing track for the specific song naturally sounds like when there's no singing happening? It could try to match that sonic profile more carefully in the vocal-removed version.
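One crude version of that idea is classic spectral subtraction (this is not what Spleeter does): average the magnitude spectrum over the user-marked vocal-free region, then subtract that profile from every frame while keeping the mixture's phase. A minimal sketch on toy sinusoids:

```python
import numpy as np

def avg_spectrum(frames):
    # Average magnitude spectrum over user-selected vocal-free frames.
    return np.mean([np.abs(np.fft.rfft(f)) for f in frames], axis=0)

def subtract_backing(frame, profile):
    # Subtract the learned backing magnitude, keep the mixture's phase.
    spec = np.fft.rfft(frame)
    mag = np.maximum(np.abs(spec) - profile, 0.0)     # floor at zero
    return np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=frame.size)

# Toy check: with a perfectly stationary "backing", the residual is the "vocal".
n = 1000
i = np.arange(n)
backing = np.sin(2 * np.pi * 10 * i / n)
vocal = 0.7 * np.sin(2 * np.pi * 30 * i / n)
profile = avg_spectrum([backing, backing])            # vocal-free section
residual = subtract_backing(backing + vocal, profile)
```

Real accompaniment is nowhere near stationary, which is why plain spectral subtraction produces the watery artifacts people associate with old vocal removers; the per-song profile would at best be a hint for a learned model, as suggested above.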

Or what if you could give it even more info, and record yourself/another singing (isolated) over the track? Then it would have information about what phonemes it should expect to find and remove (and whatever effects like reverb are applied to them).

[+] BeeBoBub|6 years ago|reply
I am working on a product that makes use of this technology: I generate vocal pitch visualizations for karaoke.

http://pitchperfected.io

[+] noja|6 years ago|reply
Cool - but your website needs some work. It looks like a landing page to gather interest rather than something backed by a real product. Show us some videos and singing, before and after, etc.
[+] mh-|6 years ago|reply
FYI your email confirmation is going straight to spam on gmail. I'd recommend reaching out to Mailchimp.
[+] gdsdfe|6 years ago|reply
Wait, the implications of this are huge for electronic music DJs.
[+] tomrod|6 years ago|reply
Audio Neural Transfer Learning could be amazing.
[+] exikyut|6 years ago|reply
The audio (^F soundcloud) sounds a little warbly... if that can be largely mitigated, then yes, remixes will never be the same