For your listening pleasure, here's a full-length demo. I decided to use the Jonathan Coulton classic "Re: Your Brains", because I can legally share and modify his music under its Creative Commons license.
First, the original:
https://mwcampbell.us/tmp/spleeter-demo/jonathan-coulton-re-...
Now the derived stems:
Vocals: https://mwcampbell.us/tmp/spleeter-demo/jonathan-coulton-re-...
Accompaniment: https://mwcampbell.us/tmp/spleeter-demo/jonathan-coulton-re-...
Note: I'm not affiliated with this project or Mr. Coulton. I just think this is a cool project and wanted to share.
While it's a great technology, the result sounds somewhat robotic. On the original recording the voice sounds soft, but after separation it sounds synthesized, as if passed through a vocoder; something is missing. The vocal track also contains traces of the guitar strumming. The guitar sounds "blurred", as if someone had cut an object out of a picture and blurred the edges to hide the cut. The clap sound is distorted too: on the original recording it sounds the same every time, but after separation it sounds different every time, as if it had been filtered or compressed at a low bitrate.
It is amazing how the ear manages to distinguish all the sounds without distortion.
I gave a talk at PyCon this year about DSP [1], specifically some of the complexities surrounding this. I came across a few other ML projects that claimed to do this as well, and the biggest holdup is getting enough properly tagged training data in order to let the models train correctly. In the Git repo of this project they also explicitly state that you need to train on your own data set, though you can use their models if you like. YMMV. I would love to try this out, as it's definitely a complex bit of audio engineering. That said, I loved learning everything I did preparing for my talk, and I need to finish up some other parts of the project to get the jukebox working... Maybe this will help :)
1. https://m.youtube.com/watch?v=fevxy-s0vo0
Seems like most music (from the 70s on, at least) is recorded multi-track, and the data is out there, just not publicly accessible. If you ever watch Rick Beato videos, he takes classic songs and isolates vocal/drum/etc. tracks all the time; I'm not sure how he has access to them: https://www.youtube.com/playlist?list=PLW0NGgv1qnfzb1klL6Vw9...
But you probably don't need to bother with old recordings, since there is SO MUCH music being produced via tracking software right now that it should be possible to get a pretty big dataset - the difference being, of course, professional production that affects how all these things sound in the final mix.
Although... if you have enough songs with separated tracks, couldn't you just recombine tracks and adjust the settings to create a much, much broader base for training? Just a dozen songs could be shuffled around to give you a base of 10,000+ songs easily enough. That might lead to a somewhat brittle result but it would be a decent start.
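A minimal sketch of that remixing idea, assuming each song's stems sit in separate, sample-aligned WAV files (the file layout and gain ranges here are made up):

    import random
    import numpy as np
    import soundfile as sf  # pip install soundfile

    def remix(stem_paths):
        """Mix aligned stem WAVs into one track with randomized per-stem gain."""
        stems = [sf.read(p)[0] for p in stem_paths]
        n = min(len(s) for s in stems)             # truncate to the shortest stem
        mix = sum(random.uniform(0.5, 1.5) * s[:n] for s in stems)
        return mix / max(1.0, np.abs(mix).max())   # normalize to avoid clipping

    # e.g. the vocal of one song over the rhythm section of another:
    mix = remix(["songA/vocals.wav", "songB/drums.wav", "songB/bass.wav"])
    sf.write("augmented_0001.wav", mix, 44100)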
I'm sure you've thought of this, but could the tracks from the Rock Band games be used for training (or have they been already)?
There are thousands of them and they're separated into different instrument tracks. They even had bands re-record songs sometimes where separate masters couldn't be found. If I recall correctly, Third Eye Blind did this for Semi-Charmed Life.
The SNES is a 1990s game console. Its music is generally synthesized by the SPC700 chip, from individual instruments stored in 64 kilobytes of RAM (so the instruments often sound synthetic and muffled). The advantage is that it's possible to separate out instruments.
Either:
- Programmatically gather a list of all samples used in the song
- Generate many modified .spc files, each of which mutes one sample by editing the BRR data.
Or
- Use a modified SPC700 emulator which you can tell to skip playing a specific sample ID.
Record the original song to .wav. Then, for each sample, record "the song with that one sample muted" and take (original song - song with one sample muted) to isolate that sample. If the result is not silent, you have isolated one instrument from the original song.
The results may not always be perfect, and will need manual labeling of instruments, or manual merging of multiple piano instruments. But I think this process would work.
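The subtraction step is trivial once you have the two renders; a sketch assuming sample-aligned, equal-length WAV output from the emulator (file names are placeholders):

    import numpy as np
    import soundfile as sf

    full, sr = sf.read("song_full.wav")            # original render
    muted, _ = sf.read("song_mute_sample07.wav")   # same song, one sample muted

    isolated = full - muted  # whatever the muted render lacks is that instrument
    if np.abs(isolated).max() < 1e-4:
        print("this sample is unused in the song")
    else:
        sf.write("isolated_sample07.wav", isolated, sr)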
This is very timely. I've been working for about 3 months now on a utility that transforms MP3s to MIDI files. It's a hard problem, and even though I'm making steady progress, the end is nowhere in sight. This will give me something to benchmark against, for instance on voice accompanied by piano. Thank you for making/posting this.
For an idea of how this project is coming along:
https://jacquesmattheij.com/toccata.mp3
Yes, it's terrible :) This particular file is the result of the following transformations:
MIDI file -> WAV file (fluidsynth)
WAV file -> MIDI file (my utility)
MIDI file -> WAV file (fluidsynth once more)
WAV file -> MP3 file (lame)
Of course it also works for regular MIDI files (piano only for now). The reason I use the workflow above is that it gives me a good idea of how well the program works, by comparing the original MIDI file with the output one.
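The whole round trip scripts easily with the stock fluidsynth and lame command lines (the soundfont path and the mp3-to-midi utility name below are placeholders):

    import subprocess

    def run(*cmd):
        subprocess.run(cmd, check=True)

    # 1. MIDI -> WAV: render the reference audio
    run("fluidsynth", "-ni", "GeneralUser.sf2", "original.mid",
        "-F", "step1.wav", "-r", "44100")
    # 2. WAV -> MIDI: the transcription under test (hypothetical utility)
    run("mp3-to-midi", "step1.wav", "transcribed.mid")
    # 3. MIDI -> WAV: render the transcription with the same soundfont
    run("fluidsynth", "-ni", "GeneralUser.sf2", "transcribed.mid",
        "-F", "step2.wav", "-r", "44100")
    # 4. WAV -> MP3
    run("lame", "step2.wav", "result.mp3")
    # comparing original.mid against transcribed.mid measures accuracy directly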
But I did not yet have a way to deal with piano plus voice, which is a very common combination, so this might really help me.
Possible applications: automatic music transcription, tutoring, giving regular pianos a MIDI 'out' port, using a regular piano as an arranger keyboard, instrument transformation, and many others.
Having fun!
Edit: I've done a little write-up: https://jacquesmattheij.com/mp3-to-midi/
Just FYI in case you weren't aware - Ableton Live and several other DAWs have this capability built in. It's far from perfect, but great for humming a melody and then quickly turning it into MIDI.
I messed around with the 2-stem model for a bit and it's reasonably good. I think PhonicMind is still a bit better: PhonicMind tends to err on the side of keeping too much, while the 2-stem model tries to isolate aggressively and often damages the vocal as a result (distorting words by losing some harmonics, or losing quiet words entirely).
Example:
https://files.catbox.moe/wjruiv.mp3 (PhonicMind)
https://files.catbox.moe/uuzot3.mp3 (Spleeter 2-stem)
You can hear Spleeter does better at actually taking out the bass drums, but PhonicMind never loses or distorts any part of the vocal, while the 2-stem model occasionally sounds like the singing is coming through a metal tube (harmonics are missing). I'll try to read the instructions more carefully and see if there's some way to fix that.
For those who, like me, hadn't heard of PhonicMind before: it's an online service at https://phonicmind.com/ that charges between $1.50 and $4 per song to separate out vocals, drums, bass, and the rest of the sounds. You can upload any audio file to the website and get a 30-second preview of the separated parts.
An interesting alternative approach for instrument sound separation is to use a fused audio + video model. So, given that you also have video of the instruments being played, you can perform this separation with higher fidelity.
I was fascinated by the work done by "The Sound of Pixels" project at MIT:
http://sound-of-pixels.csail.mit.edu/
Gave this a go; it's an easy install with pip, and results come quickly even on an old MacBook. The split into 2 stems (vocals/accompaniment) on some random songs I chose was actually quite good using the pretrained models provided. Of course, ripping the vocals out of the accompaniment takes out a good chunk of the middle frequencies, so some songs sound a bit wonky.
Worth a play if you are interested.
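For anyone who wants to try the same thing, the whole run is a few lines (the model name follows the project's README; the file paths are placeholders):

    # pip install spleeter
    from spleeter.separator import Separator

    # pretrained 2-stem model, downloaded on first use
    separator = Separator("spleeter:2stems")
    # writes vocals.wav and accompaniment.wav under output/song/
    separator.separate_to_file("song.mp3", "output/")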
Same thoughts here. I ran Thriller, Alligator by Of Monsters and Men, and In Hell I'll Be in Good Company by The Dead South through the 2-, 5-, and 4-stem models, respectively. Impressive results. Definitely agree that some of the middle frequencies show some error.
It's really good on the 2-stem stuff. On the 4-stem model, it's a bit shy about the bass part, and parts drift in and out. I'd like to try it on a FLAC.
One-click process: Xtrax Stems 2 (https://audionamix.com/technology/xtrax-stems/)
Professional: ADX Trax Pro 3 (https://audionamix.com/technology/adx-trax-pro/)
Both products use a server with much larger pre-trained models. The professional one adds features such as sibilance handling, a GUI for editing note-following as a guide for the models, and an editor tool for extraction using harmonics.
(Note: I don't work for this company. I do pay for / use their products, and I also happen to know someone who works there.)
I wonder how it would fare on Pink Floyd's "Sheep", where vocals seamlessly transform into instrumentals and it's impossible to tell where one ends and the other begins. https://www.youtube.com/watch?v=3-oJt_5JvV4 (skip to around 1:40)
I'd love to see how this compares with Celemony Melodyne. As far as I've been able to determine, Melodyne doesn't use ML, but it's hard to find out exactly what it does use.
Either way, an open source competitor to Melodyne is a welcome addition!
There is a patent for Melodyne that describes looking for harmonics versus time in FFTs, then heuristics for deciding which harmonics belong to a single note and where it starts and ends, then assigning some of the residual energy (e.g. the noisy onset) to each note.
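As a toy illustration of the first step only, here is per-frame peak picking on an FFT (the note-grouping heuristics, which are the clever part, are not shown):

    import numpy as np
    from scipy.signal import find_peaks

    def frame_peaks(frame, sr, n_fft=4096):
        """Frequencies and levels (dB) of spectral peaks in one windowed frame."""
        mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n=n_fft))
        db = 20 * np.log10(mag + 1e-12)
        idx, _ = find_peaks(db, height=db.max() - 40)  # peaks within 40 dB of max
        return idx * sr / n_fft, db[idx]

    # a note candidate's harmonics then appear as peaks near k * f0, k = 1, 2, 3, ...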
That's the second time I've seen someone mention Melodyne for separating vocals from a full song source - I don't think that's something it can do? Melodyne is for tuning vocals / instruments & correcting timing on already isolated tracks.
The methodology is a separate U-Net per instrument type that predicts a soft mask in spectrogram space (time x frequency); they then apply that mask to the input audio. Fairly standard.
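The mask-application step looks roughly like this; the mask below is a crude stand-in for what the trained U-Net would actually predict:

    import librosa
    import numpy as np
    import soundfile as sf

    audio, sr = librosa.load("mix.wav", sr=None, mono=True)
    spec = librosa.stft(audio, n_fft=4096, hop_length=1024)  # complex spectrogram

    # placeholder soft mask in [0, 1]; in the real system this comes from the U-Net
    mag = np.abs(spec)
    mask = mag / (mag + np.median(mag) + 1e-12)

    vocals = librosa.istft(mask * spec, hop_length=1024)  # keeps the mixture's phase
    sf.write("vocals.wav", vocals, sr)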
I look forward to the day I can click a button to watch videos online without any unnecessary and distracting background music (though it would be better if video players offered unornamented narration as an option in the first place). The next step after this would be live 'music-cancelling' headphones for the grocery store (if such a thing still exists).
The headphones could filter out speech that isn't above a certain threshold, so coworkers nearby can still be heard loud and clear.
Music could play at volume, then quiet itself when it detects a person speaking directly to you.
Maybe even a training button to tell it when it has false-positively passed through background noise, or false-negatively silenced a coworker you would like to hear.
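Even without any ML, the ducking half is just an envelope follower on a microphone signal; a crude sketch with made-up thresholds:

    import numpy as np

    def music_gain(mic_block, threshold=0.02, duck_to=0.2):
        """Gain for the current music block: ducked when the mic hears
        speech-level energy, full volume otherwise."""
        rms = np.sqrt(np.mean(mic_block.astype(np.float64) ** 2))
        return duck_to if rms > threshold else 1.0

    # per audio callback: out = music_gain(mic) * music
    # (a real version would smooth the gain to avoid clicks and gate on
    # actual speech detection, not raw level)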
This is incredible. I made an example using David Bowie's "Changes". A bit robotic, but even the echo is still present in the vocal track. https://www.youtube.com/watch?v=KPlmrq_rAzQ
Does it work with spoken word as well? My use case: improve podcast quality by extracting the vocals only, and leaving out all background and accidental noise.
I wonder how this compares to Open-Unmix (https://github.com/sigsep/open-unmix-pytorch), which also calls itself state-of-the-art and was done in collaboration with Sony, from what I see of their paper.
Is there anything like this for images? Meaning, essentially, trying to decompose an image back into Photoshop layers. It wouldn't be feasible where something is completely opaquely covered, but I'm thinking of things like recoloring a screen print, etc.
https://soundcloud.com/alezzzz/can-halleluwah-drums-extracte...
Finding drum breaks in music is very time-consuming. This is going to be amazing for music production. Think how 90s jungle would've been if they'd had access to every drum take ever.
The extracted vocals sound great! But the resulting accompaniment tracks I've heard so far (tried on a handful of songs) aren't of usable quality for most purposes where you'd want an instrumental track – they're too sonically mangled.
Since people are often interested in doing this for a handful of specific tracks and not necessarily en masse, I'd be curious about what a human-assisted version of this could look like and whether you really could get near-perfect results...
What if you explicitly selected portions of the track you knew had vocals, so it could (1) know to leave the rest alone and (2) know what the backing track for the specific song naturally sounds like when there's no singing happening? It could try to match that sonic profile more carefully in the vocal-removed version.
Or what if you could give it even more info, and record yourself/another singing (isolated) over the track? Then it would have information about what phonemes it should expect to find and remove (and whatever effects like reverb are applied to them).
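The region-selection idea is easy to prototype on top of any separator: use its output only inside the spans the user marked as vocal, and keep the untouched mix everywhere else (the region list and sample rate below are placeholders):

    import numpy as np

    def splice_regions(mix, vocal_removed, regions, sr=44100):
        """Keep the original mix, swapping in the separator's vocal-removed
        audio only inside user-marked (start_sec, end_sec) regions."""
        out = np.copy(mix)
        for start, end in regions:
            a, b = int(start * sr), int(end * sr)
            out[a:b] = vocal_removed[a:b]
        return out

    # e.g. the user marked two sung passages:
    # instrumental = splice_regions(mix, accompaniment, [(12.0, 45.5), (61.0, 92.0)])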
http://pitchperfected.io
Cool, but your website needs some work. It looks like a landing page to gather interest rather than something backed by a real product. Show us some videos with before-and-after singing, etc.
It would be really cool to create "music mappers" / life soundtracks, like what you can do with pictures and art styles (e.g. https://medium.com/tensorflow/neural-style-transfer-creating...)
https://news.ycombinator.com/item?id=20978055
https://news.ycombinator.com/item?id=21220458