top | item 42099997

loubbrad | 1 year ago

I didn't see it referenced directly anywhere in this post. However, for those interested, automatic music transcription (i.e., audio->MIDI) is actually a decently sized subfield of deep learning and music information retrieval.

There have been several successful models for multi-track music transcription - see Google's MT3 project (https://research.google/pubs/mt3-multi-task-multitrack-music...). In the case of piano transcription, accuracy is nearly flawless at this point, even for very low-quality audio:

https://github.com/EleutherAI/aria-amt

Full disclaimer: I am the author of the above repo.

Earw0rm | 1 year ago

He's trying to solve a second (also hard-ish) problem as well: deriving an accurate musical score from MIDI data. It's a "sounds easy but isn't" problem, especially since audio-to-MIDI transcribers are great at pitch and onset times but rather less reliable at duration and velocity.
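A toy sketch (illustrative only, not any real transcriber's logic) of one reason MIDI -> score is hard: snapping expressive onset times to a score grid can merge notes that were clearly distinct in the performance.

```python
# Hypothetical helper: quantize expressive onset times (in seconds)
# to a sixteenth-note grid, as a naive score-engraving step might.
def quantize_onsets(onsets_sec, bpm=120, division=4):
    """Snap each onset to the nearest 1/division-of-a-beat grid line."""
    beat = 60.0 / bpm        # seconds per quarter note
    grid = beat / division   # sixteenth notes when division=4
    return [round(t / grid) * grid for t in onsets_sec]

# Two notes played 30 ms apart land on the same grid line, so the
# resulting score can no longer tell them apart.
print(quantize_onsets([0.50, 0.53], bpm=120, division=4))  # [0.5, 0.5]
```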

loubbrad | 1 year ago

I agree that the audio->score and MIDI->score problems are quite hard. There has been research in this area too; however, it is far less developed than audio->MIDI.

bravura | 1 year ago

I know the reported scores of MT3 are very good, but have you had success with using it yourself?

https://replicate.com/turian/multi-task-music-transcription

I ported their Colab to a Replicate runtime so I could use it more easily.

The MIDI output is... puzzling?

I've tried feeding it even simple stems and found the output unusable for some tracks, i.e. the MIDI output and audio were not well aligned and there were timing issues. On other audio it seemed to work fine.

loubbrad | 1 year ago

Multi-track transcription has a long way to go before it is seriously useful for real-world applications. Ultimately I think that converting audio into MIDI makes a lot more sense for piano/guitar transcription than it does for complex multi-instrument works with sound effects, etc.

Luckily for me, audio-to-seq approaches do work very well for piano, which turns out to be an amazing way of getting expressive MIDI data for training generative models.
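To illustrate the "audio-to-seq" framing: such models emit the MIDI as a flat stream of event tokens. The vocabulary below is an assumption for illustration, not aria-amt's actual tokenizer.

```python
# Hypothetical event tokenizer: serialize notes into the kind of flat
# token stream a seq2seq transcription model might be trained to emit.
def to_tokens(notes):
    """notes: (pitch, velocity, onset_sec, offset_sec), sorted by onset."""
    tokens = []
    for pitch, vel, on, off in notes:
        tokens += [f"time_{on:.2f}", f"vel_{vel}", f"on_{pitch}",
                   f"time_{off:.2f}", f"off_{pitch}"]
    return tokens

# Middle C held for half a second at velocity 80:
print(to_tokens([(60, 80, 0.00, 0.50)]))
```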

air217 | 1 year ago

I developed https://pyaar.ai, which uses MT3 under the hood. I realized that continuous string instruments (e.g., guitar), with things like slides and bends, are quite difficult to capture in MIDI. Piano works much better because it's more discrete (the keys abstract away the strings), and so the MIDI file is a better representation.
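A small pure-Python sketch of the discreteness point: piano pitches land exactly on MIDI's integer note numbers, while a pitch sampled mid-slide falls between keys and can only be approximated (e.g., with pitch-bend messages).

```python
import math

def hz_to_midi(f):
    """Convert a frequency in Hz to a (fractional) MIDI note number."""
    return 69 + 12 * math.log2(f / 440.0)

print(round(hz_to_midi(440.0)))  # A4 lands exactly on key 69
print(hz_to_midi(452.9))         # mid-slide pitch: falls between keys 69 and 70
```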

WiSaGaN | 1 year ago

How does the problem simplify when it's restricted to piano?

loubbrad | 1 year ago

Essentially, the leading way to do automatic music transcription is to train a neural network on supervised data, i.e., paired audio-MIDI data. In the case of piano recordings, there is a very good dataset for this task which was released by Google in 2018:

https://magenta.tensorflow.org/datasets/maestro

Most current research involves refining deep learning based approaches to this task. When I worked on this problem earlier this year, I was interested in adding robustness to these models by training a sort of musical awareness into them. You can see a good example of it in this tweet:

https://x.com/loubbrad/status/1794747652191777049