I'm working on building models that extract sparse, easy-to-interpret representations of musical audio. The work in this post encodes short segments of music from the MusicNet dataset as a set of events, each with a time of occurrence and a low-dimensional vector capturing the attack envelope and resonances of both the instrument being played and the room in which the performance was recorded. I think this representation could prove superior to current block-coding approaches (fixed frame sizes) and text-based generation models, at least for musicians who need fine-grained control over generated audio.
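To make the representation concrete, here is a minimal sketch of what such a sparse event set could look like in code. The field names and the toy three-parameter decoder (amplitude, frequency, decay) are my own illustrative assumptions, standing in for the learned attack/resonance parameterization the post describes:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class AudioEvent:
    """One sparse event: when it occurs, plus a compact vector
    describing how it sounds (hypothetical parameterization)."""
    time: float          # time of occurrence within the segment, seconds
    params: np.ndarray   # low-dimensional vector; here a toy
                         # (amplitude, frequency_hz, decay_rate) triple

def render(events: list[AudioEvent], duration: float, sr: int = 22050) -> np.ndarray:
    """Decode a sparse event set into audio by summing one damped
    sinusoid per event -- a stand-in for the learned attack-envelope
    and resonance synthesis."""
    out = np.zeros(int(duration * sr))
    for ev in events:
        amp, freq, decay = ev.params
        start = int(ev.time * sr)
        t = np.arange(len(out) - start) / sr  # time since the event onset
        out[start:] += amp * np.exp(-decay * t) * np.sin(2 * np.pi * freq * t)
    return out

# Usage: two "notes" rendered from just five numbers each.
events = [
    AudioEvent(0.10, np.array([0.8, 440.0, 6.0])),
    AudioEvent(0.75, np.array([0.5, 660.0, 4.0])),
]
audio = render(events, duration=2.0)
```

The point of the sketch is the interface, not the synthesis: the entire segment is described by a handful of (time, vector) pairs rather than a fixed grid of frames, which is what makes the representation both sparse and directly editable.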