top | item 39832200

kastnerkyle | 1 year ago

The direct counter-argument to "worst representation" is usually "representation with the fewest assumptions", and the waveform as shown here gets close. Recording environment, equipment, how the sound actually gets digitized, etc. also come into play, but there are relatively few assumptions in the "waveform" setup described here.
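To make the "fewest assumptions" point concrete, here is a minimal sketch (all parameters are illustrative, not from the post): a raw waveform carries essentially one assumption, the sampling rate, while a spectrogram front end bakes in window length, hop size, window shape, and a magnitude-only view.

```python
import numpy as np

# Hypothetical 1-second signal standing in for a recorded waveform.
sr = 16000
t = np.arange(sr) / sr
wave = 0.5 * np.sin(2 * np.pi * 440 * t)  # "raw" input: just samples

# A spectrogram front end adds DSP assumptions the raw waveform does not:
# window length, hop size, windowing function, discarding phase.
def stft_mag(x, n_fft=512, hop=128):
    window = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * window
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

spec = stft_mag(wave)
print(wave.shape)  # (16000,) -- one assumption baked in: the sampling rate
print(spec.shape)  # (122, 257) -- frame count and bin count follow from
                   # the chosen n_fft and hop, i.e. from DSP assumptions
```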

I would say that in the neural network literature at large, and in audio modeling in particular, there is a continual back and forth: pushing DSP-based knowledge into neural nets, on the architecture side or the data side, versus going "raw-er" to force models to learn their own versions of DSP-style transforms. This see-saw has been and will continue to be driven by performance on benchmarks with certain goals in mind, as we try to find what works best.

These types of push-pull movements also dominate computer vision (where many of the "correct" DSP approaches fell away to less rigid, learned proxies) and language modeling (tokenization is hardly "raw", and byte-based approaches to date lag behind smart tokenization strategies). I think every field that approaches learning from data will see similar swings over time.
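The tokenization point can be shown in a toy comparison (the example string and whitespace "tokenizer" are illustrative stand-ins; real subword tokenizers like BPE are learned): the byte-level view has a tiny vocabulary but long sequences, while tokenization trades a huge vocabulary for short sequences.

```python
# Toy comparison: byte-level "raw" text vs a word-ish tokenization.
# Whitespace splitting stands in for a learned subword tokenizer.
text = "tokenization is hardly raw"

byte_seq = list(text.encode("utf-8"))  # long sequence, tiny vocabulary (256)
token_seq = text.split()               # short sequence, huge vocabulary

print(len(byte_seq), len(token_seq))  # 26 4
```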

CCD bitstreams are also not "raw", so people will continue to push down in representation while making bigger datasets and models, and the rollercoaster will continue.

gorkish | 1 year ago

Yours is the best response to my comment so far.

I very much enjoy the observation that LLMs appear to function optimally when trained on "tokens" rather than the pure, unfiltered stream of characters. I think I am ultimately attempting to express an analogous belief: the individual audio samples here are as meaningless as individual letters are to an LLM.

Instead of "representation with the fewest assumptions", I would suggest that the optimal input for a model may be the representation where the data is broken apart as far as it can be while still remaining meaningful. I have suggested in other replies that this is perhaps achieved with quadrature samples, or perhaps even with something like a granular decomposition -- something akin to a "token" of audio instead of language.
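One way to picture the granular idea is a sketch like the following (grain length and hop are my own illustrative choices, not anything proposed in the thread): chop the waveform into short, overlapping, windowed grains, so each grain, rather than each sample, plays the role of a "token" of audio.

```python
import numpy as np

# Sketch of a granular decomposition: overlapping windowed grains as
# audio "tokens". grain_len and hop are illustrative assumptions.
def granulate(x, grain_len=256, hop=64):
    window = np.hanning(grain_len)
    starts = range(0, len(x) - grain_len + 1, hop)
    return np.stack([x[s:s + grain_len] * window for s in starts])

sr = 8000
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 220 * t)  # stand-in waveform

grains = granulate(wave)
print(grains.shape)  # (122, 256): 122 grains, each a 256-sample "token"
```

Each row is one grain; a model could then operate on a sequence of grains rather than a sequence of single samples, much as an LLM operates on tokens rather than characters.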