item 42377804

cochlear | 1 year ago

I think I'm beginning to wrap my head around the way modern, "deep" state-space models (e.g. Mamba, S4, etc.) leverage polynomial multiplication to speed up very long convolutions.

I'm curious whether there are other methods for approximating long convolutions that are well-known or widely used, outside of overlap-add and overlap-save. I'm in the audio field and interested in learning long _FIR_ filters to describe the resonances of physical objects, like instruments or rooms. Block-based, fixed-frame-size approaches reign supreme, of course, but have their own issues in terms of windowing artifacts, etc.

I'm definitely aware that multiplication in the (complex) frequency domain is equivalent to convolution in the time domain and that, because of the fast Fourier transform, this can yield increased efficiency. However, this still results in storing a lot of gradient information that my intuition tells me (possibly incorrectly) is full of redundancy and waste.
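For concreteness, that equivalence can be checked in a few lines (a NumPy sketch, with `np.convolve` as the direct time-domain reference):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(512)   # input signal
h = rng.standard_normal(128)   # FIR filter

# Direct time-domain convolution: O(N*M)
direct = np.convolve(x, h)

# FFT-based convolution: O(L log L), with L >= len(x) + len(h) - 1
# to avoid circular wraparound
L = len(x) + len(h) - 1
nfft = 1 << (L - 1).bit_length()   # next power of two
fast = np.fft.irfft(np.fft.rfft(x, nfft) * np.fft.rfft(h, nfft), nfft)[:L]

assert np.allclose(direct, fast)
```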

Stateful, IIR, or auto-regressive approaches are _one_ obvious answer, but this changes the game in terms of training and inference parallelization.

A couple ideas I've considered, but have not yet tried, or looked too deeply into:

- First performing PCA in the complex frequency domain, reducing the number of point-wise multiplications required. Without some additional normalization up-front, it's likely this would be equivalent to downsampling/low-pass filtering and performing the convolution there. The learnable filter bank would live in the PCA space, reducing the overall number of learned parameters.

- A Compressed Sensing inspired approach, where we sample a sparse, random subset of points from both signals and recover the full result based on the assumption that both the convolver and convolvee(?) are sparse in the Fourier domain. This one is pretty half-baked.
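A minimal sketch of the first idea. The corpus of example filter spectra here is hypothetical (just random FIRs), and all names are illustrative; the point is only that the learnable parameters shrink from a full spectrum to a handful of PCA coefficients:

```python
import numpy as np

rng = np.random.default_rng(1)
n_fft, n_examples, n_components = 256, 64, 16

# Hypothetical corpus of example filter spectra (rfft of known FIRs)
spectra = np.fft.rfft(rng.standard_normal((n_examples, n_fft)), n_fft, axis=-1)

# PCA via SVD on the mean-centered complex spectra
mean = spectra.mean(axis=0)
_, _, Vh = np.linalg.svd(spectra - mean, full_matrices=False)
basis = Vh[:n_components]          # (n_components, n_fft // 2 + 1)

# A "learnable" filter now lives in the low-dimensional PCA space
coeffs = rng.standard_normal(n_components) + 1j * rng.standard_normal(n_components)
H = mean + coeffs @ basis          # reconstructed full spectrum
# (a real-world version would constrain the DC and Nyquist bins to be real)

x = rng.standard_normal(n_fft)
y = np.fft.irfft(np.fft.rfft(x, n_fft) * H, n_fft)   # circular convolution
```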

I'd love to hear about papers you've read, or thoughts you've had about this problem.
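As a toy version of the compressed-sensing idea, here is the easy special case where the filter is exactly sparse in the Fourier domain with *known* support. Real CS would have to recover an unknown support (e.g. via l1 minimization), but this shows why a sub-sampled spectrum can suffice: only the bins on the filter's support contribute to the product.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 256
support = rng.choice(n // 2 + 1, size=8, replace=False)

# Filter that is exactly 8-sparse in the Fourier domain (toy assumption)
H = np.zeros(n // 2 + 1, dtype=complex)
H[support] = rng.standard_normal(8) + 1j * rng.standard_normal(8)

x = rng.standard_normal(n)
X = np.fft.rfft(x)

# Sample X only on the support, multiply there, zero-fill the rest --
# no full pointwise product needed
Y = np.zeros_like(H)
Y[support] = X[support] * H[support]
y_sub = np.fft.irfft(Y, n)

# Reference: full pointwise product in the frequency domain
y_full = np.fft.irfft(X * H, n)
assert np.allclose(y_sub, y_full)
```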

touisteur | 1 year ago

FFT convolution with overlap-save can have very low intermediate storage (none on GPU with cuFFTDx, for example). And most of the time the IFFT doesn't have to happen right away; a lot of processing can still be performed in the frequency domain.
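For reference, a minimal overlap-save implementation (NumPy sketch; the block size is arbitrary, it just has to exceed the filter length), checked against direct convolution:

```python
import numpy as np

def overlap_save(x, h, block=256):
    """Long convolution as a stream of small FFTs (overlap-save)."""
    m = len(h)
    assert block > m, "block size must exceed the filter length"
    hop = block - m + 1                  # valid output samples per block
    H = np.fft.rfft(h, block)
    # prepend m-1 zeros so every block carries its history
    xp = np.concatenate([np.zeros(m - 1), x])
    out = []
    for start in range(0, len(x), hop):
        seg = xp[start:start + block]
        seg = np.pad(seg, (0, block - len(seg)))
        y = np.fft.irfft(np.fft.rfft(seg) * H, block)
        out.append(y[m - 1:])            # discard the circularly wrapped part
    return np.concatenate(out)[:len(x)]

rng = np.random.default_rng(3)
x, h = rng.standard_normal(5000), rng.standard_normal(101)
assert np.allclose(overlap_save(x, h), np.convolve(x, h)[:len(x)])
```

The intermediate state per block is just one FFT-sized buffer, which is why GPU implementations can keep it entirely in registers/shared memory.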

Having each of the ~18k CUDA cores of an L40S perform small 128-point FFTs, and with very little synchronization or overlap manage long filters... is pretty efficient by itself.

There's a lot happening in the HPC world on low-rank (what you're intuiting with PCA), sparse, and tiled operations. I have a hard time applying all of it to 'simple' signal processing, and most of it lacks nice APIs.

I've seen lots of interesting work on 'irregular FFT' codes aimed at reducing the storage needed for FFT intermediate results, sometimes through multi-resolution tricks.

Look up Capon filters and adaptive filtering in general; there's a whole world of tricks there too. You might need a whole lot of SVDs and matrix inversions there...

But mostly, if you're on a GPU there's a wealth of parallelism to exploit to work around the 'memory-bound' limits of FFT-based convolution. This thesis https://theses.hal.science/tel-04542844 has some discussion and numbers on the topic. Not complete, but inspiring.

wbl | 1 year ago

The gradient information in backprop can be computed similarly to the forward pass, I think. Certainly the FFT blocks are linear, so it comes down to the pointwise multiplication, which is pretty compact.
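That intuition checks out numerically: for a circular convolution computed as irfft(X · H), the gradient with respect to h is just one more FFT product with a conjugated spectrum, so the backward pass costs the same as the forward pass (NumPy sketch with a finite-difference check):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 128
x, h, g = (rng.standard_normal(n) for _ in range(3))  # g = upstream gradient

def circ_conv(a, b):
    """Circular convolution via pointwise product in the frequency domain."""
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n)

# Forward: y = x (*) h, scalar loss L = <g, y>.
# Backward w.r.t. h: dL/dh is the circular correlation of g with x,
# i.e. another FFT product, with x's spectrum conjugated.
grad_h = np.fft.irfft(np.fft.rfft(g) * np.conj(np.fft.rfft(x)), n)

# Finite-difference check on one coordinate (L is linear in h,
# so the difference quotient is exact up to floating point)
eps, k = 1e-6, 7
hp = h.copy()
hp[k] += eps
num = (g @ circ_conv(x, hp) - g @ circ_conv(x, h)) / eps
assert abs(num - grad_h[k]) < 1e-3
```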