
Long Convolutions via Polynomial Multiplication

46 points | cochlear | 1 year ago | hazyresearch.stanford.edu

13 comments


touisteur|1 year ago

For long convolutions on parallel architectures via FFT, there's a lot of performance to be gained from overlap-add or overlap-save schemes, especially on GPUs with cuFFTDx, which brings a whole new set of device primitives to the table. It's also worth looking at (or getting inspiration from) tcFFT, which uses Tensor Cores for FFT and actually increases throughput on lots of convolution workloads.
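For readers unfamiliar with the scheme being referenced: overlap-add splits the input into blocks, convolves each block via FFT, and sums the overlapping tails. A minimal NumPy sketch (function name and block size are illustrative, not from any library):

```python
import numpy as np

def overlap_add_conv(x, h, block=256):
    """Linear convolution of x with FIR filter h via overlap-add FFT blocks."""
    m = len(h)
    nfft = 1 << (block + m - 1).bit_length()  # FFT size >= block + m - 1, no circular wrap
    H = np.fft.rfft(h, nfft)                  # filter spectrum, computed once
    out = np.zeros(len(x) + m - 1)
    for start in range(0, len(x), block):
        seg = x[start:start + block]
        y = np.fft.irfft(np.fft.rfft(seg, nfft) * H, nfft)
        # each block's linear conv has length len(seg) + m - 1; tails overlap and add
        out[start:start + len(seg) + m - 1] += y[:len(seg) + m - 1]
    return out

x = np.random.randn(10_000)
h = np.random.randn(500)
ref = np.convolve(x, h)
assert np.allclose(overlap_add_conv(x, h), ref)
```

Each block's FFT is small and independent, which is exactly what makes the scheme attractive on parallel hardware.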

cochlear|1 year ago

I think I'm beginning to wrap my head around the way modern, "deep" state-space models (e.g. Mamba, S4, etc.) leverage polynomial multiplication to speed up very long convolutions.

I'm curious whether there are other well-known or widely used methods for approximating long convolutions, beyond overlap-add and overlap-save. I'm in the audio field and interested in learning long _FIR_ filters to describe the resonances of physical objects, like instruments or rooms. Block-based, fixed-frame-size approaches reign supreme, of course, but have their own issues in terms of windowing artifacts, etc.

I'm definitely aware that multiplication in the (complex) frequency domain is equivalent to convolution in the time domain, and that, thanks to the fast Fourier transform, this can yield increased efficiency. However, this still results in storing a lot of gradient information that my intuition tells me (possibly incorrectly) is full of redundancy and waste.
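The equivalence being described can be checked in a few lines of NumPy; the key detail is zero-padding to the full linear-convolution length so the circular convolution the FFT computes matches the linear one:

```python
import numpy as np

x = np.random.randn(1024)
h = np.random.randn(1024)
n = len(x) + len(h) - 1  # linear-convolution length; pad FFTs to avoid circular wrap
fft_conv = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(h, n), n)
assert np.allclose(fft_conv, np.convolve(x, h))
```

This is O(n log n) versus O(n^2) for the direct sum, which is the efficiency gain in question.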

Stateful, IIR, or auto-regressive approaches are _one_ obvious answer, but this changes the game in terms of training and inference parallelization.

A couple ideas I've considered, but have not yet tried, or looked too deeply into:

- First performing PCA in the complex frequency domain, reducing the number of point-wise multiplications that must occur. Without some additional normalization up-front, it's likely this would be equivalent to downsampling/low-pass filtering and performing the convolution there. The learnable filter bank would live in the PCA space, reducing the overall number of learned parameters.

- A Compressed Sensing inspired approach, where we sample a sparse, sub-sampled random set of points from both signals and recover the full result, assuming both convolver and convolvee (?) are sparse in the Fourier domain. This one is pretty half-baked.
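The first idea might be sketched roughly as follows. This is a hypothetical illustration, not an established method: the filter bank is parameterized by PCA coefficients over example spectra, so only `k` values are learned instead of `n` taps (all names and shapes here are made up for the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 4096, 32                           # filter length, number of PCA components
examples = rng.standard_normal((200, n))  # stand-in for measured room/instrument IRs
spectra = np.fft.rfft(examples, axis=1)   # complex spectra, shape (200, n//2 + 1)

# PCA via SVD on the centered spectra; rows of Vh are principal directions
mean = spectra.mean(axis=0)
_, _, Vh = np.linalg.svd(spectra - mean, full_matrices=False)
basis = Vh[:k]                            # (k, n//2 + 1) complex basis

coeffs = rng.standard_normal(k)           # the k learnable parameters
H = coeffs @ basis + mean                 # reconstruct a full filter spectrum
h = np.fft.irfft(H, n)                    # back to an n-tap FIR filter
```

In a training loop, gradients would flow through `coeffs` only, which is the parameter-count reduction the idea is after; whether the reduced basis loses the resonances that matter is the open question.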

I'd love to hear about papers you've read, or thoughts you've had about this problem.

almostgotcaught|1 year ago

I don't understand this reasoning. Depending on the target (GPU/DSP/FPGA), going to the frequency domain for convolution made some amount of sense when FFT primitives were highly optimized relative to conventional conv or matmul implementations. But now we're like 10 years into the software/hardware arms race and the conv/matmul kernels are just as highly optimized. In addition, the hardware has adapted too.

> using Tensor Cores for FFT

Why would I do this when I could just directly use tensor cores for matmul...? We have MMA, WMMA, WGMMA, etc and they all target tensor cores explicitly.
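The point here is that convolution lowers directly to matmul. A sketch of the im2col formulation, which is roughly what implicit-GEMM convolution kernels hand to the tensor cores (the function name is illustrative):

```python
import numpy as np

def conv1d_as_matmul(x, h):
    """1-D 'valid' convolution phrased as a matmul over sliding windows."""
    m = len(h)
    # im2col: each row is one window of x, shape (len(x) - m + 1, m)
    windows = np.lib.stride_tricks.sliding_window_view(x, m)
    return windows @ h[::-1]  # flip the kernel for true convolution

x = np.random.randn(1000)
h = np.random.randn(9)
assert np.allclose(conv1d_as_matmul(x, h), np.convolve(x, h, mode="valid"))
```

Once the problem is in this form, MMA/WMMA/WGMMA-style instructions apply without any FFT detour.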

human_llm|1 year ago

Also known as the Z-transform in digital signal processing.