top | item 40980151


dongobread | 1 year ago

From experience in payments/spending forecasting, I've found that deep learning generally underperforms gradient-boosted tree models. Deep learning models tend to be good at learning seasonality but do not handle complex trends or shocks very well. Economic/financial data tends to have straightforward seasonality with complex trends, so deep learning tends to do quite poorly.

I do agree with this paper - all of the good deep learning time series architectures I've tried are simple extensions of MLPs or RNNs (e.g. DeepAR or N-BEATS). The transformer-based architectures I've used have been absolutely awful, especially the endless stream of transformer-based "foundational models" that are coming out these days.


sigmoid10 | 1 year ago

Transformers are just MLPs with extra steps, so in theory they should be just as powerful. The problem with transformers is simultaneously their big advantage: they scale extremely well with larger networks and more training data, better than any other architecture out there. So if you had enormous datasets and an unlimited compute budget, you could probably do amazing things in this regard as well. But if you're just a mortal data scientist without extra funding, you will be better off with more traditional approaches.

dongobread | 1 year ago

I think what you say is true when comparing transformers to CNNs/RNNs, but not to MLPs.

Transformers, RNNs, and CNNs are all techniques to reduce parameter count compared to a pure-MLP model. If you took a transformer model and replaced each self-attention layer with a linear layer + activation function, you'd have a pure-MLP model that can represent every relationship the transformer does, plus many more, at the cost of far more parameters. MLPs are more powerful but transformers are more parameter-efficient.
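To make the parameter-count trade-off concrete, here is a minimal back-of-the-envelope sketch. The sizes (T=512 tokens, d=768 dimensions) are illustrative assumptions, not numbers from the thread; they compare a single self-attention layer against a dense layer that mixes the entire flattened sequence:

```python
# Hypothetical sizes chosen for illustration only.
T, d = 512, 768  # sequence length, model dimension

# Self-attention: four d x d projection matrices (Q, K, V, output),
# shared across all T positions.
attn_params = 4 * d * d

# A dense (pure-MLP) layer that lets every token interact with every
# other token must act on the flattened sequence of size T*d.
mlp_params = (T * d) * (T * d)

print(attn_params)                 # 2_359_296
print(mlp_params)                  # 154_618_822_656
print(mlp_params // attn_params)   # 65_536x more parameters
```

The dense layer can express any token-mixing pattern, but at these sizes it needs tens of thousands of times more parameters than attention, which is exactly the weight-sharing savings described above.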

Compared to MLPs, transformers save on parameter count by skimping on the number of parameters devoted to modeling the relationships between tokens. This works in language modeling, where relationships between tokens aren't that important - you can jumble up the words in this sentence and it still mostly makes sense. This doesn't work in time series, where relationships between tokens (timesteps) are the most important thing of all. The LTSF paper linked in the OP paper also notes this same problem: https://arxiv.org/pdf/2205.13504 (see section 1)
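The order-insensitivity point can be demonstrated directly: a single self-attention head with no positional encoding is permutation-equivariant, i.e. shuffling the input tokens just shuffles the output rows the same way, so the layer itself carries no notion of timestep order. A minimal numpy sketch with toy dimensions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 8  # toy sequence length and model dimension
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(X):
    # Single-head scaled dot-product attention, no positional encoding.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(d)) @ V

X = rng.standard_normal((T, d))      # T tokens of dimension d
perm = rng.permutation(T)            # an arbitrary reordering of timesteps

out1 = attention(X)[perm]            # attend, then shuffle the outputs
out2 = attention(X[perm])            # shuffle the inputs, then attend

print(np.allclose(out1, out2))       # True: attention can't see token order
```

This is why transformers must inject order via positional encodings, and it is the same permutation-invariance issue the LTSF paper raises for time series, where the ordering of timesteps is the signal.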