
Why do tree-based models still outperform deep learning on tabular data?

315 points | isolli | 3 years ago | arxiv.org

139 comments

[+] CapmCrackaWaka|3 years ago|reply
I have a theory - tree-based models require minimal feature engineering. They are capable of handling categorical data in principled ways, they can handle the most skewed/multimodal/heteroskedastic continuous numeric data just as easily as a 0-1 scaled normal distribution, and they are easy to regularize compared to a DL model (which could have untold millions of possible parameter combinations, let alone the difficulty of getting the thing to train to a global optimum).

I think if you spent months getting your data and model structure to a good place, you could certainly get a DL model to outperform a gradient boosted tree. But why do that, when the GBT will be done today?
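The "handles skewed data just as easily" claim can be made concrete with a tiny sketch (function names are mine, not from any library): a greedy split search produces the same partition whether a heavily skewed feature is fed in raw or log-transformed, because splits depend only on the ordering of values, not their scale.

```python
import math

def best_split(xs, ys):
    """Return the threshold that minimizes the summed squared error
    of the two leaves produced by splitting at x <= t."""
    def sse(vals):
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)
    best_t, best_score = None, float("inf")
    for t in sorted(set(xs))[:-1]:  # candidate thresholds between observed values
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = sse(left) + sse(right)
        if score < best_score:
            best_t, best_score = t, score
    return best_t

# A heavily skewed feature and its log transform yield the same partition:
xs = [1, 2, 3, 1000, 2000, 4000]
ys = [0, 0, 0, 1, 1, 1]
t_raw = best_split(xs, ys)
t_log = best_split([math.log(x) for x in xs], ys)
partition_raw = [x <= t_raw for x in xs]
partition_log = [math.log(x) <= t_log for x in xs]
```

No scaling, no normalization: any monotone transform of the feature leaves the learned partition unchanged.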

[+] a-dub|3 years ago|reply
this is along the lines of my thinking. people organize and summarize data before throwing it into spreadsheets, whereas deep learning models do their thing by generating new representations from raw data.

in a sense, most data in spreadsheets is compressed and deep learning models prefer to find their own compression that best suits the task at hand.

or in human terms: "these spreadsheets are garbage. i can't work with this. can you bring me the raw data please?" :)

[+] lr1970|3 years ago|reply
> I have a theory - tree based models require minimal feature engineering.

Actually, the whole premise of Deep Learning is to learn proper feature representations from data with minimal preprocessing. It works wonderfully in CV and NLP but is less performant on tabular data. The paper indicates that there are several factors contributing to DL underperforming.

[+] drzoltar|3 years ago|reply
I think another aspect is that most modern GBT implementations keep the entire dataset in memory and do a full scan of the data at each iteration to find the optimal split point. That's hard to compete with if your batch size is small in an NN model.
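For context, the "full scan" is cheap because each feature can be pre-sorted once, after which every candidate threshold is evaluated in O(1) with running sums. A minimal sketch (my own function name, not any library's API):

```python
def best_split_one_pass(xs, ys):
    """One left-to-right sweep over a sorted feature, keeping running
    sums so each candidate split is scored in constant time.
    Leaf SSE is computed as sum(y^2) - (sum(y))^2 / count."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    total = sum(y for _, y in pairs)
    total_sq = sum(y * y for _, y in pairs)
    left_n, left_sum, left_sq = 0, 0.0, 0.0
    best_t, best_score = None, float("inf")
    for i, (x, y) in enumerate(pairs[:-1]):
        left_n += 1
        left_sum += y
        left_sq += y * y
        if x == pairs[i + 1][0]:
            continue  # can't split between equal feature values
        right_n = n - left_n
        right_sum = total - left_sum
        right_sq = total_sq - left_sq
        score = (left_sq - left_sum ** 2 / left_n) \
              + (right_sq - right_sum ** 2 / right_n)
        if score < best_score:
            best_t, best_score = x, score
    return best_t, best_score

t, score = best_split_one_pass([1, 2, 3, 4], [0, 0, 1, 1])
```

A mini-batch NN, by contrast, only ever sees a noisy slice of the data per update.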
[+] oofbey|3 years ago|reply
I think you’re on the right track that trees are good at feature engineering. But the key problem is that DL researchers are horrible at feature engineering, because they have never had to do it. These folks included.

The feature engineering they do here is absolutely horrible! They use a QuantileTransformer and that's it. They don't even tune the critical hyperparameter of the number of quantiles. Do they always use the scikit-learn default of 1,000 quantiles? No wonder uninformative features are hurting - they are getting expanded into 1,000 even more uninformative features! Also, with a single quantile transform like that, the relative values of the quantiles are completely lost! If the values 86 and 87 fall into different bins, the model has literally no information that the two bins are similar to each other, or even that they come from the same raw input.

For a very large dataset, an NN would learn its way around this kind of boneheaded mistake. But for a dataset of this size, these researchers have absolutely crippled the nets with this thoughtless approach to feature engineering.

There is plenty more to criticize about their experiments, but it's probably less important. E.g., their hyperparameter ranges are too small to allow for the kind of nets that are known to work best in the modern era (after double descent theory was worked out): large, heavily regularized nets. They don't let the nets get very big, and they don't let the regularization get nearly big enough.

So: bad comparison. But it's also very true that XGB "just works" most of the time. NNs are finicky and complicated, and very few people really understand them well enough to apply them to novel situations. Those who do are working on fancy AI problems, not writing poor comparison papers like this one.
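To make the quantile complaint concrete, here is a toy rank-based quantile transform (a sketch of the idea only, not scikit-learn's actual implementation); `n_quantiles` sets the resolution of the reference grid that raw values get mapped onto:

```python
import bisect

def fit_quantile_transform(train, n_quantiles=1000):
    """Map a value to the fraction of reference quantiles it exceeds.
    n_quantiles controls the resolution of the mapping
    (scikit-learn's QuantileTransformer defaults to 1000)."""
    qs = sorted(train)
    step = max(1, len(qs) // n_quantiles)
    refs = qs[::step]  # a coarse grid of empirical quantiles
    def transform(v):
        return bisect.bisect_right(refs, v) / len(refs)
    return transform

# With only 3 quantiles the output is very coarse, but order is preserved:
t = fit_quantile_transform([1, 2, 3, 1000, 2000, 4000], n_quantiles=3)
```

With few training samples the grid is sparse, so nearby raw values can land on identical or adjacent outputs, which is exactly where tuning `n_quantiles` would matter.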

[+] ellisv|3 years ago|reply
I agree. The majority of DL layers are about feature engineering, not performing classification.
[+] thom|3 years ago|reply
What are the principled ways that tree-based models handle categorical data? If you end up having to do one-hot encoding, it feels like you need very wide forests or very deep trees. If your categorical data is actually vaguely continuous, then splits can be quite efficient, but that's rare.

I assume some day someone will be able to explain all this in information-theoretic terms. I'm never sure if we're comparing like with like (are the deep learning models we're comparing against actually that deep, for example?), but clearly there's something to the intuition that many small overfit models are more efficient than one big general model.
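One answer to the "principled ways" question: LightGBM, for instance, avoids one-hot encoding by ordering categories by their target statistics, after which a single numeric threshold can carve off any prefix of that ordering. A simplified target-mean version of the idea (function name and data are hypothetical, and real implementations use gradient statistics plus smoothing):

```python
from collections import defaultdict

def target_order(categories, targets):
    """Order categories by mean target and return integer codes.
    Treating the codes as a numeric feature lets one split separate
    any low-target group of categories from the rest."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    ordered = sorted(means, key=means.get)
    return {c: i for i, c in enumerate(ordered)}

codes = target_order(["red", "blue", "red", "green", "blue", "green"],
                     [1, 0, 1, 0.5, 0, 0.5])
```

So a tree needs one split on the encoded feature, not one indicator column per category.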

[+] jb_s|3 years ago|reply
do you reckon it's possible to somehow transfer-learn from a GBT to an NN?
[+] Permit|3 years ago|reply
> Results show that tree-based models remain state-of-the-art on medium-sized data (∼10K samples) even without accounting for their superior speed.

Is that really "medium"? That seems very small to me. MNIST has 60,000 samples and ImageNet has millions.

I think the title overstates the findings. I'd be interested to hear how these methods compare on much larger datasets. Is there a threshold at which deep learning outperforms tree-based models?

Edit: They touch on this in the appendix:

> A.2.2 Large-sized datasets

> We extend our benchmark to large-scale datasets: in Figures 9, 10, 11 and 12, we compare the results of our models on the same set of datasets, in large-size (train set truncated to 50,000 samples) and medium-size (train set truncated to 10,000 samples) settings.

> We only keep datasets with more than 50,000 samples and restrict the train set size to 50,000 samples (vs 10,000 samples for the medium-sized benchmark). Unfortunately, this excludes a lot of datasets, which makes the comparison less clear. However, it seems that, in most cases, increasing the train set size reduces the gap between neural networks and tree-based models. We leave a rigorous study of this trend to future work.

[+] beckingz|3 years ago|reply
Many real-world problems that produce data are decidedly medium: small enough to fit in Excel, large enough to be too big to comfortably handle in Excel.
[+] mochomocha|3 years ago|reply
I've put numerous models into production with millions of tabular data points and a 10^5-10^6 feature space, where tree-based models (or FF nets) outperform more complex DL approaches.
[+] riedel|3 years ago|reply
MNIST is not your typical real-world tabular data. Many, if not most, data science problems out there are still in the range of a few thousand samples, from my perspective (trying to "sell" ML to the average company). From a statistical point of view, I would not call the datasets small (you can decently compare two means from subsets without needing Student's t-distribution).
[+] nonameiguess|3 years ago|reply
Assuming the categories are meant to apply to any data sets, anything amenable to machine learning at all is at least medium data. "Small" data would be something like a human trial with n=6 because the length and compliance of the protocol is so onerous. There are entirely different statistical techniques for finding significance in the face of extremely low power.
[+] _pastel|3 years ago|reply
It's baffling to me how little research attention there has been to improving tree-based methods, considering their effectiveness.

For example, LightGBM and XGBoost allow some regularization terms, but the variance/bias is still mostly controlled by globally setting the max depth and max node count (and then parameter searching to find good settings).

Surely there must be more powerful and sophisticated ways of deciding when to stop building each tree than counting the number of nodes? If this were neural nets, there would be a hundred competing papers proposing different methods and arguing over their strengths and weaknesses.

I'm not sure whether the problem is that neural nets are just fundamentally more sexy, or that in order to make SOTA improvements in GBMs you need to dive into some gnarly C++. Probably both.
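For what it's worth, gradient-boosting libraries do already expose one principled stopping rule: XGBoost's split gain subtracts a per-leaf complexity penalty (`gamma`, a.k.a. `min_split_loss`), so a node stops growing when no candidate split improves the regularized loss. A sketch of that gain formula, where g and h are the sums of first and second derivatives of the loss in each child:

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """XGBoost-style split gain: improvement in the regularized
    structure score from splitting a node, minus the per-leaf
    penalty gamma. Split only if the gain is positive."""
    def leaf_score(g, h):
        return g * g / (h + lam)
    return 0.5 * (leaf_score(g_left, h_left)
                  + leaf_score(g_right, h_right)
                  - leaf_score(g_left + g_right, h_left + h_right)) - gamma
```

With `gamma=0` almost any split looks worthwhile; raising `gamma` prunes splits whose loss reduction is below the threshold, which is a data-driven alternative to a hard node-count cap.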

[+] gwern|3 years ago|reply
Why do you think there has been little research attention? Time was, 'machine learning' was little but tree-based methods (and that was how they distinguished themselves from 'AI'). Go look at Breiman's CV or random conference proceedings. Or as tree-based method proponents love to point out, pretty much everyone on Kaggle up until recently used trees for everything non-image-based; that's a ton of effort invested in tweaking trees. And there were hardware efforts to accelerate them (I recall MS talking about how they were investing in FPGAs for MS Azure to run trees better), so 'GPUs' isn't an excuse.
[+] mistrial9|3 years ago|reply
> LightGBM not improving, meanwhile 8-figure budgets build GPU clusters and auto-logins..

My take? A management agenda to build plug-and-play researchers (humans on jobs) rather than domain specialists. Deep learning fits that description with all plumbing, all the time.. domain specialists want graduate school, weekends, and health benefits..

[+] micro_cam|3 years ago|reply
There are a fair number of papers (start with DART, which applies dropout to GBMs, and BART, Bayesian sampling of the whole ensemble), but they start to look like global optimization problems, and part of the reason trees work so well is that the local greedy optimization can be made super fast on modern CPU caches.

So even if you can fit a more compact forest that performs well through clever regularization, it's usually better/faster in practice to grow more, simpler trees with more randomization and let the overfitting average out.

[+] natalyarostova|3 years ago|reply
I think part of the problem is that the upper bound on neural nets, as far as we can tell, might very well be general intelligence, and things like self-driving cars, and other nearly magical use-cases that seem within reach. Whereas tree based models, for a series of reasons, many related to scaling, don't offer that feeling of limitless potential.
[+] jwilber|3 years ago|reply
If you’re interested in how tree-based models work, I wrote an interactive explanation on decision trees here: https://mlu-explain.github.io/decision-tree/

and random forests here: https://mlu-explain.github.io/random-forest/

It’s also worth noting that a recentish paper shows neural networks can perform well on tabular data if well-regularized: https://arxiv.org/abs/2106.11189v1

[+] lnenad|3 years ago|reply
That was super easy to digest, thank you!
[+] isabellat|3 years ago|reply
Really nice interactive explanations!
[+] BenoitEssiambre|3 years ago|reply
I like decision trees, and this helps support my case for using them. I often go even further and don't use an algorithm to build the trees, but instead build the trees myself along intuitive causal lines and use the data to train their parameters. I sometimes build a few models manually and see which fits the data better.

Prior knowledge can prevent the pitfalls of automatically built models.

Trees may be better than NNs because they overfit less, but you can overfit even less with a bespoke model. For example, I've seen an automatically generated tree, built to tune the efficiency of a factory, end up using "is after a specific date" as a main feature, because a machine was upgraded on that date and the learning algorithm latched on to that unactionable piece of data as the main predictor for the model.

That was an easy fix - just don't feed timestamp data to the model - but there are lots of more subtle cases like this, and I've seen people spend a lot of time cleaning and "tweaking" the input data to get the answers they want out of their ML models.

If you have to make your ML model behave by manually selecting what data to feed it, you might as well go all the way and build a clean causal model yourself that reflects your priors and domain knowledge about the subject.

I have an ML background but I often get more performance out of my models by doing something along the lines of what a Bayesian statistician would do.

Of course, with highly dimensional data like pixels in images, you have almost no choice but to use NNs. There's no way to hand-build those models.
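A sketch of the workflow described above, with hypothetical feature names (`machine_upgraded`, `temperature`): the tree structure is written down by hand from domain knowledge, and only the leaf values are fit from data.

```python
from statistics import mean

def leaf_id(row):
    """Hand-chosen structure: split on features we believe are causal,
    in an order we pick ourselves (this structure is illustrative)."""
    if row["machine_upgraded"]:
        return "upgraded"
    return "hot" if row["temperature"] > 30 else "normal"

def fit_leaves(rows, targets):
    """Only the leaf values are learned; the structure stays fixed."""
    groups = {}
    for row, y in zip(rows, targets):
        groups.setdefault(leaf_id(row), []).append(y)
    return {leaf: mean(ys) for leaf, ys in groups.items()}

rows = [{"machine_upgraded": False, "temperature": 20},
        {"machine_upgraded": False, "temperature": 35},
        {"machine_upgraded": True, "temperature": 20}]
model = fit_leaves(rows, [0.5, 0.8, 0.9])
```

Because the splits are fixed a priori, the model cannot latch onto spurious predictors like the upgrade date unless you deliberately include them.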

[+] hackerlight|3 years ago|reply
I want to know more about your process. How do you optimise the split point? How do you choose which feature to split on first?
[+] ersiees|3 years ago|reply
Our lab works on changing this. I think it might still take some years for a full solution, but so far we have had success with NNs for small datasets via meta-learning (https://arxiv.org/abs/2207.01848) and for large datasets via regularisation (https://arxiv.org/abs/2106.11189). The first is very new; the second is cited in this paper, but they didn't run it as a baseline.
[+] isolli|3 years ago|reply
Related: an article from 2019 [0] on how neural networks finally beat statistical models (e.g. ARIMA) in time-series forecasting.

[0] https://towardsdatascience.com/n-beats-beating-statistical-m...

[+] nomel|3 years ago|reply
I can only assume this was achieved years ago, as a black project in the financial market.
[+] mr_toad|3 years ago|reply
I had an econometrics professor who would have objected to calling ARIMA models statistics. He thought they were no better than data mining.
[+] jmmcd|3 years ago|reply
A great paper and an important result.

However, it fails to cite the highly relevant SRBench paper from 2021, which also carefully curates a suitable set of regression benchmarks and shows that genetic programming approaches likewise tend to beat deep learning.

https://github.com/cavalab/srbench

cc u/optimalsolver

[+] orasis|3 years ago|reply
We use XGBoost as the core learner for reinforcement learning at https://improve.ai despite the popularity of neural networks in academia.

With tabular or nested data, a human has already done a lot of work to organize that data in a machine-friendly form - much of the feature engineering is performed by the data schema itself.

[+] wills_forward|3 years ago|reply
Does anyone see explainability as another good reason to use trees on tabular data, where I think users would expect more digestible outputs?
[+] oofbey|3 years ago|reply
The kinds of trees that come out of these algorithms are so huge they really aren’t any more interpretable than an NN.
[+] aimor|3 years ago|reply
Yes, I've been looking at using decision trees to explain models that are difficult to understand, and I'm currently seeing useful results on real datasets. If you're interested, I've implemented parts of TREPAN [1], and it's very approachable. However, it's also important to have interpretable features, which is a whole other thing.

[1] https://research.cs.wisc.edu/machine-learning/shavlik-group/...

[+] marcodiego|3 years ago|reply
Wait until we see deep learning AI creating tree-based models. /s
[+] bigbillheck|3 years ago|reply
It seems to me that one crucial difference between tabular data and images or text is that the latter have a huge amount of structure available. In text, every word depends on its neighboring words, and images tend to be well approximated by low-rank structures in one form or another.

Tabular data doesn't have any of that.

[+] officehero|3 years ago|reply
Exactly. I get why people like to compare things like NNs and trees, because it's a good way to learn. But it doesn't take much understanding to see that they both have strengths and weaknesses and are suitable for different problems.
[+] coffee_am|3 years ago|reply
I can't explain it, but I help maintain TensorFlow Decision Forests [1] and Yggdrasil Decision Forests [2], and in an AutoML system at work that trains models on lots of different users' data, decision forest models get selected as best (after AutoML tries various model types and hyperparameters) somewhere between 20% and 40% of the time, systematically. It's pretty interesting. The other model types considered are NNs, linear models (with automatic feature-crossing generation), and a couple of other variations.

[1] https://github.com/tensorflow/decision-forests [2] https://github.com/google/yggdrasil-decision-forests

[+] onasta|3 years ago|reply
Super interesting! Do you know what kind of data it's usually used for? And in the remaining 60% to 80%, do NNs account for a large portion of the best models?

Bonus question: are the stats you're mentioning publicly available?

[+] kriro|3 years ago|reply
I think one of the issues is that there are no pretrained universal spreadsheet models (that I'm aware of; granted, I don't do much work with tabular data) equivalent to the ImageNet-based models that you can use as a base and then transfer-learn on top of.
[+] fdgsdfogijq|3 years ago|reply
Because tabular data doesn't have enough complexity compared to language or images.
[+] savant_penguin|3 years ago|reply
An important point is that it's an absolute pain in the ass to preprocess tabular data for neural networks.

Categorical > one-hot encoding > deal with new categories at test time (sklearn does this, but it's really slow and clunky)

Numerical > either figure out the data distribution for each column and normalize by that, or normalize everything by z-score. Found an outlier?? Oops, every feature collapses to 0

Can you do that for 10 features? Sure. Now try it again with 500 - it's not fun

Ok, now that you've done all that, you can begin training and possibly get some reasonable result.

Compare that with tree models: data>model>results
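The unseen-category headache above can at least be handled the way scikit-learn's `OneHotEncoder(handle_unknown="ignore")` does: map unknown categories to an all-zeros vector. A minimal pure-Python sketch of that behavior:

```python
def fit_one_hot(train_values):
    """Minimal one-hot encoder that tolerates unseen categories at
    test time by mapping them to an all-zeros vector (mirroring
    scikit-learn's OneHotEncoder(handle_unknown="ignore"))."""
    cats = sorted(set(train_values))
    index = {c: i for i, c in enumerate(cats)}
    def transform(v):
        vec = [0] * len(cats)
        if v in index:  # unseen values simply stay all-zero
            vec[index[v]] = 1
        return vec
    return transform

enc = fit_one_hot(["cat", "dog", "cat"])
```

A tree model, by contrast, needs none of this: it can split directly on category codes.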

[+] VoVAllen|3 years ago|reply
It's all about features and data scale. A recommendation system is itself essentially a large table, and DL methods have already proved effective there. But say you have text in your tabular data: a tree model (with a traditional method such as tf-idf) will do much worse than a transformer-based model. DL always suffers from inadequate data, so if there's not enough data or inductive bias, a tree model can be the better choice.
[+] moffkalast|3 years ago|reply
Well, IIRC a convenient trait of a random forest classifier is that it cannot overfit the training data. Something that's not exactly true for deep learning.
[+] omegalulw|3 years ago|reply
Any reference for this claim? In my opinion this is most certainly not the case - random forests are harder to overfit than gradient boosted trees, but you can overfit with them too if you don't tune your parameters right.

Overfitting is generally a function of the size of your data and the complexity expressible by your model.

[+] taeric|3 years ago|reply
I'm curious about this claim. It feels like any model can overfit.
[+] lupire|3 years ago|reply
Random forests can overfit, but they naturally have fewer parameters than nets, so they don't overfit as a function of training time.

Neural nets have far more parameters and so are susceptible to overfitting with more training time.

[+] melony|3 years ago|reply
If you train your ensemble long enough, won't it overfit too?
[+] LittlePeter|3 years ago|reply
What is tabular data precisely?

I can represent an image as a table of RGB values. I can represent hierarchical data as a table of unnested values.