I have a theory: tree-based models require minimal feature engineering. They can handle categorical data in principled ways, they can handle the most skewed/multimodal/heteroskedastic continuous numeric data just as easily as a 0-1 scaled normal distribution, and they are easy to regularize compared to a DL model (which can have untold millions of possible parameter combinations, to say nothing of getting the thing to train to a global optimum).
I think if you spent months getting your data and model structure to a good place, you could certainly get a DL model to outperform a gradient boosted tree. But why do that, when the GBT will be done today?
this is along the lines of my thinking. people organize and summarize data before throwing it into spreadsheets, whereas deep learning models do their thing by generating new representations from raw data.
in a sense, most data in spreadsheets is compressed and deep learning models prefer to find their own compression that best suits the task at hand.
or in human terms: "these spreadsheets are garbage. i can't work with this. can you bring me the raw data please?" :)
> I have a theory - tree based models require minimal feature engineering.
Actually, the whole premise of Deep Learning is to learn proper feature representations from data with minimal data preprocessing. And it works wonderfully in CV and NLP but is less performant on tabular data. The paper indicates several contributing factors behind DL's underperformance.
I think another aspect is that most modern GBT implementations prefer the entire dataset to be in memory, doing a full scan of the data at each iteration to calculate the optimal split point. That's hard to compete with if your batch size is small in an NN model.
I think you’re on the right track that trees are good at feature engineering. But the key problem is that DL researchers are horrible at feature engineering, because they have never had to do it. These folks included.
The feature engineering they do here is absolutely horrible! They use a QuantileTransformer and that's it. They don't even tune the critical hyperparameter, the number of quantiles. Do they always use the scikit-learn default of 1,000 quantiles? No wonder uninformative features are hurting: they are getting expanded into 1,000 even more uninformative features! Also, with a single quantile transform like that, the relative values of the quantiles are completely lost! If the values 86 and 87 fall into different bins, the model has literally no information that the two bins are similar to each other, or even that they come from the same raw input.
For a very large dataset a NN would learn its way around this kind of boneheaded mistake. But at this dataset size, these researchers have absolutely crippled the nets with this thoughtless approach to feature engineering.
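For context, this is roughly what the transform in question looks like in scikit-learn; a minimal sketch on synthetic skewed data (the choice of 100 quantiles is arbitrary, not a recommendation). Note that the output keeps one column per input column, and that scikit-learn silently caps `n_quantiles` at the number of samples:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
X = rng.lognormal(size=(500, 3))  # heavily skewed synthetic features

# The scikit-learn default is n_quantiles=1000; tuning it explicitly
# (here: 100) is the hyperparameter choice the comment is talking about.
qt = QuantileTransformer(n_quantiles=100, output_distribution="normal",
                         random_state=0)
Xt = qt.fit_transform(X)  # same shape as X: one output column per input column
```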
There is plenty more to criticize about their experiments, but it's probably less important. E.g., their HP ranges are too small to allow for the kind of nets that are known to work best in the modern era (after Double Descent theory was worked out): large, heavily regularized nets. They don't let the nets get very big, and they don't let the regularization get nearly big enough.
So: bad comparison. But it's also very true that XGB "just works" most of the time. NNs are finicky and complicated, and very few people really understand them well enough to apply them to novel situations. Those who do are working on fancy AI problems, not writing poor comparison papers like this one.
What are the principled ways that tree based models handle categorical data? If you end up having to do one-hot encoding it feels like you need very wide forests or very deep trees. If your categorical data is actually vaguely continuous then splits can be quite efficient but that’s rare.
I assume some day someone will be able to explain all this in information theoretic terms. I’m never sure if we’re comparing like with like (are the deep learning models we’re comparing against actually that deep, for example?) but clearly there’s something to the intuition that many small overfit models are more efficient than one big general model.
> Results show that tree-based models remain state-of-the-art on medium-sized data (∼10K samples) even without accounting for their superior speed.
Is that really "medium"? That seems very small to me. MNIST has 60,000 samples and ImageNet has millions.
I think the title overstates the findings. I'd be interested to hear how these methods compare on much larger datasets. Is there a threshold at which deep learning outperforms tree-based models?
Edit: They touch on this in the appendix:
> A.2.2 Large-sized datasets
> We extend our benchmark to large-scale datasets: in Figures 9, 10, 11 and 12, we compare the results of our models on the same set of datasets, in large-size (train set truncated to 50,000 samples) and medium-size (train set truncated to 10,000 samples) settings.
> We only keep datasets with more than 50,000 samples and restrict the train set size to 50,000 samples (vs 10,000 samples for the medium-sized benchmark). Unfortunately, this excludes a lot of datasets, which makes the comparison less clear. However, it seems that, in most cases, increasing the train set size reduces the gap between neural networks and tree-based models. We leave a rigorous study of this trend to future work.
Many real-world problems produce data that is decidedly medium: small enough to fit in Excel, large enough to be too big to comfortably handle in Excel.
I've put numerous models into production with millions of tabular data points and a 10^5-10^6 feature space, where tree-based models (or FF nets) outperform more complex DL approaches.
MNIST is not your typical real-world tabular data. Many if not most data science problems out there are still in the range of a few thousand samples, from my perspective (trying to "sell" ML to the average company). From a statistical point of view I would not call these datasets small (you can decently compare two means from subsets without needing a Student's t-distribution).
Assuming the categories are meant to apply to any data sets, anything amenable to machine learning at all is at least medium data. "Small" data would be something like a human trial with n=6 because the length and compliance of the protocol is so onerous. There are entirely different statistical techniques for finding significance in the face of extremely low power.
It's baffling to me how little research attention there has been to improving tree-based methods, considering their effectiveness.
For example, LightGBM and XGBoost allow some regularization terms, but the variance/bias is still mostly controlled by globally setting the max depth and max node count (and then parameter searching to find good settings).
Surely there must be more powerful and sophisticated ways of deciding when to stop building each tree than counting the number of nodes? If this was neural nets there would be a hundred competing papers proposing different methods and arguing over their strengths and weaknesses.
I'm not sure whether the problem is that neural nets are just fundamentally more sexy, or that in order to make SOTA improvements in GBMs you need to dive into some gnarly C++. Probably both.
Why do you think there has been little research attention? Time was, 'machine learning' was little but tree-based methods (and that was how they distinguished themselves from 'AI'). Go look at Breiman's CV or random conference proceedings. Or as tree-based method proponents love to point out, pretty much everyone on Kaggle up until recently used trees for everything non-image-based; that's a ton of effort invested in tweaking trees. And there were hardware efforts to accelerate them (I recall MS talking about how they were investing in FPGAs for MS Azure to run trees better), so 'GPUs' isn't an excuse.
> LightGBM
not improving, meanwhile 8-figure budgets build GPUs and auto-logins..
My take? management agenda to build plug-and-play researchers (humans on jobs), rather than domain specialists. DeepLearning fits that description with all-plumbing, all-the-time.. domain specialists want graduate school, weekends and health benefits..
There are a fair number of papers (start with DART, i.e. dropout for trees, and BART, Bayesian sampling of the whole GBM), but they start to look like global optimization problems, and part of the reason trees work so well is that the local greedy optimization can be made super fast on modern CPU caches.
So even if you can fit a more compact forest that performs well through clever regularization, it's usually better/faster in practice to grow more simple trees with more randomization and let the overfitting average out.
I think part of the problem is that the upper bound on neural nets, as far as we can tell, might very well be general intelligence, and things like self-driving cars, and other nearly magical use-cases that seem within reach. Whereas tree based models, for a series of reasons, many related to scaling, don't offer that feeling of limitless potential.
I like decision trees and this helps support my case for using them. I often go even further: rather than using an algorithm to build the trees, I build them myself along intuitive causal lines and use the data to train their parameters. I sometimes build a few models manually and see which fits the data better.
Prior knowledge can prevent the pitfalls of automatically built models.
Trees may be better than NNs because they overfit less, but you can overfit even less with a bespoke model. For example, I've seen an automatically generated tree, built to tune the efficiency of a factory, end up using "is after a specific date" as a main feature: a machine had been upgraded on that date, so the learning algorithm latched onto that unactionable piece of data as the main predictor for the model.
That one was an easy fix (don't feed timestamp data to the model), but there are lots of more subtle cases like this, and I've seen people spend a great deal of time cleaning and "tweaking" the input data to get the answers they want out of their ML models.
If you have to make your ML model behave by manually selecting what data to feed it, you might as well go all the way and build a clean causal model yourself that reflects your priors and domain knowledge about the subject.
I have an ML background but I often get more performance out of my models by doing something along the lines of what a Bayesian statistician would do.
Of course, with highly dimensional data like pixels in images you have almost no choice but to use NNs. There's no way to hand-build those models.
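That hand-built workflow can be sketched in a few lines; everything here (the features, the tree structure, the thresholds) is made up for illustration. The structure encodes the prior; only the thresholds are fit from data:

```python
import numpy as np

def hand_built_tree(temp, load, t1, t2):
    """A tree whose structure encodes domain knowledge; only the
    thresholds t1 and t2 are parameters to be fit from data."""
    if temp > t1:
        return 1.0
    return 0.6 if load > t2 else 0.1

def fit_thresholds(temps, loads, y, grid=np.linspace(-2, 2, 41)):
    # Brute-force the two thresholds against squared error.
    best = (np.inf, None, None)
    for t1 in grid:
        for t2 in grid:
            pred = np.array([hand_built_tree(a, b, t1, t2)
                             for a, b in zip(temps, loads)])
            err = np.mean((pred - y) ** 2)
            if err < best[0]:
                best = (err, t1, t2)
    return best[1], best[2]
```

A grid search is crude, but with two parameters it is exhaustive and transparent, which is rather the point of the exercise.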
Our lab works on changing this. I think it might still take some years for a full solution, but so far we have had success with NNs for small datasets via meta-learning (https://arxiv.org/abs/2207.01848) and for large datasets via regularisation (https://arxiv.org/abs/2106.11189). The first is very new; the second is cited in this paper, but they didn't run it as a baseline.
However, it fails to cite the highly relevant SRBench paper from 2021, which also carefully curates a suitable set of regression benchmarks and shows that Genetic Programming approaches also tend to be better than deep learning: https://github.com/cavalab/srbench
We use XGBoost as the core learner for reinforcement learning at https://improve.ai despite the popularity of neural networks in academia.
With tabular or nested data a human has already done a lot of work to organize that data in a machine friendly form - much of the feature engineering is performed by the data schema itself.
Yes, I've been looking at using decision trees to explain models that are difficult to understand, and am currently seeing useful results on real data sets. If you're interested, I've implemented parts of TREPAN [1] and it's very approachable. However, it's also important to have interpretable features, which is a whole other thing.
[1] https://research.cs.wisc.edu/machine-learning/shavlik-group/...
It seems to me that one crucial difference between tabular data and images or text is that in the latter there is a huge amount of structure available. In text, all the words depend on their neighboring words, and images tend to be well-approximated by low-rank things in one form or another. Tabular data doesn't have any of that.
Exactly. I get why people like to compare things like NNs and trees, because it's a good way to learn. But it doesn't take much understanding to see that they both have strengths and weaknesses and are suitable for different problems.
I can't explain it, but I help maintain TensorFlow Decision Forests [1] and Yggdrasil Decision Forests [2], and in an AutoML system at work that trains models on lots of different users' data, decision forest models get selected as best (after AutoML tries various model types and hyperparameters) somewhere between 20% and 40% of the time, systematically. It's pretty interesting. Other model types considered are NNs, linear models (with automatic feature-cross generation), and a couple of other variations.
[1] https://github.com/tensorflow/decision-forests [2] https://github.com/google/yggdrasil-decision-forests
Super interesting! Do you know what kind of data it's usually used for? And in the remaining 60% to 80%, do NNs account for a large portion of the best models?
Bonus question: are the stats you're mentioning publicly available?
I think one of the issues is that there are no pretrained universal spreadsheet models (that I'm aware of; granted, I don't do much work with tabular data) equivalent to ImageNet-based models that you can use as a base and then transfer-learn on top of.
An important point is that it's an absolute pain in the ass to preprocess tabular data for neural networks.
Categorical > one-hot encoding > deal with new categories at test time (sklearn does this, but it's really slow and clunky)
Numerical > either figure out the distribution of each column and normalize by that, or normalize everything by z-score. Found an outlier? Oops, every feature collapsed to 0
Can you do that for 10 features? Sure. Now try it again with 500; it's not fun
Ok, now that you've done all that, you can begin training and possibly get some reasonable result. Compare that with tree models: data > model > results
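The dance can at least be centralized in a single scikit-learn pipeline; a hedged sketch (the column names and data are hypothetical):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],  # categorical
    "size": [1.0, 2.5, 3.0, 100.0],            # numeric, with an outlier
})

pre = ColumnTransformer([
    # handle_unknown="ignore" emits all-zero dummies for unseen test
    # categories instead of crashing.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["color"]),
    # Plain z-scoring lets the outlier dominate the scale; a robust or
    # quantile scaler would blunt it.
    ("num", StandardScaler(), ["size"]),
])
Xt = pre.fit_transform(df)  # 3 one-hot columns + 1 scaled numeric column

test = pd.DataFrame({"color": ["purple"], "size": [2.0]})
Xtest = pre.transform(test)  # unseen "purple" becomes all zeros, no crash
```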
It's all about features and data scale. Recommendation systems are essentially large tables, and DL methods have already proved effective there. And say you have text in your tabular data: a tree model (with traditional features such as tf-idf) will do much worse than a transformer-based model. DL always suffers from inadequate data, though, so when there isn't enough data or inductive bias, a tree model can be the better choice.
Well, IIRC a convenient trait of a random forest classifier is that it can't overfit the training data. Something that's not exactly true for deep learning.
Any reference for this claim? In my opinion this is most certainly not the case - random forests are hard to overfit compared to gradient boosted trees but you can overfit with them too if you don't tune your parameters right.
Overfitting is generally a function of the size of your data and the complexity expressible in your model.
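The disagreement is easy to probe: a random forest will happily drive training error to zero on pure noise (memorization, the textbook sense of overfitting), even though its test error just sits at chance rather than degrading. A quick sketch:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = rng.integers(0, 2, size=300)  # labels are pure noise

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
train_acc = rf.score(X, y)  # near 1.0: the forest memorizes the noise
test_acc = rf.score(rng.normal(size=(300, 10)),
                    rng.integers(0, 2, size=300))  # near 0.5: nothing learned
```

Whether you call "perfect train accuracy, chance test accuracy" overfitting or graceful memorization is exactly the semantic split in this subthread.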
and random forests here: https://mlu-explain.github.io/random-forest/
It’s also worth noting that a recentish paper shows neural networks can perform well on tabular data if well-regularized: https://arxiv.org/abs/2106.11189v1?utm_source=jesper&utm_med...
Neural nets have far more parameters and so are susceptible to overfitting with more training time.
I can represent an image as a table of RGB values. I can represent hierarchical data as a table of unnested values.