top | item 32720988

ljvmiranda | 3 years ago

> The gap between tree-based models and deep learning becomes narrower as the dataset size increases (here: 10k -> 50k).

I am curious if there is a sample threshold where it's worth exploring deep learning approaches to tabular data. I wonder if there are other considerations (e.g., inference speed, explainability, etc.).

Tenoke|3 years ago

>if there is a sample threshold where it's worth exploring deep learning

Not especially, but there are tasks where DL models occasionally seem to outperform by a little. If you really want to milk extra accuracy, it can be worth trying a DL model; if it performs as well or better, you can ensemble it with your GBM or replace the GBM outright, though it's rarely worth it. If you check tabular-data Kaggle winner writeups, most use GBMs, or an ensemble for a tiny boost over a single GBM.

Assuming limited time to work on the problem, you'd almost always want to focus on further feature engineering first and likely some hyperparameter tuning second.
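The GBM + DL ensemble mentioned above is typically just a weighted blend of the two models' predicted probabilities. A minimal sketch; the function name and the default weight are my assumptions, not anything from this thread:

```python
def blend_probs(gbm_probs, dl_probs, gbm_weight=0.7):
    """Weighted average of two models' predicted probabilities.

    gbm_weight is the GBM's share of the blend; tabular ensembles
    usually favor the stronger single model (often the GBM).
    """
    w = gbm_weight  # hypothetical default; tune on a validation set
    return [w * g + (1 - w) * d for g, d in zip(gbm_probs, dl_probs)]
```

In practice the blend weight is chosen on held-out data, and the blend only helps if the two models' errors are not strongly correlated.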

max_|3 years ago

What's a GBM?

michaelscott|3 years ago

I worked on a little side project doing classification on tabular data, but a really challenging use case where the data was prone to a lot of noise and some randomness in the dependent variable. Tree models couldn't reach high enough accuracy, and when the dataset was under roughly 6k entries, deep learning performed even worse (as expected).

What was really interesting was when the dataset grew past 6k or so: the deep learning model was suddenly more accurate, and by a wide margin! At roughly the 10k mark, the DL model was easily outperforming the tree model.
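The threshold described above can be probed empirically with a learning curve: score each model on growing subsets of the training data and find where one overtakes the other. A model-agnostic sketch; the callables, sizes, and helper names are hypothetical, not from the comment:

```python
def learning_curve(fit_and_score, sizes):
    """Call fit_and_score(n) for each training size n.

    fit_and_score(n) is assumed to train on the first n rows and
    return a held-out accuracy; returns {n: accuracy}.
    """
    return {n: fit_and_score(n) for n in sizes}


def crossover_size(curve_a, curve_b):
    """Smallest training size where model B matches or beats model A.

    curve_a and curve_b map the same sizes to accuracies; returns
    None if B never catches up at the sampled sizes.
    """
    for n in sorted(curve_a):
        if curve_b[n] >= curve_a[n]:
            return n
    return None
```

With curves like the commenter describes, `crossover_size` would land somewhere around the 6k mark; sampling a few sizes per decade is usually enough to see the trend.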

beernet|3 years ago

It depends on the "DL model", which is a highly vague term. Both a model with 10K parameters and a model with 10T parameters fit this description equally well.