top | item 47204964

Decision trees – the unreasonable power of nested decision rules

339 points | mschnell | 12 hours ago | mlu-explain.github.io

62 comments


srean|8 hours ago

A 'secret weapon' that has served me very well for learning classifiers is to first learn a good linear classifier. I am almost hesitant to give this away (kidding).

Use the non-thresholded version of that linear classifier output as one additional feature-dimension over which you learn a decision tree. Then wrap this whole thing up as a system of boosted trees (that is, with more short trees added if needed).

One of the reasons it works so well is that it plays to their strengths:

(i) Decision trees have a hard time fitting linear functions (they have to stair-step a lot, and therefore need many internal nodes), and

(ii) linear functions are terrible where equi-label regions have a recursively partitioned structure.

In the decision tree building process the first cut would usually be on the synthetic linear feature added, which would earn it the linear classifier accuracy right away, leaving the DT algorithm to work on the part where the linear classifier is struggling. This idea is not that different from boosting.

One could also consider different (random) rotations of the data to form a forest of trees built using the steps above, but this was usually not necessary. Or rotate the axes so that all are orthogonal to the linear classifier learned.

One place where DTs struggle is when the features themselves are very (column-)sparse; there are not many places to put the cut.
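
The recipe above can be sketched with scikit-learn. This is a minimal, illustrative setup (the dataset and hyperparameters are made up, not from the comment): fit a linear classifier, append its non-thresholded score as an extra feature, then boost short trees over the augmented features.

```python
# Sketch of the 'linear feature + boosted trees' trick described above.
# Dataset and hyperparameters are purely illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 1. Fit a linear classifier and keep its *non-thresholded* output.
lin = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
score_tr = lin.decision_function(X_tr).reshape(-1, 1)
score_te = lin.decision_function(X_te).reshape(-1, 1)

# 2. Append that score as one additional feature dimension.
X_tr_aug = np.hstack([X_tr, score_tr])
X_te_aug = np.hstack([X_te, score_te])

# 3. Boosted short trees over the augmented features; the first cut
#    typically lands on the synthetic linear feature.
gbt = GradientBoostingClassifier(max_depth=3, n_estimators=100,
                                 random_state=0)
gbt.fit(X_tr_aug, y_tr)
print(gbt.score(X_te_aug, y_te))
```

The boosting step then only has to model what the linear score gets wrong.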

whatever1|2 hours ago

If you think about it, this is what reinforcement learning folks are doing. They use the vanilla state and then lift it to the observed state by doing some additional calculation on the original state data.

For example, you start with the raw coordinates of the snake in a snake game, but you can now calculate how many escape routes the snake has, and train on that.

3abiton|4 hours ago

I think it's worth mentioning that the Achilles heel of DTs is, in fact, data (more specifically, feature) engineering. If one does not spend significant time cleaning and engineering the features, the results will be much worse than those of, say, a "black box" model like a NN. That is the catch. Ironically, a NN can detect such latent features, but it is very difficult to interpret why.

u1hcw9nx|6 hours ago

There are decision trees for what you want to do.

Oblique decision trees, model trees (M5 trees, for example), logistic model trees (LMT), or hierarchical mixtures of experts (HME).

ekjhgkejhgk|7 hours ago

> (ii) linear functions are terrible where equi-label regions have a partitioned structure.

Could you explain what "equi-label regions having a partitioned structure" means?

selimthegrim|1 hour ago

Is this not the heart of the IRM paper by Arjovsky, Bottou et al.?

lokimedes|8 hours ago

When I worked at CERN around 2010, boosted decision trees were the most popular classifier, exactly due to the (potential for) explainability along with their power of expression. We had a cultural aversion to neural networks back then, especially if the model was used directly in physics analysis. Times have changed…

ekjhgkejhgk|4 hours ago

I used to be in physics but theory, not experiment. I have experience at work with decision trees in a different field.

I've always thought that the idea that decision trees are "explainable" is very overstated. The moment you go past a couple of levels in depth, it becomes an un-interpretable jungle. I've actually done the exercise of inspecting how a depth-15 decision tree makes decisions, and I found it impossible to interpret anything.

In a neural network you can also follow the successive matrix multiplications and ReLUs etc. through the layers, but you end up not knowing how the decision is made.

Thoughts?

srean|7 hours ago

> Times have changed…

This makes me a little concerned -- the use of parameter-rich opaque models in physics.

The Ptolemaic system achieved a far better fit of planetary motion (over the Copernican system) because it was a universal approximator. An epicyclic system is a form of Fourier analysis and hence can fit any smooth periodic motion. But epicycles were not the right thing to use to work out the causal mechanics, in spite of being a better fit empirically.

In Physics we would want to do more than accurate curve fitting.

wodenokoto|8 hours ago

Are boosted decision trees the same as a boosted random forest?

fooker|10 hours ago

Fun fact - single bit neural networks are decision trees.

In theory, this means you can 'compile' most neural networks into chains of if-else statements, but it's not well understood when this sort of approach works well.
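
As a toy illustration of the claim (the weights and thresholds below are made up, not from the comment): a network whose activations are hard steps outputs a single bit per unit, so the whole function is equivalent to nested if-else tests on linear inequalities.

```python
# Sketch: a tiny network with step ('single-bit') activations computes
# the same function as nested if-else statements. Weights illustrative.
import numpy as np

W1 = np.array([[1.0, 0.0],   # unit 0 tests x0 > 0.5
               [0.0, 1.0]])  # unit 1 tests x1 > 0.3
b1 = np.array([-0.5, -0.3])
W2 = np.array([1.0, 1.0])    # output fires only if both units fire
b2 = -1.5

def net(x):
    h = (W1 @ x + b1 > 0).astype(float)  # step activation: each unit is 1 bit
    return float(W2 @ h + b2 > 0)

def tree(x):
    # The same function 'compiled' into if-else statements.
    if x[0] > 0.5:
        if x[1] > 0.3:
            return 1.0
        return 0.0
    return 0.0

for x in np.random.RandomState(0).uniform(0, 1, size=(1000, 2)):
    assert net(x) == tree(x)
```

With one-hot rows in the weight matrix each test is axis-aligned, exactly like a decision tree split; with general rows you get oblique splits instead.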

smokel|7 hours ago

> single bit neural networks are decision trees.

I didn't exactly understand what was meant here, so I went out and read a little. There is an interesting paper called "Neural Networks are Decision Trees" [1]. The thing is, this does not imply a nice mapping of neural networks onto decision trees: the trees that correspond to the neural networks are huge. And I get the feeling that the paper is stretching the concept of decision trees a bit.

Also, I still don't know exactly what you mean, so would you care to elaborate a bit? :)

[1] https://arxiv.org/pdf/2210.05189

Almondsetat|9 hours ago

Do you know of any software that does this? Or any papers on the matter? It could be a fun weekend project

zelphirkalt|9 hours ago

Decision trees are great. My favorite classical machine learning algorithm or group of algorithms, as there are many slight variations of decision trees. I wrote a purely functional (kind of naive) parallelized implementation in GNU Guile: https://codeberg.org/ZelphirKaltstahl/guile-ml/src/commit/25...

Why "naive"? Because there is no such thing as NumPy or data frames in the Guile ecosystem to my knowledge, and the data representation is therefore probably quite inefficient.

srean|8 hours ago

What benefit do numpy or dataframes bring to decision tree logic over what is available in Guile already? Honest question.

Guile-like languages are very well suited for decision trees, because manipulating and operating on trees is their mother tongue. The only thing that would be a bit more work is compiling the decision tree into machine code; then one doesn't have to traverse a runtime structure, which is more efficient.
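
A rough sketch of the "compile the tree" idea (in Python with scikit-learn rather than Guile, purely as an illustration): walk the fitted tree's internal arrays and emit straight if-else source instead of traversing the structure at prediction time.

```python
# Sketch of 'compiling' a fitted decision tree into nested if-else code.
# Uses scikit-learn's tree internals; everything here is illustrative.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

def emit(tree, node=0, indent="    "):
    # Recursively print nested if/else for each internal node.
    if tree.children_left[node] == -1:  # leaf: return the majority class
        print(f"{indent}return {int(tree.value[node].argmax())}")
        return
    f, t = tree.feature[node], tree.threshold[node]
    print(f"{indent}if x[{f}] <= {float(t)!r}:")
    emit(tree, tree.children_left[node], indent + "    ")
    print(f"{indent}else:")
    emit(tree, tree.children_right[node], indent + "    ")

print("def predict(x):")
emit(clf.tree_)
```

The emitted function makes the same predictions as the fitted model, with no tree object left to walk.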

BTW take a look at Lush, you might like it.

https://lush.sourceforge.net/

https://news.ycombinator.com/item?id=2406325

If you are looking for vectors and tensors in Guile, there is this

https://wedesoft.github.io/aiscm/

jebarker|4 hours ago

The killer feature of DTs is how fast they can be. I worked very hard on a project to try and replace DT based classifiers with small NNs in a low latency application. NNs could achieve non-trivial gains in classification accuracy but remained two orders of magnitude higher latency at inference time.

kqr|9 hours ago

Experts' nebulous decision making can often be modelled with simple decision trees and even decision chains (linked lists). Even when the expert thinks their decision making is more complex, a simple decision tree better models the expert's decision than the rules proposed by the experts themselves.

I've long dismissed decision trees because they seem so ham-fisted compared to regression and distance-based clustering techniques but decision trees are undoubtedly very effective.

See more in chapter seven of the Oxford Handbook of Expertise. It's fascinating!

ablob|9 hours ago

I once saw a visualization that basically partitioned decisions on a 2D plane. From that perspective, decision trees might just be a fancy word for kD-Trees partitioning the possibility space and attaching an action to the volumes.

Given that assumption, the nebulous decision making could stem from experts' decisions being more nuanced in the granularity of the surface separating two distinct actions. It might be a rough technique, but it should nonetheless lead to some pretty good approximations.
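
That partitioning view is easy to see directly (a small sketch assuming scikit-learn; the data and labels are made up): each internal node of a shallow tree is one axis-aligned cut of the current rectangle of the plane.

```python
# Sketch: a depth-2 tree partitions the 2D plane into axis-aligned
# rectangles, much like a k-d-style space partition. Data illustrative.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(500, 2))
y = ((X[:, 0] > 0.5) & (X[:, 1] > 0.5)).astype(int)  # label one corner

clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
t = clf.tree_
# Each internal node is one axis-aligned cut of its rectangle.
for node in range(t.node_count):
    if t.children_left[node] != -1:
        print(f"node {node}: split on x[{t.feature[node]}] "
              f"at {t.threshold[node]:.2f}")
```

Two cuts near 0.5 carve out the labeled corner rectangle, which is exactly the picture the visualization shows.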

getpokedagain|6 hours ago

I worked (professionally) on a product a few years ago based upon decision tree and random forest classifiers. I had no background in the math and had to learn this stuff, which has paid dividends as LLMs and AI have become hyped. This is one of the best explanations I've seen, and it has me super nostalgic for that project.

Gonna try to cook up something personal. It's amazing how people are now using regression models basically all the time, and yet no one uses these things on their own.

jjcc|5 hours ago

I worked on a product 25 years ago which was the best ID reader in the world at the time. The OCR engine was based on decision trees and "Random Forest" (I suspect the name didn't exist yet) with only 3 trees. It was very effective as a secret weapon for competitiveness. I tried to train a NN with a framework called SNNS (Stuttgart Neural Network Simulator) as a 4th tree to complement the existing 3.

Today, handwriting OCR is a "hello world" sample in TensorFlow.

xmprt|11 hours ago

Interesting website and great presentation. My only note is that the color contrast of some of the text makes it hard to read.

thesnide|10 hours ago

Exactly my thought, and here the reader view of FF is a godsend.

Having 'accessible' content is not only for people with disabilities; it also helps with bad color taste.

well, at least bad taste for readable content ;)

ssttoo|1 hour ago

I just wish we'd stop with the "unreasonable" click-bait. It cheapens an otherwise excellent article, like the "7 X (number 6 will surprise you)" of yesteryear.

moi2388|10 hours ago

That was beautifully presented!

EGreg|6 hours ago

Isn’t that exactly how humans (and even animals) operate?

Human societies look for actual major correlations and establish classifications. Except with scientific-minded humans, we often also want to know the why behind the correlations. David Hume got involved with that… https://brainly.com/question/50372476

Let me ask a provocative question. What, ultimately, is the difference between knowledge and bias?