

savagedata | 6 years ago

I wrote this regression tree tutorial a few years back that might be a good complement to the tutorial above, since it covers regression instead of classification and goes on to talk about bagging vs. random forests, out-of-bag samples, and tuning parameters: https://github.com/savagedata/regression-tree-tutorial

I wrote it at the start of my career and haven't shared it beyond my study group, so I'm happy to hear feedback.
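If it helps, here's roughly what the out-of-bag idea from the tutorial looks like in scikit-learn (a toy sketch of mine, not code from the tutorial; the dataset and parameters are arbitrary):

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (arbitrary choice, just for illustration)
X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)

# Each tree trains on a bootstrap sample; the rows it never saw are its
# out-of-bag (OOB) samples, which give a "free" validation estimate
# without a separate holdout set.
forest = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)

oob_r2 = forest.oob_score_  # R^2 estimated on out-of-bag samples
```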



anthony_doan | 6 years ago

It's a really good tutorial.

I like how you talk about Conditional Inference. My thesis is supposed to overcome the brute-force exhaustive search for best splits that Random Forest does (I use Dr. Loh's GUIDE trees) by using statistical methods instead.

> Many implementations of random forest default to 1/3 of your predictor variables.

This is interesting. I had heard it was sqrt(total number of predictors).

> Ensemble methods combine many individual trees to create one better, more stable model.

I think "stable" could be clarified as having good training accuracy and low generalization error (error rate on unseen data) compared to an individual tree. This is what Dr. Ho talks about with random forests.
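A quick sketch of what I mean by stability (my own toy example in scikit-learn, not from the tutorial; dataset and settings are arbitrary): a single fully grown tree tends to overfit, while averaging many bootstrapped trees usually gives lower error on held-out data.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# One fully grown tree: fits the training data closely, but is high
# variance on new data.
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
tree_mse = mean_squared_error(y_te, tree.predict(X_te))

# Averaging many bootstrapped trees reduces that variance ("stability").
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
forest_mse = mean_squared_error(y_te, forest.predict(X_te))
```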

But other than that I think it's an awesome tutorial.

One thing I've seen other tree and forest methods do for better generalization on unseen data is pruning via CV, choosing a cutoff of 0.5 to 1.0 standard errors from the best score. That may be worth talking about if you're interested.
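To sketch that standard-error rule in code (using scikit-learn's cost-complexity pruning as the pruning mechanism; the dataset and the 1.0-SE cutoff are my own arbitrary choices):

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_friedman1(n_samples=300, noise=1.0, random_state=0)

# Candidate pruning strengths from the cost-complexity pruning path.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas[:-1]  # drop the alpha that prunes down to the root

means, ses = [], []
for a in alphas:
    scores = cross_val_score(
        DecisionTreeRegressor(ccp_alpha=a, random_state=0),
        X, y, cv=5, scoring="neg_mean_squared_error")
    means.append(scores.mean())
    ses.append(scores.std(ddof=1) / np.sqrt(len(scores)))

# 1-SE rule: take the most aggressive pruning (largest alpha) whose CV
# score is still within one standard error of the best score.
best = int(np.argmax(means))
cutoff = means[best] - ses[best]
chosen_alpha = max(a for a, m in zip(alphas, means) if m >= cutoff)
```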

savagedata | 6 years ago

Thank you for the useful feedback! I'll have to look up GUIDE trees.

> This is interesting. I had heard it was sqrt(total number of predictors).

I was probably looking at the randomForest R package documentation [1], which says:

> mtry Number of variables randomly sampled as candidates at each split. Note that the default values are different for classification (sqrt(p) where p is number of variables in x) and regression (p/3)

I checked the H2O implementation of random forest [2] and they use the same defaults.

I'll add a note about the one third default being specific to regression since that seems like an important distinction.
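For anyone following along, here are both documented defaults in one small helper (the formulas come from the randomForest docs quoted above; the function itself, including its name, is just my illustration):

```python
import math

def default_mtry(p, task):
    """Default number of variables sampled at each split, per the
    randomForest R package docs: sqrt(p) for classification, p/3
    (at least 1) for regression. Hypothetical helper, not a real API."""
    if task == "classification":
        return max(math.floor(math.sqrt(p)), 1)
    return max(p // 3, 1)
```

With 10 predictors the two defaults happen to coincide: floor(sqrt(10)) and floor(10/3) are both 3, which may be part of why the two rules get mixed up.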

[1] https://www.rdocumentation.org/packages/randomForest/version...

[2] http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/d...

sandGorgon | 6 years ago

Thanks, that's pretty cool!