Trees

Tree-based methods are a well-known modeling technique used for both regression and classification. The general idea is to segment the feature space into individual subspaces. The rules for segmenting the data into their respective subspaces are summarized by a tree, which is why tree methods are often called decision tree methods. Tree-based methods encompass numerous approaches, such as bagging, boosting, and random forests.
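As a minimal sketch of this idea, the hand-written rules below partition a two-dimensional feature space into subspaces. The features (square footage and age of a home) and thresholds are invented for illustration, not taken from a fitted model:

```python
# A decision tree is just a set of nested threshold rules that
# segment the feature space into subspaces, each with a prediction.
# Features and split values here are illustrative assumptions.

def predict(x):
    """Classify a point (sq_ft, age) by walking hand-written splits."""
    sq_ft, age = x
    if sq_ft <= 1500:                      # first split: square footage
        return "low" if age > 30 else "medium"
    else:                                  # right branch: larger homes
        return "high" if age <= 10 else "medium"

print(predict((1200, 40)))   # -> low
print(predict((2000, 5)))    # -> high
```

A fitted tree works the same way, except the splits and leaf predictions are learned from data rather than written by hand.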

Common Applications

Common Problem Types

  • Regression

  • Classification

  • Rules-based segmentation

  • Problems where interpretability is critical

A Brief History

Tree-based methods were first published in the early 1960s and have since grown into a remarkable diversity of techniques and approaches. That growth was aided by free software and cheaper hardware, which made computations that were challenging to do by hand relatively easy for computers. Tree methods are also sometimes used to enhance traditional models such as least squares and logistic regression. If you are interested in a technical overview of the various approaches and a more in-depth history, see (Loh, 2014).

Code Examples

All of the code examples are written in Python, unless otherwise noted.

Containers

These code examples are Jupyter notebooks running in a container that comes with all the data, libraries, and code you'll need to run them. Click here to learn why you should be using containers, along with how to do so.

Boosting

Explore gradient boosting, a tree-based method, using XGBoost to analyze hotel customer data.

Quickstart: Download Docker, then run the commands below in a terminal.
# pull container, only needs to be run once
docker pull ghcr.io/thedatamine/starter-guides:boosting

# run container
docker run -p 8888:8888 -it ghcr.io/thedatamine/starter-guides:boosting

Need help implementing any of this code? Feel free to reach out to datamine-help@purdue.edu and we can help!