Advanced Machine Learning with R

Understanding a regression tree

To establish an understanding of tree-based methods, it's probably easier to start with a quantitative outcome and then move on to how the method works in a classification problem. The essence of a tree is that the features are partitioned, starting with the first split that reduces the residual sum of squares (RSS) the most. These binary splits continue until the tree terminates. Each subsequent split/partition isn't done on the entire dataset, but only on the portion of the prior split that it falls under. This top-down process is referred to as recursive partitioning. It's also a greedy process, a term you may stumble upon when reading about machine learning (ML) methods. Greedy means that at each split, the algorithm looks for the greatest reduction in the RSS without any regard to how well it will perform on the later partitions. The result is that you may end up with a full tree of unnecessary branches, leading to low bias but high variance. To control this effect, you need to prune the tree to an optimal size after building the full tree.
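As a rough sketch of what a single greedy step looks like (this is not code from the book), the function below scans the candidate cut points of one feature and returns the one that yields the lowest total RSS across the two resulting groups; best_split and the simulated x and y are made up purely for illustration:

# Find the cut point of feature x that minimizes the combined RSS
# of the two resulting groups -- one greedy step of recursive partitioning
best_split <- function(x, y) {
  u <- sort(unique(x))
  cuts <- (head(u, -1) + tail(u, -1)) / 2            # midpoints between adjacent values
  rss <- sapply(cuts, function(cut) {
    left  <- y[x <  cut]
    right <- y[x >= cut]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(cut = cuts[which.min(rss)], rss = min(rss))
}

set.seed(123)
x <- runif(30, 0, 10)
y <- ifelse(x < 3.5, 2, 7) + rnorm(30)
best_split(x, y)    # the chosen cut point should land close to 3.5

A full tree simply repeats this search on each resulting partition, which is why an unpruned tree can keep splitting until the branches add little beyond noise.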

This diagram provides a visualization of this technique in action:

Regression tree with three splits and four terminal nodes, and the corresponding node average and number of observations

The data is hypothetical, with 30 observations, a response ranging from 1 to 10, and two predictor features named X1 and X2, both ranging in value from 0 to 10. The tree has three splits leading to four terminal nodes. Each split is basically an if...then statement or, in R syntax, an ifelse(). The first split is: if X1 is less than 3.5, the data is partitioned into a group of four observations with an average response of 2.4 and a group of the remaining 26 observations. The left branch of four observations is a terminal node, as any further splits would not substantially improve the RSS. The predicted value for any observation falling into this partition of the tree is simply the node average of 2.4. The next split is at X2 < 4, and the final split is at X1 < 7.5, as sketched in the code below.
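To make the if...then structure concrete, here is one way the fitted tree could be written out as nested ifelse() calls in R. Only the 2.4 node average appears in the text, so the other three node averages, and the branch on which the third split occurs, are placeholders assumed for illustration:

# Hypothetical hand-coded version of the tree in the diagram;
# avg2, avg3, and avg4 stand in for the node averages not given in the text
predict_node <- function(X1, X2, avg2 = 5, avg3 = 7, avg4 = 9) {
  ifelse(X1 < 3.5, 2.4,                        # first split: terminal node, mean 2.4
         ifelse(X2 < 4, avg2,                  # second split
                ifelse(X1 < 7.5, avg3, avg4))) # third split
}

predict_node(X1 = 2, X2 = 6)    # falls in the left terminal node, returns 2.4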

An advantage of this method is that it can handle highly nonlinear relationships; however, can you see a couple of potential problems? The first issue is that an observation is simply given the average of the terminal node that it falls under, which can hurt the overall predictive performance (high bias). Conversely, if you keep partitioning the data further and further to achieve a low bias, high variance can become an issue. As with the other methods, you can use cross-validation to select the appropriate tree size.
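One common way to do this in R is with the rpart package, which cross-validates the candidate subtrees as the tree is grown. The sketch below grows a deliberately deep tree on simulated data resembling the example above and then prunes it back to the complexity parameter (cp) with the lowest cross-validated error; the data frame and parameter values are assumptions for illustration, not the book's code:

library(rpart)

# Simulated stand-in for the 30-observation example above
set.seed(123)
df <- data.frame(X1 = runif(30, 0, 10), X2 = runif(30, 0, 10))
df$y <- ifelse(df$X1 < 3.5, 2.4, 7) + rnorm(30)

# Grow a deep tree: low bias, high variance
full_tree <- rpart(y ~ X1 + X2, data = df,
                   control = rpart.control(cp = 0, minsplit = 2))

# rpart reports a cross-validated error (xerror) for each subtree;
# prune back to the cp value with the smallest xerror
best_cp <- full_tree$cptable[which.min(full_tree$cptable[, "xerror"]), "CP"]
pruned_tree <- prune(full_tree, cp = best_cp)

The pruned tree trades a small increase in bias for a meaningful reduction in variance, which is exactly the balance the cross-validated error is meant to find.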