MSSC 6250 Statistical Machine Learning
Can be used for regression and classification.
IDEA: Segmenting the predictor space into many simple regions.
Simple, useful for interpretation, and has a nice graphical representation.
Not competitive with the best supervised learning approaches in terms of prediction accuracy. (Large bias)
Combining a large number of trees (ensembles) often results in improvements in prediction accuracy, at the expense of some loss of interpretability.
CART is a nonparametric method that recursively partitions the feature space into hyper-rectangular subsets (boxes), and makes a prediction within each subset.
Divide the predictor space — the set of possible values for \(X_1, X_2, \dots, X_p\) — into \(J\) distinct and non-overlapping regions, \(R_1, R_2, \dots, R_J\).
KNN requires K and a distance measure.
SVM requires kernels.
A tree avoids these choices by recursively partitioning the feature space using a binary splitting rule \(\mathbf{1}\{x \le c \}\).
0: Red; 1: Blue
If \(x_2 < -0.64\), \(y = 0\).
If \(x_2 \ge -0.64\) and \(x_1 \ge 0.69\), \(y = 0\).
If \(x_2 \ge -0.64\), \(x_1 < 0.69\), and \(x_2 \ge 0.75\), \(y = 0\).
If \(x_2 \ge -0.64\), \(x_1 < 0.69\), \(x_2 < 0.75\), and \(x_1 < -0.69\), \(y = 0\).
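These rules can be read as nested binary splits. A minimal sketch, assuming the two predictors are named x1 and x2 and the labels follow the legend above (0 = Red, 1 = Blue):

```r
# The partition above written as nested binary splits
predict_region <- function(x1, x2) {
  if (x2 < -0.64) return(0)   # rule 1
  if (x1 >= 0.69) return(0)   # rule 2
  if (x2 >= 0.75) return(0)   # rule 3
  if (x1 < -0.69) return(0)   # rule 4
  1                           # remaining central region
}

predict_region(0.2, 0.3)  # falls in the remaining region, so predicted class is 1
```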
Step 5 may not be beneficial.
Step 6 may not be beneficial. (Could overfit)
Step 7 may not be beneficial. (Could overfit)
The classification error rate is the fraction of the training observations in that region that do not belong to the most common class: \[1 - \max_{k} (\hat{p}_{mk})\] where \(\hat{p}_{mk}\) is the proportion of training observations in the \(m\)th region that are from the \(k\)th class.
Classification error is not sufficiently sensitive for tree-growing.
Ideally, we hope each node (region) contains training points that belong to only one class.
The Gini index is defined by
\[\sum_{k=1}^K \hat{p}_{mk}(1 - \hat{p}_{mk})\] which is a measure of total variance across the K classes.
Gini is small if all of the \(\hat{p}_{mk}\)s are close to zero or one.
Node purity: a small value indicates that a node contains predominantly observations from a single class.
The Shannon entropy is defined as
\[- \sum_{k=1}^K \hat{p}_{mk} \log(\hat{p}_{mk}).\]
The entropy is near zero if the \(\hat{p}_{mk}\)s are all near zero or one.
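A quick numeric comparison of the three measures for a hypothetical node with class proportions \(\hat{p}_m = (0.8, 0.1, 0.1)\) (values chosen only for illustration):

```r
# Node impurity measures for one region with class proportions p_hat
p_hat <- c(0.8, 0.1, 0.1)

class_error <- 1 - max(p_hat)             # 0.2
gini        <- sum(p_hat * (1 - p_hat))   # 0.34
entropy     <- -sum(p_hat * log(p_hat))   # about 0.64 (natural log)

c(error = class_error, gini = gini, entropy = entropy)
```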
The goal is to find boxes \(R_1, \dots ,R_J\) that minimize the \(SS_{res}\), given by \[\sum_{j=1}^J\sum_{i \in R_j}\left( y_i - \hat{y}_{R_j}\right)^2\] where \(\hat{y}_{R_j}\) is the mean response for the training observations within \(R_j\).
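Since searching over all possible sets of boxes is infeasible, the tree is grown greedily, one binary split at a time. A minimal sketch of that greedy search for a single predictor on simulated data (all names and values are illustrative):

```r
# Greedy search for the best single split x <= cut that minimizes SS_res
set.seed(1)
x <- runif(100)
y <- ifelse(x < 0.5, 1, 3) + rnorm(100, sd = 0.3)   # true split near 0.5

ss_res <- function(cut) {
  left  <- y[x <= cut]
  right <- y[x >  cut]
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
}

cuts <- head(sort(unique(x)), -1)             # candidate cutpoints (drop the largest)
best <- cuts[which.min(sapply(cuts, ss_res))]
best                                          # should be close to 0.5
```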
Given the largest tree \(T_{max}\),
\[\begin{align} \min_{T \subset T_{max}} \sum_{m=1}^{|T|}\sum_{i:x_i\in R_m} \left( y_i - \hat{y}_{R_m}\right)^2 + \alpha|T| \end{align}\] where \(|T|\) indicates the number of terminal nodes of the tree \(T\).
Large \(\alpha\) results in small trees
Choose \(\alpha\) using CV
Algorithm 8.1 in ISL for building a regression tree.
Replace \(SS_{res}\) with misclassification rate for classification.
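In practice this pruning is available in rpart, whose complexity parameter cp plays the role of \(\alpha\) (rescaled by the root-node error). A hedged sketch, using MASS::Boston purely as an illustrative data set:

```r
library(rpart)

# Grow a large tree T_max by making the complexity parameter very small
fit <- rpart(medv ~ ., data = MASS::Boston,
             control = rpart.control(cp = 0.001, minsplit = 10))

# rpart reports a cross-validated error (xerror) for each value of cp
printcp(fit)

# Prune back to the cp with the smallest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```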
Linear regression
\[f(X) = \beta_0 + \sum_{j=1}^pX_j\beta_j\]
Regression tree
\[f(X) = \sum_{j=1}^J \hat{y}_{R_j}\mathbf{1}(\mathbf{X}\in R_j)\]
A tree performs better when there is a highly nonlinear and complex relationship between \(y\) and \(x\).
Trees are also preferred for interpretability and visualization.
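A small simulated illustration of this point: on a step-function relationship, a single tree fits well while a linear model cannot (data and values are made up for illustration):

```r
library(rpart)

# A piecewise-constant (step) relationship between x and y
set.seed(1)
x <- runif(200)
y <- ifelse(x < 0.3, 0, ifelse(x < 0.7, 2, 1)) + rnorm(200, sd = 0.2)
dat <- data.frame(x = x, y = y)

lin  <- lm(y ~ x, data = dat)      # linear fit: forced to be a straight line
tree <- rpart(y ~ x, data = dat)   # regression tree: piecewise constant

c(lm_mse = mean(residuals(lin)^2), tree_mse = mean((y - predict(tree))^2))
```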
Two heads are better than one, not because either is infallible, but because they are unlikely to go wrong in the same direction. – C.S. Lewis, British Writer (1898 - 1963)
"Three cobblers with their wits combined surpass Zhuge Liang." (Chinese proverb)
An ensemble method combines many weak learners (unstable, less accurate) to obtain a single, more powerful model.
CARTs suffer from high variance.
If independent \(Z_1, \dots, Z_n\) have variance \(\sigma^2\), then \(\bar{Z}\) has variance \(\sigma^2/n\).
Averaging a set of observations reduces variance!
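A quick simulation of this fact (assuming \(\sigma = 2\) and \(n = 50\), values chosen only for illustration):

```r
# Empirical variance of the sample mean vs. the theoretical sigma^2 / n
set.seed(1)
sigma <- 2
n     <- 50

zbar <- replicate(5000, mean(rnorm(n, mean = 0, sd = sigma)))

var(zbar)      # close to ...
sigma^2 / n    # ... the theoretical value 0.08
```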
With \(B\) separate training sets,
\[\hat{f}_{avg}(x) = \frac{1}{B}\sum_{b=1}^B\hat{f}_{b}(x)\]
Bootstrap aggregation, or bagging, is a procedure for reducing variance.
Generate \(B\) bootstrap samples by repeatedly sampling with replacement from the training set \(B\) times.
\[\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^B\hat{f}^*_{b}(x)\]
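A minimal by-hand version of bagging regression trees with rpart (B and the data set are illustrative choices; in practice packages such as randomForest do this for you):

```r
library(rpart)

# Bagging by hand: B bootstrap samples, one tree per sample, average the predictions
set.seed(1)
dat <- MASS::Boston
B   <- 200
n   <- nrow(dat)

preds <- sapply(1:B, function(b) {
  idx  <- sample(n, replace = TRUE)            # bootstrap sample b
  tree <- rpart(medv ~ ., data = dat[idx, ])   # f*_b grown on the bootstrap sample
  predict(tree, newdata = dat)                 # predictions on the original points
})

f_bag <- rowMeans(preds)   # hat{f}_bag(x) = (1/B) sum_b f*_b(x)
```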
For CART, the decision boundary has to be axis-aligned.
For bagging with \(B = 200\) trees, each grown on a bootstrap sample of 400 training points, the decision boundaries are smoother.
Using a very large \(B\) will not lead to overfitting.
Use \(B\) sufficiently large that the error has settled down.
Bagging improves prediction accuracy at the expense of interpretability.
When different trees are highly correlated, simply averaging is not very effective.
If there is one very strong predictor in the data set, in the collection of bagged trees, most or all of the trees will use this strong predictor in the top split. Therefore, all of the bagged trees will look quite similar to each other.
The predictions from the bagged trees will be highly correlated, and hence averaging does not lead to as large reduction in variance.
Random forests improve bagged trees by decorrelating the trees.
At each split, \(m\) predictors are randomly sampled as split candidates from the full set of \(p\) predictors.
\(m \approx \sqrt{p}\) for classification; \(m \approx p/3\) for regression.
Decorrelating: on average, \((p-m)/p\) of the splits will not even consider the strong predictor, so other predictors will have more of a chance.
If \(m = p\), random forests = bagging.
The improvement is significant when \(p\) is large.
randomForest::randomForest(x, y, mtry, ntree, nodesize, sampsize)
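A hedged usage sketch of this function (MASS::Boston and the tuning values are illustrative; here \(m \approx p/3\) since the response is numeric):

```r
library(randomForest)

set.seed(1)
dat <- MASS::Boston
p   <- ncol(dat) - 1                     # number of predictors

rf <- randomForest(medv ~ ., data = dat,
                   mtry = floor(p / 3),  # m: predictors tried at each split
                   ntree = 500,
                   importance = TRUE)

rf               # includes the out-of-bag (OOB) error estimate
varImpPlot(rf)   # variable importance across the forest
```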
Bagged trees are built on bootstrap data sets that are independent of each other.
In boosting, trees are grown sequentially: each tree is grown using information from previously grown trees.
Boosting does not involve bootstrap sampling; instead, each tree is fit to a modified version of the original data set: the residuals.
Base (weak) learners: could be any simple model (high bias, low variance); usually a decision tree.
Training weak models: fit a (shallow) tree \(\hat{f}^b\) with a relatively small number of splits \(d\).
Sequential training w.r.t. residuals: fit each tree in the sequence to the residuals left by the previously grown trees.
Suppose our final model is \(\hat{f}(x)\), which starts at \(\hat{f}(x) = 0\).
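A by-hand sketch of this sequential fitting, using shallow rpart trees and a shrinkage parameter \(\lambda\) as in ISL Algorithm 8.2 (the data set, B, tree depth, and \(\lambda\) are illustrative choices):

```r
library(rpart)

set.seed(1)
dat    <- MASS::Boston
B      <- 500            # number of trees
lambda <- 0.01           # shrinkage

f_hat <- rep(0, nrow(dat))   # the model starts at 0
r     <- dat$medv            # so the residuals start at the observed responses

for (b in 1:B) {
  dat_b      <- dat
  dat_b$medv <- r                                         # current residuals as the response
  fit_b  <- rpart(medv ~ ., data = dat_b,
                  control = rpart.control(maxdepth = 2))  # shallow tree, few splits
  pred_b <- predict(fit_b, newdata = dat)
  f_hat  <- f_hat + lambda * pred_b                       # update the model
  r      <- r - lambda * pred_b                           # update the residuals
}

mean((dat$medv - f_hat)^2)   # training MSE of the boosted ensemble
```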
distribution = "bernoulli"
: LogitBoost
AdaBoost (Adaptive Boosting) gbm(y ~ ., distribution = "adaboost")
Gradient Boosting/Extreme Gradient Boosting (XGBoost) xgboost
Bayesian Additive Regression Trees (BART) (ISL Sec. 8.2.4)
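A hedged usage sketch of gradient boosting with the gbm package (data set and tuning values are illustrative, not recommendations):

```r
library(gbm)

set.seed(1)
dat <- MASS::Boston

# Gradient boosting for regression (squared-error loss)
boost <- gbm(medv ~ ., data = dat,
             distribution = "gaussian",
             n.trees = 5000,           # B
             interaction.depth = 2,    # d splits per tree
             shrinkage = 0.01,         # lambda
             cv.folds = 5)

best_B <- gbm.perf(boost, method = "cv")   # number of trees chosen by cross-validation
```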