MSSC 6250 Statistical Machine Learning
So far we have mostly focused on parametric models, either unconditional \(p({\bf y} \mid \boldsymbol \theta)\) or conditional \(p({\bf y} \mid \mathbf{x}, \boldsymbol \theta)\), where \(\boldsymbol \theta\) is a fixed-dimensional vector of parameters.
The parameters are estimated from the training set \(\mathcal{D} = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^n\), but after model fitting, the training data are no longer used.
K-nearest neighbor (KNN) is a nonparametric method.
It can be used for both regression and classification.
In KNN, there are no parameters \(\boldsymbol \beta\) and no assumed functional form such as \(f(\mathbf{x}_0) = \mathbf{x}_0'\boldsymbol \beta\) in linear regression.
We directly estimate \(f(\mathbf{x}_0)\) from the stored training examples.
\[ \widehat{y}_0 = \frac{1}{k} \sum_{x_i \in N_k(x_0)} y_i,\] where the neighborhood of \(x_0\), \(N_k(x_0)\), defines the \(k\) training data points that are closest to \(x_0\).
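As a quick illustration of this formula, here is a minimal base-R sketch; the function name knn_reg and the simulated xtrain, ytrain, and query point are purely illustrative, not part of the course code.
# minimal KNN regression sketch: average the responses of the k nearest neighbors
knn_reg <- function(x0, xtrain, ytrain, k = 5) {
  d <- sqrt(rowSums((xtrain - matrix(x0, nrow(xtrain), ncol(xtrain), byrow = TRUE))^2))  # distances to x0
  nbr <- order(d)[1:k]   # indices of the k closest training points
  mean(ytrain[nbr])      # average their responses
}
# simulated training data for illustration
set.seed(1)
xtrain <- matrix(rnorm(200), ncol = 2)
ytrain <- xtrain[, 1]^2 + rnorm(100, sd = 0.1)
knn_reg(c(0.5, 0), xtrain, ytrain, k = 5)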
\(k\) determines the model complexity and degrees of freedom (df).
In general, the df can be defined as
\[\text{df}(\hat{f}) = \frac{1}{\sigma^2}\text{Trace}\left( \mathrm{Cov}(\hat{\mathbf{y}}, \mathbf{y})\right)= \frac{1}{\sigma^2}\sum_{i=1}^n \mathrm{Cov}(\hat{y}_i, y_i)\]
\(k = 1\): \(\hat{f}(x_i) = y_i\) and \(\text{df}(\hat{f}) = n\)
\(k = n\): \(\hat{f}(x_i) = \bar{y}\) and \(\text{df}(\hat{f}) = 1\)
For general \(k\), \(\text{df}(\hat{f}) = n/k\): each \(y_i\) receives weight \(1/k\) in its own prediction, so \(\mathrm{Cov}(\hat{y}_i, y_i) = \sigma^2/k\).
Linear regression with \(p\) coefficients: \(\text{df}(\hat{f}) = \text{Trace}\left( {\bf H} \right) = p\), where \({\bf H} = {\bf X}({\bf X}'{\bf X})^{-1}{\bf X}'\) is the hat matrix.
For any linear smoother \(\hat{\mathbf{y}} = {\bf S} \mathbf{y}\), \(\text{df}(\hat{f}) = \text{Trace}({\bf S})\).
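To see this numerically for KNN, a small sketch (simulated one-dimensional data; all names are illustrative) builds the smoother matrix \({\bf S}\), whose \((i, j)\) entry is \(1/k\) when \(x_j\) is among the \(k\) nearest neighbors of \(x_i\), and checks that its trace equals \(n/k\):
# sketch: KNN smoother matrix S and its trace (simulated data)
set.seed(1)
n <- 100; k <- 5
x1 <- sort(runif(n))
S <- matrix(0, n, n)
for (i in 1:n) {
  nbr <- order(abs(x1 - x1[i]))[1:k]  # k nearest neighbors of x1[i] (always includes i itself)
  S[i, nbr] <- 1 / k                  # equal weight 1/k for each neighbor
}
sum(diag(S))                          # trace(S) = n/k = 20 here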
For classification, KNN predicts the most popular class label among the \(k\) nearest neighbors of \(x_0\).
The resulting KNN decision boundary is generally nonlinear.
R: class::knn(), kknn::kknn(), FNN::knn(), parsnip::nearest_neighbor()
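The feature matrix x and class labels y used in the code below are assumed to be available; a hypothetical stand-in with the same sample sizes as the output that follows (two classes of 100 points each, two features) could be simulated as below, though the printed results come from the original course data, not from this sketch.
# hypothetical two-class data for illustration only
set.seed(2024)
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 1.5), ncol = 2))
y <- c(rep(0, 100), rep(1, 100))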
# fit 15-NN and predict on the training data itself (in-sample predictions)
knn_fit <- class::knn(train = x, test = x, cl = y, k = 15)
# confusion matrix of the training-set predictions
caret::confusionMatrix(table(knn_fit, y))
Confusion Matrix and Statistics

       y
knn_fit  0  1
      0 82 13
      1 18 87

               Accuracy : 0.845
                 95% CI : (0.787, 0.892)
    No Information Rate : 0.5
    P-Value [Acc > NIR] : <2e-16

                  Kappa : 0.69

 Mcnemar's Test P-Value : 0.472

            Sensitivity : 0.820
            Specificity : 0.870
         Pos Pred Value : 0.863
         Neg Pred Value : 0.829
             Prevalence : 0.500
         Detection Rate : 0.410
   Detection Prevalence : 0.475
      Balanced Accuracy : 0.845

       'Positive' Class : 0
set.seed(2024)
library(caret)
# 10-fold cross-validation to choose k
control <- trainControl(method = "cv", number = 10)
knn_cvfit <- train(y ~ ., method = "knn",
                   data = data.frame("x" = x, "y" = as.factor(y)),
                   tuneGrid = data.frame(k = seq(1, 40, 1)),
                   trControl = control)
# plot CV classification error against k
par(mar = c(4, 4, 0, 0))
plot(knn_cvfit$results$k, 1 - knn_cvfit$results$Accuracy,
     xlab = "K", ylab = "Classification Error", type = "b",
     pch = 19, col = 2)
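The k selected by cross-validation can be read off the fitted object directly (a usage note; knn_cvfit is the caret object created above):
knn_cvfit$bestTune   # k with the highest CV accuracy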
Euclidean distance: \[d^2(\mathbf{u}, \mathbf{v}) = \lVert \mathbf{u}- \mathbf{v}\rVert_2^2 = \sum_{j=1}^p (u_j - v_j)^2\]
Scaled (variance-standardized) distance: \[d^2(\mathbf{u}, \mathbf{v}) = \sum_{j=1}^p \frac{(u_j - v_j)^2}{\sigma_j^2}\]
Mahalanobis distance: \[d^2(\mathbf{u}, \mathbf{v}) = (\mathbf{u}- \mathbf{v})' \Sigma^{-1} (\mathbf{u}- \mathbf{v}),\] where \(\Sigma\) is the covariance matrix of the features.
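All three distances are easy to compute in R; here is a small sketch with made-up correlated data (mahalanobis() is from the base stats package; the objects X, u, v are illustrative):
# illustrative sketch of the three distances (simulated correlated features)
set.seed(1)
X <- matrix(rnorm(200), ncol = 2) %*% matrix(c(1, 0.8, 0.8, 1), 2, 2)
u <- X[1, ]; v <- X[2, ]
sum((u - v)^2)                            # squared Euclidean distance
sum((u - v)^2 / apply(X, 2, var))         # scaled by per-feature variances
mahalanobis(u, center = v, cov = cov(X))  # squared Mahalanobis distance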
Red and green points have the same Euclidean distance to the center.
The red point is farther away from the center in terms of Mahalanobis distance.
ElemStatLearn::zip.train
Digits 0-9 scanned from envelopes by the U.S. Postal Service
\(16 \times 16\) pixel images, for a total of \(p = 256\) variables
Each pixel records a grayscale intensity as its numerical value
1NN with Euclidean distance gives 5.6% error rate
1NN with tangent distance (Simard et al., 1993) gives 2.6% error
# load the digit data; column 1 is the digit label, columns 2-257 are the 256 pixels
data(zip.train, zip.test, package = "ElemStatLearn")
# fit a 3-NN model and calculate the test error
knn.fit <- class::knn(zip.train[, 2:257], zip.test[, 2:257], zip.train[, 1], k = 3)
# overall prediction error
mean(knn.fit != zip.test[, 1])
[1] 0.0533
KNN needs to store the entire training data for prediction (a "lazy learner").
It must compute the distance from \(x_0\) to every training sample and sort them.
The choice of distance measure may affect accuracy.
As \(p\) increases, it becomes harder to find \(k\) nearby neighbors in the input space: KNN must cover a large range of values along each input dimension to collect the "neighbors".
These "neighbors" of \(x_0\) are in fact far away from \(x_0\), so they may be poor predictors of the behavior of the function at \(x_0\).
The method is not local anymore despite the name “nearest neighbor”!
In high dimensions KNN often performs worse than linear regression.
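A quick simulation illustrates the problem; this is a sketch with made-up settings (n = 100 points uniform on \([0,1]^p\), query point at the center of the cube), showing that the distance to the nearest neighbor grows with \(p\):
# sketch: nearest-neighbor distances grow with dimension p
set.seed(1)
nn_dist <- function(p, n = 100) {
  X <- matrix(runif(n * p), n, p)     # n points uniform on [0, 1]^p
  x0 <- rep(0.5, p)                   # query point at the center of the cube
  min(sqrt(colSums((t(X) - x0)^2)))   # distance to its nearest neighbor
}
sapply(c(1, 2, 5, 10, 50, 100), function(p) median(replicate(50, nn_dist(p))))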