```
      [,1]  [,2]
[1,]  50.3 -49.7
[2,] -49.7  50.3
```
\({\bf X'X} = \begin{bmatrix} 1 & 0.99 \\ 0.99 & 1 \end{bmatrix}\) \(\quad ({\bf X'X})^{-1} = \begin{bmatrix} 50.3 & -49.7 \\ -49.7 & 50.3 \end{bmatrix}\)
\(\mathrm{Var}(b_j) = 50.3\sigma^2\)
Suppose \(\beta_1 = \beta_2 = 1\) and \(\mathrm{Corr}(x_1, x_2) = 0.99\).
The relatively “flat” valley of the objective function runs along the eigenvector of \({\bf X}'{\bf X}\) associated with the smallest eigenvalue.
The optimizer could land anywhere along this valley, leading to a large variance of the LSE.
Over many simulation runs, the LSEs scatter around the line \(\beta_1 + \beta_2 = 2\), the direction of the eigenvector with the smallest eigenvalue.
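To make this concrete, here is a minimal simulation sketch (the data-generating setup is assumed, not taken from the source): two standardized predictors with correlation 0.99 and \(\beta_1 = \beta_2 = 1\), showing the inflated entries of \(({\bf X}'{\bf X})^{-1}\) and the LSEs trading off along \(\beta_1 + \beta_2 = 2\).

```r
# Minimal simulation sketch (assumed setup): Corr(x1, x2) ~ 0.99, beta1 = beta2 = 1
set.seed(6250)
n <- 100
x1 <- rnorm(n)
x2 <- 0.99 * x1 + sqrt(1 - 0.99^2) * rnorm(n)    # Corr(x1, x2) is about 0.99
X <- cbind(x1, x2)

solve(crossprod(X) / n)   # roughly the 50.3 / -49.7 pattern shown above
eigen(crossprod(X) / n)   # smallest eigenvalue ~ 0.01, eigenvector ~ (1, -1) / sqrt(2)

# LSEs over repeated responses scatter along the line beta1 + beta2 = 2
b <- t(replicate(500, {
  y <- x1 + x2 + rnorm(n)
  coef(lm(y ~ x1 + x2 - 1))
}))
cor(b[, 1], b[, 2])       # strongly negative: the two estimates trade off
```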
OLS works well in the world of “large \(n\), small \(p\)”.
When \(p > n\), \({\bf X}'{\bf X}\) is not invertible.
There is no unique \(\boldsymbol \beta\) estimate.
Intuition: there are too many degrees of freedom (\(p\)) to specify a model, but not enough information (\(n\)) to decide which one is the right one.
Make \({\bf X}'{\bf X}\) invertible when \(p > n\) by regularizing coefficient behavior!
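A quick sketch of this fix, with hypothetical dimensions (not from the source): \({\bf X}'{\bf X}\) is rank-deficient when \(p > n\), but \({\bf X}'{\bf X} + n\lambda {\bf I}\) is invertible.

```r
# Sketch with hypothetical dimensions: p > n makes X'X singular;
# adding n * lambda * I restores invertibility.
set.seed(1)
n <- 20; p <- 50
X <- matrix(rnorm(n * p), n, p)
XtX <- crossprod(X)              # p x p, but rank at most n < p
qr(XtX)$rank                     # 20: not full rank, so solve(XtX) would fail
lambda <- 0.1
B <- solve(XtX + n * lambda * diag(p))   # invertible after regularization
dim(B)
```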
A good estimator balances bias and variance well, or minimizes the mean square error \[\text{MSE}(\hat{\beta}) = E[(\hat{\beta} - \beta)^2] = \mathrm{Var}(\hat{\beta}) + \text{Bias}(\hat{\beta})^2\]
Find a biased estimator \(\hat{\boldsymbol \beta}\) that has smaller variance and MSE than the LSE \({\bf b}\).
\[\widehat{\boldsymbol \beta}^\text{r} = (\mathbf{X}' \mathbf{X}+ n \lambda \mathbf{I})^{-1} \mathbf{X}' \mathbf{y},\]
\[ \begin{align} \widehat{\boldsymbol \beta}^\text{r} =& \, \mathop{\mathrm{arg\,min}}_{\boldsymbol \beta} \underbrace{\lVert \mathbf{y}- \mathbf{X}\boldsymbol \beta\rVert_2^2}_{SS_{res}} + n \lambda \lVert\boldsymbol \beta\rVert_2^2\\ =& \, \mathop{\mathrm{arg\,min}}_{\boldsymbol \beta} \underbrace{\frac{1}{n} \sum_{i=1}^n (y_i - x_i'\boldsymbol \beta)^2}_{\text{MSE}_{\texttt{Tr}}} + \lambda \sum_{j=1}^p \beta_j^2\\ =& \, \mathop{\mathrm{arg\,min}}_{\boldsymbol \beta} \; -\text{goodness of fit} + \text{model complexity/flexibility} \end{align} \]
\[ \begin{align} \widehat{\boldsymbol \beta}^\text{r} =& \mathop{\mathrm{arg\,min}}_{\boldsymbol \beta} \frac{1}{n} \sum_{i=1}^n (y_i - x_i'\boldsymbol \beta)^2 + \lambda \sum_{j=1}^p \beta_j^2, \end{align} \]
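As a sanity check, here is a sketch (toy data, assumed) verifying that the closed-form solution above minimizes the penalized criterion; `optim()` is used only as a numerical comparison.

```r
# Sketch: closed-form ridge solution vs. direct numerical minimization
set.seed(2)
n <- 50; p <- 5; lambda <- 0.5
X <- matrix(rnorm(n * p), n, p)
y <- X %*% rep(1, p) + rnorm(n)

# closed form: (X'X + n * lambda * I)^{-1} X'y
beta_ridge <- drop(solve(crossprod(X) + n * lambda * diag(p), crossprod(X, y)))

# minimize (1/n) * sum of squared errors + lambda * sum of squared coefficients
obj <- function(b) mean((y - X %*% b)^2) + lambda * sum(b^2)
beta_opt <- optim(rep(0, p), obj, method = "BFGS")$par

cbind(closed_form = beta_ridge, optim = beta_opt)   # essentially identical
```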
Properties of ridge regression:
- The penalty term \(\lambda \lVert \boldsymbol \beta\rVert^2_2 = \lambda \boldsymbol \beta' \mathbf{I}\boldsymbol \beta\) has a circle-shaped contour.
- It forces the objective function to be more convex.
- The result is a more stable, less variable solution.
When the least squares estimates \(b_j\) are highly variable or \(p > n\), ridge regression can have lower test MSE and better predictive performance.
Figure: squared bias (black), variance (green), and \(\text{MSE}_{\texttt{Tr}}\) (purple).
`MASS::lm.ridge()`

The `lambda` argument of `lm.ridge()` actually specifies \(n\lambda\) in our notation.
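A usage sketch (the value of \(\lambda\) here is only illustrative; the source does not state which \(\lambda\) produced the output below), assuming the `mtcars` example used in these notes:

```r
# Sketch: since lm.ridge()'s `lambda` argument plays the role of n * lambda
# in our notation, pass n * lambda.
library(MASS)
n <- nrow(mtcars)
lambda <- 0.5                                        # "our" lambda (illustrative)
fit <- lm.ridge(mpg ~ ., data = mtcars, lambda = n * lambda)
coef(fit)    # estimates reported on the original scale of the predictors
```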
OLS is scale equivariant: \(X_jb_j\) remains the same regardless of how \(X_j\) is scaled.
Ridge coefficient estimates, in contrast, can change substantially when a given predictor is multiplied by a constant: \(X_j\hat{\beta}^{r}_{j, \lambda}\) depends on \(\lambda\), on the scaling of \(X_j\), and even on the scaling of the other predictors (see the sketch below).
Standardize all predictors!
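Here is a small sketch of the scale (non-)equivariance claims above, assuming the `mtcars` data and an illustrative \(\lambda\); the ridge fit below uses the closed form without standardization to make the point.

```r
# Rescale one predictor (wt in lbs instead of 1000 lbs) and compare X_j * beta_j
d1 <- mtcars
d2 <- transform(mtcars, wt = wt * 1000)

# OLS: the fitted contribution wt * b_wt is unchanged by the rescaling
coef(lm(mpg ~ ., data = d1))["wt"] * d1$wt[1]
coef(lm(mpg ~ ., data = d2))["wt"] * d2$wt[1]

# Ridge with the same lambda and no standardization: the contribution changes
ridge_beta <- function(d, lambda) {
  x <- as.matrix(d[, -1]); y <- d$mpg
  xc <- scale(x, scale = FALSE); yc <- y - mean(y)   # center only
  drop(solve(crossprod(xc) + nrow(x) * lambda * diag(ncol(x)), crossprod(xc, yc)))
}
ridge_beta(d1, 0.5)["wt"] * d1$wt[1]
ridge_beta(d2, 0.5)["wt"] * d2$wt[1]
```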
```
                mpg cyl disp  hp drat   wt qsec vs am gear carb
Mazda RX4      21.0   6  160 110 3.90 2.62 16.5  0  1    4    4
Mazda RX4 Wag  21.0   6  160 110 3.90 2.88 17.0  0  1    4    4
Datsun 710     22.8   4  108  93 3.85 2.32 18.6  1  1    4    1
```
```
              cyl     disp       hp     drat       wt     qsec       vs
16.53766 -0.16240  0.00233 -0.01493  0.92463 -2.46115  0.49259  0.37465
      am     gear     carb
 2.30838  0.68572 -0.57579
```
`lm.ridge()` standardizes the predictors internally: `coef(fit)` transforms the estimates back to the original scale, while `fit$coef` shows the coefficients of the standardized predictors.
`fit$coef` gives

```
   cyl   disp     hp   drat     wt   qsec     vs     am   gear   carb
-0.285  0.285 -1.008  0.487 -2.370  0.866  0.186  1.134  0.498 -0.915
```

Note that `lm.ridge()` uses \(n\) instead of \(n-1\) when calculating the standard deviation, which matches the values obtained when this scaling is taken into account:

```
        cyl  disp    hp  drat    wt  qsec    vs    am  gear   carb
[1,] -0.285 0.285 -1.01 0.487 -2.37 0.866 0.186 1.13 0.498 -0.915
```
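A sketch of the divisor-\(n\) point, assuming `fit` is the `lm.ridge()` object from above: `fit$scales` stores the scalings `lm.ridge()` applied, which match `sd()` only after the \(\sqrt{(n-1)/n}\) correction.

```r
# lm.ridge() scales each predictor by the standard deviation computed with
# divisor n; compare its stored scalings with sd() (which uses n - 1).
x <- as.matrix(mtcars[, -1])
n <- nrow(x)
cbind(lm_ridge_scales = fit$scales,
      sd_divisor_n    = apply(x, 2, sd) * sqrt((n - 1) / n))
```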
```
modified HKB estimator is 2.59
modified L-W estimator is 1.84
smallest value of GCV at 15
```
Hoerl, Kennard, and Baldwin (1975): \(\lambda \approx \frac{p \hat{\sigma}^2}{{\bf b}'{\bf b}}\)
Lawless and Wang (1976): \(\lambda \approx \frac{np \hat{\sigma}^2}{{\bf b'X}'{\bf Xb}}\)
Golub, Heath, and Wahba (1979): Generalized Cross Validation
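A sketch plugging the OLS fit into the HKB and L-W formulas above (assumed: \({\bf b}\) and \(\hat{\sigma}^2\) are the OLS slope estimates and residual variance on the original scale, so these numbers will not match the `select()` output above, which works on `lm.ridge()`'s standardized scale).

```r
# HKB: p * sigma^2 / (b'b);  L-W: n * p * sigma^2 / (b'X'Xb), with b the LSE
ols    <- lm(mpg ~ ., data = mtcars)
X      <- model.matrix(ols)[, -1]        # drop the intercept column
b      <- coef(ols)[-1]
sigma2 <- summary(ols)$sigma^2
p      <- length(b)
n      <- nrow(X)

c(HKB = p * sigma2 / sum(b^2),
  LW  = n * p * sigma2 / sum((X %*% b)^2))
```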
Cross Validation (CV) is a resampling method.
Resampling methods repeatedly draw samples from the training data and refit the model of interest on each sample.
CV can be used to estimate the test error and to choose tuning parameters. The \(k\)-fold CV procedure:
Randomly divide the training data into \(k\) folds, of approximately equal size.
Use 1 fold for validation to compute the MSE, and the remaining \(k - 1\) folds for training.
Repeat \(k\) times. Each time a different fold is treated as a validation set.
Average \(k\) metrics, \(\text{MSE}_{CV} = \frac{1}{k}\sum_{i=1}^k\text{MSE}_i\).
Use the CV estimate \(\text{MSE}_{CV}\) to select the “best” tuning parameter.
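A minimal \(k\)-fold CV sketch for choosing \(\lambda\) (assumed setup: `mtcars`, the closed-form ridge estimator on standardized predictors; standardizing once up front and the specific \(\lambda\) grid are simplifications).

```r
set.seed(6250)
x <- scale(as.matrix(mtcars[, -1]))
y <- mtcars$mpg
n <- nrow(x); k <- 5
folds <- sample(rep(1:k, length.out = n))          # random fold assignment
lambdas <- 10^seq(-3, 1, length.out = 30)

cv_mse <- sapply(lambdas, function(lam) {
  fold_mse <- sapply(1:k, function(j) {
    tr <- folds != j                               # training folds
    b  <- solve(crossprod(x[tr, ]) + sum(tr) * lam * diag(ncol(x)),
                crossprod(x[tr, ], y[tr] - mean(y[tr])))
    pred <- mean(y[tr]) + x[!tr, , drop = FALSE] %*% b
    mean((y[!tr] - pred)^2)                        # validation MSE for fold j
  })
  mean(fold_mse)                                   # MSE_CV for this lambda
})
lambdas[which.min(cv_mse)]                         # "best" lambda by CV
```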
The `glmnet` package:

- The parameter `alpha` controls the ridge (`alpha = 0`) and lasso (`alpha = 1`) penalties.
- Supply a decreasing sequence of `lambda` values.
- `lm.ridge()` uses \(SS_{res}\) and \(n\lambda\), while `glmnet()` uses \(\text{MSE}_{\texttt{Tr}}\) and \(\lambda\).
- The argument `x` should be a matrix.
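A usage sketch along these lines (the \(\lambda\) grid is illustrative):

```r
library(glmnet)
x <- as.matrix(mtcars[, -1])                   # x must be a matrix
y <- mtcars$mpg
fit_glmnet <- glmnet(x, y, alpha = 0,          # alpha = 0: ridge penalty
                     lambda = 10^seq(2, -2, length.out = 100))  # decreasing grid
coef(fit_glmnet, s = 0.5)                      # coefficients at lambda = 0.5
```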
Note: `lm.ridge()` and `glmnet()` coefficients do not match exactly, especially when transforming back to the original scale.

`cv.glmnet()`
```
[1] 2.75
[1] 16.1
```
```
11 x 1 sparse Matrix of class "dgCMatrix"
                  s1
(Intercept) 21.11734
cyl         -0.37134
disp        -0.00525
hp          -0.01161
drat         1.05477
wt          -1.23420
qsec         0.16245
vs           0.77196
am           1.62381
gear         0.54417
carb        -0.54742
```
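A sketch of how a cross-validated ridge fit like the output above might be produced (fold assignment is random, so the selected \(\lambda\) and coefficients will not match exactly):

```r
library(glmnet)
set.seed(6250)
x <- as.matrix(mtcars[, -1])
y <- mtcars$mpg
cv_fit <- cv.glmnet(x, y, alpha = 0, nfolds = 10)   # 10-fold CV over a lambda grid
cv_fit$lambda.min                     # lambda with the smallest CV error
coef(cv_fit, s = "lambda.min")        # ridge coefficients at that lambda
```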
The generalized cross-validation (GCV) is a modified version of the leave-one-out CV (\(n\)-fold CV).
The GCV criterion is given by \[\text{GCV}(\lambda) = \frac{1}{n}\sum_{i=1}^n \left[ \frac{y_i - x_i' \widehat{\boldsymbol \beta}^\text{r}_\lambda}{1 - \frac{\text{Trace}(\mathbf{S}_\lambda)}{n}} \right]^2\]
where \(\mathbf{S}_\lambda\) is the hat matrix corresponding to the ridge regression:
\[\mathbf{S}_\lambda = \mathbf{X}(\mathbf{X}' \mathbf{X}+ n\lambda \mathbf{I})^{-1} \mathbf{X}'\]
Select the best \(\lambda\) that produces the smallest GCV.
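A sketch computing \(\text{GCV}(\lambda)\) directly from the formula above (assumed: centered response and standardized predictors so that no intercept is needed; the \(\lambda\) grid is illustrative).

```r
x <- scale(as.matrix(mtcars[, -1]))
y <- mtcars$mpg - mean(mtcars$mpg)
n <- nrow(x)

gcv <- function(lam) {
  S <- x %*% solve(crossprod(x) + n * lam * diag(ncol(x)), t(x))  # hat matrix S_lambda
  fitted <- S %*% y
  mean(((y - fitted) / (1 - sum(diag(S)) / n))^2)
}

lambdas <- 10^seq(-3, 1, length.out = 50)
gcv_values <- sapply(lambdas, gcv)
lambdas[which.min(gcv_values)]        # lambda with the smallest GCV
```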