Ridge Regression and Cross Validation

MSSC 6250 Statistical Machine Learning

Dr. Cheng-Han Yu
Department of Mathematical and Statistical Sciences
Marquette University

Ridge Regression

When Ordinary Least Squares (OLS) Doesn’t Work Well: Collinearity

  • When predictors are highly correlated, \(\mathrm{Var}(b_j)\) is greatly inflated.
    • A tiny change in the training data causes a large change in \(b_j\), leading to unreliable estimation and prediction.
  • \({\bf X'X} = \begin{bmatrix} 1 & 0.99 \\ 0.99 & 1 \end{bmatrix}\) \(\quad ({\bf X'X})^{-1} = \begin{bmatrix} 50.3 & -49.7 \\ -49.7 & 50.3 \end{bmatrix}\)

  • \(\mathrm{Var}(b_j) = 50.3\sigma^2\)

    • A 50-fold increase over the ideal case in which the two regressors are orthogonal.
solve(matrix(c(1, 0.99, 0.99, 1), 2, 2))  # inverse of X'X
      [,1]  [,2]
[1,]  50.3 -49.7
[2,] -49.7  50.3

When OLS Doesn’t Work Well: Optimization Perspective

  • \(\beta_1 = \beta_2 = 1\) and \(\mathrm{Corr}(x_1, x_2) = 0.99\)

  • The relatively “flat” valley of the objective function runs along the eigenvector of \({\bf X}'{\bf X}\) associated with the smallest eigenvalue.

eigen(
    matrix(c(1, 0.99, 0.99, 1), 
             2, 2)
    )
eigen() decomposition
$values
[1] 1.99 0.01

$vectors
      [,1]   [,2]
[1,] 0.707 -0.707
[2,] 0.707  0.707

When OLS Doesn’t Work Well: Optimization Perspective

  • A small change in the training set perturbs the objective function. The LSEs lie along a valley centered around the truth.

When OLS Doesn’t Work Well: High Variance

  • The optimizer could land anywhere along the valley, leading to large variance of the LSE.

  • Over many simulation runs, the LSEs lie around the line \(\beta_1 + \beta_2 = 2\), the direction of the eigenvector associated with the smallest eigenvalue (see the simulation sketch below).
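
To make this concrete, here is a minimal simulation sketch (the seed, sample size, and number of replications are arbitrary choices, not from the slides): it repeatedly draws correlated predictors with \(\beta_1 = \beta_2 = 1\) and records the LSEs.

set.seed(6250)                                 # assumed seed, for reproducibility
Sigma <- matrix(c(1, 0.99, 0.99, 1), 2, 2)     # Corr(x1, x2) = 0.99
b <- t(replicate(500, {
    X <- MASS::mvrnorm(100, mu = c(0, 0), Sigma = Sigma)
    y <- X %*% c(1, 1) + rnorm(100)            # true beta1 = beta2 = 1
    coef(lm(y ~ X - 1))                        # LSE without intercept
}))
round(c(var_b1 = var(b[, 1]), corr_b1_b2 = cor(b[, 1], b[, 2]),
        mean_sum = mean(rowSums(b))), 2)
# b1 and b2 are strongly negatively correlated while b1 + b2 stays near 2:
# the LSEs spread along the eigenvector direction with the smallest eigenvalue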

When OLS Doesn’t Work Well: Large-\(p\)-small-\(n\) (High Dimensional Data)

  • OLS works well in the world of “large-\(n\)-small-\(p\)”.

  • When \(p > n\), \({\bf X}'{\bf X}\) is not invertible (see the sketch below).

  • There is no unique \(\boldsymbol \beta\) estimate.

Intuition: There are too many degrees of freedom (\(p\)) to specify a model, but not enough information (\(n\)) to decide which one is right.

  • The model is too flexible and ends up overfitting.
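
A quick sketch of the problem (the dimensions and seed are arbitrary choices for illustration):

set.seed(6250)                       # assumed seed
n <- 5; p <- 10                      # more predictors than observations
X <- matrix(rnorm(n * p), n, p)
qr(t(X) %*% X)$rank                  # rank is at most n = 5 < p = 10
# solve(t(X) %*% X)                  # would fail: X'X is singular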

Remedy for Large Variance and Large-\(p\)-small-\(n\)

  • Make \({\bf X}'{\bf X}\) invertible when \(p > n\) by regularizing coefficient behavior!

  • A good estimator balances bias and variance well, or equivalently, minimizes the mean squared error (decomposed below) \[\text{MSE}(\hat{\beta}) = E[(\hat{\beta} - \beta)^2] = \mathrm{Var}(\hat{\beta}) + \text{Bias}(\hat{\beta})^2\]

  • Find a biased estimator \(\hat{\boldsymbol \beta}\) that has smaller variance and MSE than the LSE \({\bf b}\).
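
The decomposition above follows from adding and subtracting \(E(\hat{\beta})\); the cross term vanishes because \(E[\hat{\beta} - E(\hat{\beta})] = 0\):

\[ \begin{align} E[(\hat{\beta} - \beta)^2] =& \, E\left[\left\{(\hat{\beta} - E(\hat{\beta})) + (E(\hat{\beta}) - \beta)\right\}^2\right] \\ =& \, E\left[(\hat{\beta} - E(\hat{\beta}))^2\right] + (E(\hat{\beta}) - \beta)^2 \\ =& \, \mathrm{Var}(\hat{\beta}) + \text{Bias}(\hat{\beta})^2 \end{align} \]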

Ridge Regression

  • Idea: Add a ridge (diagonal matrix) to \({\bf X} ' {\bf X}\).

\[\widehat{\boldsymbol \beta}^\text{r} = (\mathbf{X}' \mathbf{X}+ n \lambda \mathbf{I})^{-1} \mathbf{X}' \mathbf{y},\]

  • To regularize coefficient behavior, we add an \(\ell_2\) penalty to the residual sum of squares, for some tuning parameter \(\lambda > 0\).

\[ \begin{align} \widehat{\boldsymbol \beta}^\text{r} =& \, \mathop{\mathrm{arg\,min}}_{\boldsymbol \beta} \underbrace{\lVert \mathbf{y}- \mathbf{X}\boldsymbol \beta\rVert_2^2}_{SS_{res}} + n \lambda \lVert\boldsymbol \beta\rVert_2^2\\ =& \, \mathop{\mathrm{arg\,min}}_{\boldsymbol \beta} \underbrace{\frac{1}{n} \sum_{i=1}^n (y_i - x_i'\boldsymbol \beta)^2}_{\text{MSE}_{\texttt{Tr}}} + \lambda \sum_{j=1}^p \beta_j^2\\ =& \, \mathop{\mathrm{arg\,min}}_{\boldsymbol \beta} \color{green} - \text{ goodness of fit } + \text{ model complexity/flexibility} \end{align} \]
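
Setting the gradient of the penalized criterion to zero shows where the closed form above comes from:

\[ \frac{\partial}{\partial \boldsymbol \beta} \left[ \lVert \mathbf{y}- \mathbf{X}\boldsymbol \beta\rVert_2^2 + n \lambda \lVert\boldsymbol \beta\rVert_2^2 \right] = -2 \mathbf{X}'(\mathbf{y}- \mathbf{X}\boldsymbol \beta) + 2 n \lambda \boldsymbol \beta= \mathbf{0} \quad \Longrightarrow \quad (\mathbf{X}' \mathbf{X}+ n \lambda \mathbf{I}) \, \widehat{\boldsymbol \beta}^\text{r} = \mathbf{X}' \mathbf{y}\]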

Properties of Ridge Regression

\[ \begin{align} \widehat{\boldsymbol \beta}^\text{r} =& \mathop{\mathrm{arg\,min}}_{\boldsymbol \beta} \frac{1}{n} \sum_{i=1}^n (y_i - x_i'\boldsymbol \beta)^2 + \lambda \sum_{j=1}^p \beta_j^2, \end{align} \]

Properties of ridge regression:

  • Has fewer (effective) degrees of freedom, in the sense that the cost increases when large coefficients are used.
  • Favors small-magnitude coefficient estimates to avoid cost penalty. (Shrinkage)
  • The shrinkage parameter \(\lambda\) controls the degree of penalty.
    • \(\lambda \rightarrow 0\): No penalty, and \(\widehat{\boldsymbol \beta}^\text{r} \rightarrow \bf b\).
    • \(\lambda \rightarrow \infty\): Unbearable penalty, and \(\widehat{\boldsymbol \beta}^\text{r} \rightarrow \mathbf{0}\) (demonstrated below).
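
A minimal sketch of the shrinkage behavior, using the closed-form expression on the standardized mtcars predictors (the \(n\lambda\) values 0, 10, and 1000 are arbitrary choices):

X <- scale(data.matrix(mtcars[, -1]))          # standardized predictors
y <- scale(mtcars$mpg, scale = FALSE)          # centered response
ridge <- function(nlambda)                     # closed-form ridge estimate
    drop(solve(t(X) %*% X + nlambda * diag(ncol(X))) %*% t(X) %*% y)
round(cbind("0" = ridge(0), "10" = ridge(10), "1000" = ridge(1000)), 3)
# nlambda = 0 reproduces the LSE b; the estimates shrink toward 0 as the penalty grows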

Ridge Penalty

\[\lambda \lVert \boldsymbol \beta\rVert^2_2 = \lambda \boldsymbol \beta' \mathbf{I}\boldsymbol \beta\]

  • The penalty contour is circle-shaped.

  • It forces the objective function to be more convex.

  • This yields a more stable, less variable solution.

Geometrical Meaning of Ridge Regression

More Convex Loss Function

  • Adding a ridge penalty forces the objective to be more convex because it adds to the eigenvalues of \({\bf X}'{\bf X}\).
eigen(matrix(c(1, 0.99, 0.99, 1), 2, 2) + diag(2))[1]  # eigenvalues of X'X + I
$values
[1] 2.99 1.01

The Bias-variance Tradeoff

  • As \(\lambda \downarrow\), bias \(\downarrow\) and variance \(\uparrow\)
  • As \(\lambda \uparrow\), bias \(\uparrow\) and variance \(\downarrow\) (see the simulation below)
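
A small simulation sketch of the tradeoff, reusing the correlated design from earlier (the seed, sample size, and the value \(n\lambda = 20\) are arbitrary choices):

set.seed(6250)                                   # assumed seed
Sigma <- matrix(c(1, 0.99, 0.99, 1), 2, 2)
ridge_sim <- function(nlambda) t(replicate(500, {
    X <- MASS::mvrnorm(100, c(0, 0), Sigma)
    y <- X %*% c(1, 1) + rnorm(100)              # true beta1 = beta2 = 1
    drop(solve(t(X) %*% X + nlambda * diag(2)) %*% t(X) %*% y)
}))
sapply(list(ols = ridge_sim(0), ridge = ridge_sim(20)), function(b)
    c(var_b1 = var(b[, 1]), bias_b1 = mean(b[, 1]) - 1))
# expect a much smaller variance but a larger (negative) bias for beta_1 under ridge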

Lower Test MSE

  • When the \(b_j\) are highly variable or \(p > n\), ridge regression can have a lower test MSE and better predictive performance.

  • Squared bias (black)

  • Variance (green)

  • Test MSE (purple)

Source: ISL Figure 6.5

MASS::lm.ridge()

  • The lambda argument of lm.ridge() actually specifies the \(n\lambda\) in our notation.

  • OLS is scale equivariant: \(X_jb_j\) remains the same regardless of how \(X_j\) is scaled.

  • Ridge coefficient estimates can change substantially when multiplying a given predictor by a constant. \(X_j\hat{\beta}^{r}_{j, \lambda}\) depends on \(\lambda\), the scaling of \(X_j\), and even the scaling of other predictors.

  • Standardize all predictors!

head(mtcars, 3)
               mpg cyl disp  hp drat   wt qsec vs am gear carb
Mazda RX4     21.0   6  160 110 3.90 2.62 16.5  0  1    4    4
Mazda RX4 Wag 21.0   6  160 110 3.90 2.88 17.0  0  1    4    4
Datsun 710    22.8   4  108  93 3.85 2.32 18.6  1  1    4    1
(fit <- MASS::lm.ridge(mpg ~ ., data = mtcars, lambda = 1)) ## ridge fit
              cyl     disp       hp     drat       wt     qsec       vs 
16.53766 -0.16240  0.00233 -0.01493  0.92463 -2.46115  0.49259  0.37465 
      am     gear     carb 
 2.30838  0.68572 -0.57579 

Scaling Issues of lm.ridge()

  • coef(fit) reports the coefficients transformed back to the original scale of the predictors.

  • fit$coef shows the coefficients of the standardized predictors.

coef(fit)
              cyl     disp       hp     drat       wt     qsec       vs 
16.53766 -0.16240  0.00233 -0.01493  0.92463 -2.46115  0.49259  0.37465 
      am     gear     carb 
 2.30838  0.68572 -0.57579 
fit$coef
   cyl   disp     hp   drat     wt   qsec     vs     am   gear   carb 
-0.285  0.285 -1.008  0.487 -2.370  0.866  0.186  1.134  0.498 -0.915 

Scaling Issues of lm.ridge()

fit$coef
   cyl   disp     hp   drat     wt   qsec     vs     am   gear   carb 
-0.285  0.285 -1.008  0.487 -2.370  0.866  0.186  1.134  0.498 -0.915 
  • lm.ridge() uses \(n\) instead of \(n-1\) when calculating the standard deviation.
# each column has mean 0 and var 1
X <- scale(data.matrix(mtcars[, -1]), center = TRUE, scale = TRUE)

# center y but not scaling
y <- scale(mtcars$mpg, center = TRUE, scale = FALSE)


# n * lambda = 1, matching lm.ridge(lambda = 1)
my_ridge_coef <- solve(t(X) %*% X + diag(ncol(X))) %*% t(X) %*% y
t(my_ridge_coef)
        cyl disp    hp  drat    wt qsec   vs   am  gear   carb
[1,] -0.294 0.27 -1.02 0.495 -2.39 0.87 0.19 1.15 0.505 -0.937

Scaling Issues of lm.ridge()

fit$coef
   cyl   disp     hp   drat     wt   qsec     vs     am   gear   carb 
-0.285  0.285 -1.008  0.487 -2.370  0.866  0.186  1.134  0.498 -0.915 


# use n instead of (n-1) for standardization
n <- nrow(X); X <- X * sqrt(n / (n - 1))
t(solve(t(X) %*% X + diag(ncol(X))) %*% t(X) %*% y)  # now matches fit$coef
        cyl  disp    hp  drat    wt  qsec    vs   am  gear   carb
[1,] -0.285 0.285 -1.01 0.487 -2.37 0.866 0.186 1.13 0.498 -0.915

Ridge Trace

ridge_fit <- lm.ridge(mpg ~ ., data = mtcars, lambda = 0:40)
matplot(coef(ridge_fit)[, -1], type = "l", xlab = "Lambda", ylab = "Coefficients")
text(rep(1, 10), coef(ridge_fit)[1,-1], colnames(mtcars)[2:11])
  • Select a value of \(\lambda\) at which the ridge estimates are stable.

Methods for Choosing \(\lambda\)

MASS::select(ridge_fit)
modified HKB estimator is 2.59 
modified L-W estimator is 1.84 
smallest value of GCV  at 15 
  • Hoerl, Kennard, and Baldwin (1975): \(\lambda \approx \frac{p \hat{\sigma}^2}{{\bf b}'{\bf b}}\) (sketched below)

  • Lawless and Wang (1976): \(\lambda \approx \frac{np \hat{\sigma}^2}{{\bf b'X}'{\bf Xb}}\)

  • Golub, Heath, and Wahba (1979): Generalized Cross Validation
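
A rough sketch of the (unmodified) HKB formula on the standardized predictors; the residual degrees of freedom \(n - p - 1\) and the scaling conventions are assumptions, so the value need not match the "modified HKB estimator" reported by MASS::select():

X <- scale(data.matrix(mtcars[, -1]))            # standardized predictors
y <- scale(mtcars$mpg, scale = FALSE)            # centered response
n <- nrow(X); p <- ncol(X)
ols <- lm.fit(X, y)                              # OLS on the standardized data
sigma2_hat <- sum(ols$residuals^2) / (n - p - 1) # estimate of sigma^2
b <- ols$coefficients
p * sigma2_hat / sum(b^2)                        # HKB suggestion for lambda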

Cross Validation

  • Cross Validation (CV) is a resampling method.

  • Resampling methods repeatedly draw samples from the training set and refit a model of interest on each sample.

  • CV can be used to

    • estimate the test error when there is no test data. (Model assessment)
    • select the tuning parameter that controls the model complexity/flexibility. (Model selection)

\(k\)-Fold Cross Validation

  1. Randomly divide the training data into \(k\) folds, of approximately equal size.

  2. Use 1 fold as the validation set to compute the MSE, and the remaining \(k - 1\) folds for training.

  3. Repeat \(k\) times. Each time a different fold is treated as a validation set.

  4. Average \(k\) metrics, \(\text{MSE}_{CV} = \frac{1}{k}\sum_{i=1}^k\text{MSE}_i\).

  5. Use the CV estimate \(\text{MSE}_{CV}\) to select the “best” tuning parameter.

Five-fold cross validation. Source: Data science in a box
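
To make the procedure concrete, here is a minimal sketch of \(k\)-fold CV for choosing the ridge \(\lambda\) with glmnet() (introduced on the next slide); the seed, \(k = 5\), and the candidate \(\lambda\) grid are arbitrary choices:

library(glmnet)
x <- data.matrix(mtcars[, -1]); y <- mtcars$mpg
set.seed(6250)                                   # assumed seed
k <- 5
folds <- sample(rep(1:k, length.out = nrow(x)))  # randomly assign rows to folds
lambdas <- 10^seq(1, -2, length.out = 20)        # candidate tuning parameters
cv_mse <- sapply(lambdas, function(l) {
    mean(sapply(1:k, function(j) {               # average validation MSE over folds
        fit <- glmnet(x[folds != j, ], y[folds != j], alpha = 0, lambda = l)
        mean((y[folds == j] - predict(fit, x[folds == j, ]))^2)
    }))
})
lambdas[which.min(cv_mse)]                       # lambda with the smallest CV estimate

cv.glmnet(), used later in these slides, automates exactly this loop.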

glmnet package

  • The parameter alpha controls the ridge (alpha = 0) and lasso (alpha = 1) penalties.

  • Supply a decreasing sequence of lambda values.

  • lm.ridge() uses \(SS_{res}\) and \(n\lambda\), while glmnet() uses \(\text{MSE}_{\texttt{Tr}}\) and \(\lambda\).

  • Argument x should be a matrix.

lm.ridge(formula = mpg ~ ., 
         data = mtcars, 
         lambda = 5:0)
glmnet(x = data.matrix(mtcars[, -1]), 
       y = mtcars$mpg, 
       alpha = 0, 
       lambda = 5:0/nrow(mtcars))

Note

  • lm.ridge() and glmnet() coefficients do not match exactly, especially when transforming back to the original scale.
  • No need to worry too much as we focus on predictive performance.

\(k\)-Fold Cross Validation using cv.glmnet()

  • The \(\lambda\) sequence is selected automatically; the plot is on the \(\log_{e} \lambda\) scale.
ridge_cv_fit <- cv.glmnet(x = data.matrix(mtcars[, -1]), y = mtcars$mpg, alpha = 0,
                          nfolds = 10, type.measure = "mse")
plot(ridge_cv_fit$glmnet.fit, "lambda")

Determine \(\lambda\)

plot(ridge_cv_fit)

ridge_cv_fit$lambda.min
[1] 2.75
# largest lambda s.t. error is within 1 s.e of the min
ridge_cv_fit$lambda.1se 
[1] 16.1
coef(ridge_cv_fit, s = "lambda.min")
11 x 1 sparse Matrix of class "dgCMatrix"
                  s1
(Intercept) 21.11734
cyl         -0.37134
disp        -0.00525
hp          -0.01161
drat         1.05477
wt          -1.23420
qsec         0.16245
vs           0.77196
am           1.62381
gear         0.54417
carb        -0.54742

Generalized Cross-Validation

  • The generalized cross-validation (GCV) is a modified version of the leave-one-out CV (\(n\)-fold CV).

  • The GCV criterion is given by \[\text{GCV}(\lambda) = \frac{1}{n}\sum_{i=1}^n \left[ \frac{y_i - x_i' \widehat{\boldsymbol \beta}^\text{r}_\lambda}{1 - \frac{\text{Trace}(\mathbf{S}_\lambda)}{n}} \right]^2\]

where \(\mathbf{S}_\lambda\) is the hat matrix corresponding to the ridge regression:

\[\mathbf{S}_\lambda = \mathbf{X}(\mathbf{X}' \mathbf{X}+ n\lambda \mathbf{I})^{-1} \mathbf{X}'\]
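
A minimal sketch of this criterion, plugging in the standardized mtcars predictors and centered response (the \(n\lambda\) values are arbitrary, and the result is not guaranteed to be on the same scale as the GCV values stored by lm.ridge()):

X <- scale(data.matrix(mtcars[, -1]))            # standardized predictors
y <- scale(mtcars$mpg, scale = FALSE)            # centered response
n <- nrow(X)
gcv <- function(nlambda) {
    S <- X %*% solve(t(X) %*% X + nlambda * diag(ncol(X))) %*% t(X)  # hat matrix S_lambda
    res <- y - S %*% y                           # ridge residuals
    mean((res / (1 - sum(diag(S)) / n))^2)       # GCV criterion
}
sapply(c(1, 5, 15), gcv)                         # evaluate at a few values of n * lambda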

Generalized Cross-Validation

plot(ridge_fit$lambda, 
     ridge_fit$GCV, 
     type = "l", col = "darkgreen", 
     ylab = "GCV", xlab = "Lambda", 
     lwd = 3)

idx <- which.min(ridge_fit$GCV)
ridge_fit$lambda[idx]
[1] 15
round(coef(ridge_fit)[idx, ], 2)
        cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb 
21.13 -0.37 -0.01 -0.01  1.05 -1.23  0.16  0.77  1.62  0.54 -0.55 

Select the best \(\lambda\) that produces the smallest GCV.