Feature Selection and LASSO

MSSC 6250 Statistical Machine Learning

Dr. Cheng-Han Yu
Department of Mathematical and Statistical Sciences
Marquette University

Feature/Variable Selection

When OLS Doesn’t Work Well

When \(p \gg n\), it is often the case that many of the features in the model are not associated with the response.

  • Model Interpretability: By removing irrelevant features \(X_j\)s, i.e., setting the corresponding \(\beta_j\)s to zero, we can obtain a model that is more easily interpreted. (Feature/Variable selection)

  • Least squares is unlikely to yield any coefficient estimates that are exactly zero.

Variable Selection

  • We have a large pool of candidate regressors, of which only a few are likely to be important.

Two “conflicting” goals in model building:

  • as many features as possible for better predictive performance on new data (smaller bias).

  • as few regressors as possible because as the number of regressors increases,

    • \(\mathrm{Var}(\hat{y})\) increases (larger variance)
    • data collection and maintenance cost more
    • the model becomes more complex

A compromise between the two hopefully leads to the “best” regression equation.

What does best mean?

There is no unique definition of “best”, and different methods specify different subsets of the candidate regressors as best.

Three Classes of Methods

  • Subset Selection (ISL Sec. 6.1) Identify a subset of the \(p\) predictors that we believe to be related to the response.
    • Need a selection method and a selection criterion.
  • Shrinkage (ISL Sec. 6.2) Fit a model that forces some coefficients to be shrunk to zero.
  • Dimension Reduction (ISL Sec. 6.3) Find \(m\) representative features that are linear combinations of the \(p\) original predictors (\(m \ll p\)), then fit least squares.
    • Principal component regression (Unsupervised) pls::pcr()
    • Partial least squares (Supervised) pls::plsr()
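
As a rough sketch of how these two dimension-reduction fits look with the pls package: dat and y below are hypothetical placeholders for a data frame and its response, and the cross-validation settings are illustrative choices, not prescriptions.

library(pls)
# Principal component regression: components are built from the predictors alone (unsupervised)
pcr_fit <- pcr(y ~ ., data = dat, scale = TRUE, validation = "CV")
# Partial least squares: components are built using the response as well (supervised)
pls_fit <- plsr(y ~ ., data = dat, scale = TRUE, validation = "CV")
summary(pcr_fit)   # variance explained and CV error by number of components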

Subset Selection

  • Identify a subset of all the candidate predictors with the best OLS performance.
    • Best subset selection olsrr::ols_step_best_subset()
    • Forward stepwise selection olsrr::ols_step_forward_p()
    • Backward stepwise elimination olsrr::ols_step_backward_p()
    • Hybrid stepwise selection olsrr::ols_step_both_p()
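
A minimal sketch of how these olsrr calls are typically used, assuming a hypothetical data frame dat with response y; the default p-value entry/removal thresholds are kept.

library(olsrr)
fit <- lm(y ~ ., data = dat)      # full model with all candidate regressors
ols_step_best_subset(fit)         # best subset selection
ols_step_forward_p(fit)           # forward stepwise selection by p-value
ols_step_backward_p(fit)          # backward stepwise elimination by p-value
ols_step_both_p(fit)              # hybrid stepwise selection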

Selection Criteria

  • An evaluation metric should consider Goodness of Fit and Model Complexity:

Goodness of Fit: The more regressors, the better

Complexity Penalty: The fewer regressors, the better

  • Evaluate subset models:
    • \(R_{adj}^2\) \(\uparrow\)
    • Mallow’s \(C_p\) \(\downarrow\)
    • Information Criterion (AIC, BIC) \(\downarrow\)
    • PREdiction Sum of Squares (PRESS) \(\downarrow\) (Allen, D.M. (1974))
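
As a sketch, several of these criteria can be computed directly for a fitted lm object (fit here is a hypothetical full or subset fit); the PRESS line uses the standard leverage shortcut for leave-one-out residuals.

summary(fit)$adj.r.squared                       # adjusted R^2 (larger is better)
AIC(fit); BIC(fit)                               # information criteria (smaller is better)
sum((residuals(fit) / (1 - hatvalues(fit)))^2)   # PRESS (smaller is better)
# Mallow's Cp needs a reference full model, e.g., olsrr::ols_mallows_cp(fit, full_fit)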

Lasso (Least Absolute Shrinkage and Selection Operator)

Why Lasso?

  • Subset selection methods

    • may be computationally infeasible (fit OLS over millions of times)
    • do not explore all possible subset models (no global solution)
  • Ridge regression does shrink coefficients, but it still keeps all \(p\) predictors in the model.

  • Lasso regularizes coefficients so that some are shrunk exactly to zero, thereby performing feature selection.

  • Like ridge regression, for a given \(\lambda\), Lasso only fits a single model.

What is Lasso?

Unlike ridge regression, which adds an \(\ell_2\) penalty, Lasso adds an \(\ell_1\) penalty on the parameters:

\[\begin{align} \widehat{\boldsymbol \beta}^\text{l} =& \, \, \mathop{\mathrm{arg\,min}}_{\boldsymbol \beta} \lVert \mathbf{y}- \mathbf{X}\boldsymbol \beta\rVert_2^2 + n \lambda \lVert\boldsymbol \beta\rVert_1\\ =& \, \, \mathop{\mathrm{arg\,min}}_{\boldsymbol \beta} \frac{1}{n} \sum_{i=1}^n (y_i - x_i' \boldsymbol \beta)^2 + \lambda \sum_{j=1}^p |\beta_j|, \end{align}\]

  • The \(\ell_1\) penalty forces some of the coefficient estimates to be exactly equal to zero when \(\lambda\) is sufficiently large, yielding sparse models.
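
To make the objective concrete, here is a minimal sketch that evaluates the penalized loss above for a candidate coefficient vector; X, y, beta, and lambda are placeholders.

lasso_objective <- function(beta, X, y, lambda) {
  mean((y - X %*% beta)^2) + lambda * sum(abs(beta))   # (1/n) RSS + lambda * l1 norm
}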

\(\ell_2\) vs. \(\ell_1\)

  • Ridge shrinks big coefficients much more than Lasso does.
  • Lasso puts a relatively larger penalty on small coefficients.
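
A quick numeric illustration of the two penalty shapes (the coefficient values 0.1 and 3 are arbitrary):

b <- c(0.1, 3)
b^2      # ridge (l2) contributions: 0.01 and 9 -- tiny for small b, huge for big b
abs(b)   # lasso (l1) contributions: 0.1 and 3  -- relatively larger for small b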

ElemStatLearn::prostate Data

lcavol lweight age   lbph svi   lcp gleason pgg45   lpsa train
-0.580    2.77  50 -1.386   0 -1.39       6     0 -0.431  TRUE
-0.994    3.32  58 -1.386   0 -1.39       6     0 -0.163  TRUE
-0.511    2.69  74 -1.386   0 -1.39       7    20 -0.163  TRUE
-1.204    3.28  58 -1.386   0 -1.39       6     0 -0.163  TRUE
 0.751    3.43  62 -1.386   0 -1.39       6     0  0.372  TRUE
-1.050    3.23  50 -1.386   0 -1.39       6     0  0.765  TRUE
 0.737    3.47  64  0.615   0 -1.39       6     0  0.765 FALSE
 0.693    3.54  58  1.537   0 -1.39       6     0  0.854  TRUE

cv.glmnet(alpha = 1)

library(glmnet)
lasso_fit <- cv.glmnet(x = data.matrix(prostate[, 1:8]), y = prostate$lpsa, nfolds = 10,
                       alpha = 1)
plot(lasso_fit)
plot(lasso_fit$glmnet.fit, "lambda")

Lasso Coefficients

  • The fit at lambda.min keeps more nonzero coefficients than the fit at lambda.1se.

  • Larger penalty \(\lambda\) forces more coefficients to be zero, and the model is more “sparse”.

coef(lasso_fit, s = "lambda.min")
9 x 1 sparse Matrix of class "dgCMatrix"
                  s1
(Intercept)  0.17999
lcavol       0.56099
lweight      0.61908
age         -0.02069
lbph         0.09531
svi          0.75180
lcp         -0.09912
gleason      0.04745
pgg45        0.00432
coef(lasso_fit, s = "lambda.1se")
9 x 1 sparse Matrix of class "dgCMatrix"
               s1
(Intercept) 0.644
lcavol      0.455
lweight     0.314
age         .    
lbph        .    
svi         0.367
lcp         .    
gleason     .    
pgg45       .    

One-Variable Lasso and Shrinkage: Concept

  • The Lasso solution does not have an analytic (closed-form) expression in general.

  • Consider the univariate regression model

\[\underset{\beta}{\mathop{\mathrm{arg\,min}}} \quad \frac{1}{n} \sum_{i=1}^n (y_i - x_i \beta)^2 + \lambda |\beta|\]

With some derivation that uses the OLS solution \(b\) of the squared-error loss (so that \(\sum_{i=1}^n (y_i - x_i b) x_i = 0\)), we can write

\[\begin{align} &\frac{1}{n} \sum_{i=1}^n (y_i - x_i \beta)^2 \\ =& \frac{1}{n} \sum_{i=1}^n (y_i - x_i b + x_i b - x_i \beta)^2 \\ =& \frac{1}{n} \sum_{i=1}^n \Big[ \underbrace{(y_i - x_i b)^2}_{\text{I}} + \underbrace{2(y_i - x_i b)(x_i b - x_i \beta)}_{\text{II}} + \underbrace{(x_i b - x_i \beta)^2}_{\text{III}} \Big] \end{align}\]

One-Variable Lasso and Shrinkage: Concept

\[\begin{align} & \sum_{i=1}^n 2(y_i - x_i b)(x_i b - x_i \beta) = (b - \beta) {\color{OrangeRed}{\sum_{i=1}^n 2(y_i - x_i b)x_i}} = (b - \beta) {\color{OrangeRed}{0}} = 0 \end{align}\]

Because term I does not involve \(\beta\) and term II vanishes, the original problem reduces to the third term plus the penalty

\[\begin{align} &\underset{\beta}{\mathop{\mathrm{arg\,min}}} \quad \frac{1}{n} \sum_{i=1}^n (x_ib - x_i \beta)^2 + \lambda |\beta| \\ =&\underset{\beta}{\mathop{\mathrm{arg\,min}}} \quad \frac{1}{n} \left[ \sum_{i=1}^n x_i^2 \right] (b - \beta)^2 + \lambda |\beta| \end{align} \] Without loss of generality, assume that \(x\) is standardized with mean 0 and variance \(\frac{1}{n}\sum_{i=1}^n x_i^2 = 1\).

One-Variable Lasso and Shrinkage: Concept

This leads to a general problem of

\[\underset{\beta}{\mathop{\mathrm{arg\,min}}} \quad (\beta - b)^2 + \lambda |\beta|,\] For \(\beta > 0\),

\[\begin{align} 0 =& \frac{\partial}{\partial \beta} \,\, \left[(\beta - b)^2 + \lambda |\beta| \right] = 2 (\beta - b) + \lambda \\ \Longrightarrow \quad \beta =&\, b - \lambda/2 \end{align}\]

\[\begin{align} \widehat{\beta}^\text{l} &= \begin{cases} b - \lambda/2 & \text{if} \quad b > \lambda/2 \\ 0 & \text{if} \quad |b| \le \lambda/2 \\ b + \lambda/2 & \text{if} \quad b < -\lambda/2 \\ \end{cases} \end{align}\]

  • Lasso provides a soft-thresholding solution.

  • When \(\lambda\) is large enough, \(\widehat{\beta}^\text{l}\) will be shrunk to zero.

Objective function

As an example, take \(b = 1\), so the squared-error part of the objective is \((\beta - 1)^2\). Once the penalty \(\lambda\) exceeds \(2 = 2|b|\), the minimizer stays at 0.
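
A minimal numerical check of the soft-thresholding formula, using \(b = 1\) as above; the \(\lambda\) values are arbitrary.

soft_threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda / 2, 0)

b <- 1
for (lambda in c(0.5, 1, 2, 3)) {
  # minimize (beta - b)^2 + lambda * |beta| numerically
  num <- optimize(function(beta) (beta - b)^2 + lambda * abs(beta), interval = c(-5, 5))$minimum
  cat("lambda =", lambda, " numerical:", round(num, 3),
      " soft-threshold:", soft_threshold(b, lambda), "\n")
}
# once lambda >= 2|b| = 2, both give 0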

Variable Selection Property and Shrinkage

  • The selection frequency: the proportion of replications in which a variable has a nonzero coefficient estimate.

  • \(\mathbf{y}= \mathbf{X}\boldsymbol \beta + \epsilon = \sum_{j = 1}^p X_j \times 0.4^{\sqrt{j}} + \epsilon\)

  • \(p = 20\), \(\epsilon \sim N(0, 1)\), and the simulation is replicated 100 times.
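
A hedged sketch of this simulation with glmnet; the sample size \(n = 100\), the use of lambda.1se, and the standard normal design are my choices for illustration.

library(glmnet)
set.seed(6250)
n <- 100; p <- 20; n_rep <- 100
beta <- 0.4^sqrt(1:p)                         # true coefficients 0.4^sqrt(j)
selected <- matrix(0, nrow = n_rep, ncol = p)
for (r in 1:n_rep) {
  X <- matrix(rnorm(n * p), n, p)
  y <- drop(X %*% beta) + rnorm(n)
  cvfit <- cv.glmnet(X, y, alpha = 1)
  b_hat <- as.vector(coef(cvfit, s = "lambda.1se"))[-1]   # drop intercept
  selected[r, ] <- as.numeric(b_hat != 0)
}
colMeans(selected)    # proportion of replications in which each variable is selected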

Bias-Variance Trade-off

Comparing Lasso and Ridge

Constrained Optimization

  • For every value of \(\lambda\), there is some \(s\) such that the two optimization problems are equivalent, giving the same coefficient estimates.

  • The \(\ell_1\) and \(\ell_2\) penalties define a constraint region in which the \(\beta_j\)s must lie, a budget for how large they can be.

  • A larger \(s\) (smaller \(\lambda\)) means a larger region in which the \(\beta_j\)s can move freely.

Lasso \[\begin{align} \min_{\boldsymbol \beta} \,\,& \lVert \mathbf{y}- \mathbf{X}\boldsymbol \beta\rVert_2^2 + n\lambda\lVert\boldsymbol \beta\rVert_1 \end{align}\]


\[\begin{align} \min_{\boldsymbol \beta} \,\,& \lVert \mathbf{y}- \mathbf{X}\boldsymbol \beta\rVert_2^2\\ \text{s.t.} \,\, & \sum_{j=1}^p|\beta_j| \leq s \end{align}\]

Ridge \[\begin{align} \min_{\boldsymbol \beta} \,\,& \lVert \mathbf{y}- \mathbf{X}\boldsymbol \beta\rVert_2^2 + n\lambda\lVert\boldsymbol \beta\rVert_2^2 \end{align}\]


\[\begin{align} \min_{\boldsymbol \beta} \,\,& \lVert \mathbf{y}- \mathbf{X}\boldsymbol \beta\rVert_2^2\\ \text{s.t.} \,\, & \sum_{j=1}^p \beta_j^2 \leq s \end{align}\]
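
One way to see the correspondence numerically: fit Lasso at a given \(\lambda\) and read off the implied budget \(s\) as the \(\ell_1\) norm of the fitted coefficients. A sketch using the earlier prostate fit, with lambda.1se chosen just for illustration:

b_lasso <- as.vector(coef(lasso_fit, s = "lambda.1se"))[-1]   # drop intercept
sum(abs(b_lasso))   # the constrained problem with this budget s gives the same solution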

Geometric Representation of Optimization

What do the constraints look like geometrically?

When \(p = 2\),

  • the \(\ell_1\) constraint is \(|\beta_1| + |\beta_2| \leq s\) (diamond)
  • the \(\ell_2\) constraint is \(\beta_1^2 + \beta_2^2 \leq s\) (circle)

Source: https://stats.stackexchange.com/questions/350046/the-graphical-intuiton-of-the-lasso-in-case-p-2

Way of Shrinking (\(p = 1\) and standardized \(x\))

Lasso Soft-thresholding

\[\begin{align} \widehat{\beta}^\text{l} &= \begin{cases} b - \lambda/2 & \text{if} \quad b > \lambda/2 \\ 0 & \text{if} \quad |b| \le \lambda/2 \\ b + \lambda/2 & \text{if} \quad b < -\lambda/2 \\ \end{cases} \end{align}\]

Ridge Proportional shrinkage

\[\begin{align} \widehat{\beta}^\text{r} = \dfrac{b}{1+\lambda}\end{align}\]
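
A small sketch comparing the two shrinkage rules as functions of the OLS estimate \(b\); the choice \(\lambda = 1\) is arbitrary.

lambda <- 1
b <- seq(-3, 3, by = 0.01)
lasso_est <- sign(b) * pmax(abs(b) - lambda / 2, 0)   # soft-thresholding
ridge_est <- b / (1 + lambda)                         # proportional shrinkage
plot(b, lasso_est, type = "l", ylab = "shrunken estimate")
lines(b, ridge_est, lty = 2)
abline(0, 1, col = "gray")                            # no-shrinkage reference
legend("topleft", legend = c("Lasso", "Ridge"), lty = 1:2)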

Predictive Performance

Perform well (lower test MSE) when

Lasso

  • A relatively small number of the \(\beta_j\)s are substantially large, and the remaining coefficients are small or exactly zero.

  • Reduces bias more

ISL Fig. 6.9

Ridge

  • The response is a function of many predictors, all with coefficients of roughly equal size.

  • Reduces variance more

ISL Fig. 6.8
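
A hedged simulation sketch of this comparison; the dimensions, signal sizes, and train/test split below are my own choices. One would typically expect Lasso to do better in the sparse setting and ridge in the dense setting.

library(glmnet)
set.seed(1)
n <- 100; p <- 45
X <- matrix(rnorm(2 * n * p), 2 * n, p)        # first n rows train, last n rows test
beta_sparse <- c(rep(2, 5), rep(0, p - 5))     # a few large coefficients
beta_dense  <- rep(0.5, p)                     # many coefficients of roughly equal size
for (scenario in c("sparse", "dense")) {
  beta <- if (scenario == "sparse") beta_sparse else beta_dense
  y <- drop(X %*% beta) + rnorm(2 * n)
  tr <- 1:n; te <- (n + 1):(2 * n)
  lasso <- cv.glmnet(X[tr, ], y[tr], alpha = 1)
  ridge <- cv.glmnet(X[tr, ], y[tr], alpha = 0)
  cat(scenario,
      "- Lasso test MSE:", round(mean((y[te] - predict(lasso, X[te, ], s = "lambda.min"))^2), 2),
      " Ridge test MSE:", round(mean((y[te] - predict(ridge, X[te, ], s = "lambda.min"))^2), 2), "\n")
}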

Notes of Lasso

Warning

  • Even though Lasso performs feature selection, do not add predictors that are known to be unrelated to the response.

  • Curse of dimensionality. The test MSE tends to increase as the dimensionality \(p\) increases, unless the additional features are truly associated with the response.

  • Do not conclude that the predictors with nonzero coefficients selected by Lasso or other selection methods predict the response more effectively than the predictors left out of the model.

Other Topics

  • Elastic net penalty: \(\lambda \left[ (1 - \alpha) \lVert \boldsymbol \beta\rVert_2^2 + \alpha \lVert\boldsymbol \beta\rVert_1 \right]\)

  • General \(\ell_q\) penalty: \(\lambda \sum_{j=1}^p |\beta_j|^q\)

ESL Fig. 3.12
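
In glmnet, the mixing weight is the alpha argument (note that glmnet's own parameterization scales the ridge part by 1/2); a sketch with the arbitrary choice alpha = 0.5:

enet_fit <- cv.glmnet(x = data.matrix(prostate[, 1:8]), y = prostate$lpsa, alpha = 0.5)
coef(enet_fit, s = "lambda.1se")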

Other Topics

  • Algorithms
    • Shooting algorithm (Fu 1998)
    • Least angle regression (LAR) (Efron et al. 2004)
    • Coordinate descent (Friedman et al. 2010), used in glmnet; a minimal sketch appears at the end of this section
  • Variants
    • Adaptive Lasso
    • Group Lasso
    • Bayesian Lasso, etc.
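
As a small illustration of the coordinate descent idea, here is a minimal sketch for the Lasso objective used earlier; it assumes the columns of X are standardized so that \(\frac{1}{n}\sum_i x_{ij}^2 = 1\), and it follows the slides' normalization (threshold at \(\lambda/2\)) rather than glmnet's implementation.

lasso_cd <- function(X, y, lambda, n_iter = 100) {
  n <- nrow(X); p <- ncol(X)
  beta <- rep(0, p)
  r <- y                                   # current residual y - X %*% beta
  for (iter in 1:n_iter) {
    for (j in 1:p) {
      # OLS coefficient of the partial residual on variable j (uses (1/n) sum x_ij^2 = 1)
      bj <- sum(X[, j] * (r + X[, j] * beta[j])) / n
      new_bj <- sign(bj) * max(abs(bj) - lambda / 2, 0)   # soft-thresholding update
      r <- r - X[, j] * (new_bj - beta[j])                # update residual
      beta[j] <- new_bj
    }
  }
  beta
}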