MSSC 6250 Statistical Machine Learning
Linear models can describe non-linear relationships.
Feature engineering: When \(y\) and \(x\) are not linearly associated, we transform \(x\) (sometimes also \(y\)) so that \(y\) and the transformed \(x\) become linearly related.
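As a toy illustration (the simulated data and variable names below are my own, not from the lecture), a power-law relationship between \(y\) and \(x\) becomes linear after log-transforming both variables:

```r
# A minimal sketch, assuming the power-law relationship y = 2 * x^1.5 * error.
set.seed(6250)
x <- runif(100, 1, 10)
y <- 2 * x^1.5 * exp(rnorm(100, sd = 0.1))

fit_raw <- lm(y ~ x)            # linear in x: misspecified
fit_log <- lm(log(y) ~ log(x))  # linear after transforming both y and x
```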
Is polynomial regression a basis function approach?
Important
\([b_1(x), b_2(x), \dots , b_K(x)]\) are fixed and known.
Polynomial regression imposes a global structure on the non-linear function.
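For instance, a degree-3 polynomial fit in R is just least squares on the basis \(b_j(x) = x^j\). A minimal sketch with simulated data (variable names are illustrative):

```r
# Polynomial regression as a basis function approach: b_j(x) = x^j, j = 1, 2, 3.
set.seed(6250)
x <- runif(200, -2, 2)
y <- sin(2 * x) + rnorm(200, sd = 0.3)

# raw = TRUE uses the monomial basis x, x^2, x^3; the default orthogonal
# polynomial basis gives the same fitted values.
fit_poly <- lm(y ~ poly(x, degree = 3, raw = TRUE))
```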
Piecewise regression allows structural changes in different parts of the range of \(x\).
\[y=\begin{cases} \beta_{01} + \beta_{11}x+ \beta_{21}x^2 +\epsilon & \quad \text{if } x < \xi\\ \beta_{02} + \beta_{12}x+ \beta_{22}x^2+\beta_{32}x^3+\epsilon & \quad \text{if } x \ge \xi \end{cases}\]
The points where the pieces join are called knots.
Using more knots leads to a more flexible piecewise polynomial.
With \(K\) different knots, how many different polynomials do we have?
\[y=\begin{cases} \beta_{01} + \beta_{11}x+ \epsilon & \text{if } x < 1936\\ \beta_{02} + \beta_{12}x+\epsilon & \text{if } 1936 \le x < 1960 \\ \beta_{03} + \beta_{13}x+\epsilon & \text{if } 1960 \le x < 1978 \\ \beta_{04} + \beta_{14}x+\epsilon & \text{if } 1978 \le x \end{cases}\]
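One way to fit such a piecewise model in R is to interact \(x\) with the region it falls in; a sketch with simulated data (only the knots 1936, 1960, 1978 come from the example above):

```r
# Piecewise degree-1 regression: a separate intercept and slope per region,
# with no continuity imposed at the knots.
set.seed(6250)
x <- runif(300, 1900, 2000)
y <- ifelse(x < 1950, 0.05 * x, -0.03 * x + 160) + rnorm(300, sd = 1)

region <- cut(x, breaks = c(-Inf, 1936, 1960, 1978, Inf))
fit_piecewise <- lm(y ~ region * x)
```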
Any issues with piecewise polynomials?
Piecewise polynomials + Continuity/Differentiability at knots
Splines of degree \(d\) are smooth piecewise polynomials of degree \(d\) with continuity in derivatives (smoothness) up to degree \(d-1\) at each knot.
How do we turn the piecewise regression of degree 1 into a regression spline?
\[y=\begin{cases} \beta_{01} + \beta_{11}x+ \epsilon & \quad \text{if } x < 1936\\ \beta_{02} + \beta_{12}x+\epsilon & \quad \text{if } 1936 \le x < 1960 \\ \beta_{03} + \beta_{13}x+\epsilon & \quad \text{if } 1960 \le x < 1978 \\ \beta_{04} + \beta_{14}x+\epsilon & \quad \text{if } 1978 \le x \end{cases}\]
\[\begin{align} \beta_{01} + \beta_{11}\cdot 1936 &= \beta_{02} + \beta_{12}\cdot 1936\\ \beta_{02} + \beta_{12}\cdot 1960 &= \beta_{03} + \beta_{13}\cdot 1960\\ \beta_{03} + \beta_{13}\cdot 1978 &= \beta_{04} + \beta_{14}\cdot 1978 \end{align}\]
splines::bs()
The function is continuous everywhere, including at the knots \(\xi_1, \xi_2,\) and \(\xi_3\), i.e., \(f_{k}(\xi_k^-) = f_{k+1}(\xi_k^+)\).
Linear everywhere except at \(\xi_1, \xi_2,\) and \(\xi_3\).
Has a different slope for each region.
For linear splines with 3 knots,
\[\begin{align} b_1(x) &= x\\ b_2(x) &= (x - \xi_1)_+\\ b_3(x) &= (x - \xi_2)_+\\ b_4(x) &= (x - \xi_3)_+ \end{align}\] where
\[(x - \xi_k)_+=\begin{cases} x - \xi_k & \quad \text{if }x > \xi_k\\ 0 & \quad \text{otherwise} \end{cases}\]
\[\begin{align} y &= \beta_0 + \beta_1 b_1(x) + \beta_2 b_2(x) + \beta_3 b_3(x) + \beta_4 b_4(x) + \epsilon\\ &=\beta_0 + \beta_1 x + \beta_2 (x - \xi_1)_+ + \beta_3 (x - \xi_2)_+ + \beta_4 (x - \xi_3)_+ + \epsilon \end{align}\]
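This model can be fit directly with lm() after constructing the truncated power basis by hand; a minimal sketch with simulated data and illustrative knots:

```r
# Linear spline via the truncated power basis: x, (x - xi_1)_+, (x - xi_2)_+, (x - xi_3)_+.
set.seed(6250)
x <- runif(300, 0, 10)
y <- pmin(x, 4) + 0.5 * pmax(x - 7, 0) + rnorm(300, sd = 0.3)

pos <- function(u) pmax(u, 0)   # (u)_+ = u if u > 0, and 0 otherwise
xi  <- c(2.5, 5, 7.5)           # three illustrative knots

fit_ls <- lm(y ~ x + pos(x - xi[1]) + pos(x - xi[2]) + pos(x - xi[3]))
```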
With degree \(d\) and \(K\) knots, the regression spline with first \(d-1\) derivatives being continuous at the knots can be represented as
\[\begin{align} y &= \beta_0 + \beta_1 b_1(x) + \beta_2 b_2(x) + \dots + \beta_d b_d(x) + \beta_{d+1}b_{d+1}(x) + \dots + \beta_{d+K}b_{d+K}(x) + \epsilon \end{align}\] where
\(b_j(x) = x^j, j = 1, 2, \dots, d\)
\(b_{d+k}(x) = (x - \xi_k)^{d}_+, k = 1, \dots, K\) where \[(x - \xi_k)^d_+=\begin{cases} (x - \xi_k)^d & \quad \text{if }x > \xi_k\\ 0 & \quad \text{otherwise} \end{cases}\]
Can you write the model with \(d = 2\) and \(K = 3\)?
For example, the cubic spline (\(d = 3\)) with \(K\) knots is
\[\begin{align} y &= \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \sum_{k = 1}^K \beta_{k+3} (x - \xi_k)_+^3 + \epsilon \end{align}\]
In regression, the (effective) degrees of freedom (df) can be viewed as the number of regression coefficients that are free to vary.
It measures the model's complexity/flexibility: the larger the df, the more flexible the model.
\[y=\begin{cases} \beta_{01} + \beta_{11}x+ \epsilon & \quad \text{if } x < 1936\\ \beta_{02} + \beta_{12}x+\epsilon & \quad \text{if } 1936 \le x < 1960 \\ \beta_{03} + \beta_{13}x+\epsilon & \quad \text{if } 1960 \le x < 1978 \\ \beta_{04} + \beta_{14}x+\epsilon & \quad \text{if } 1978 \le x \end{cases}\]
\[\begin{align} \beta_{01} + \beta_{11}\cdot 1936 &= \beta_{02} + \beta_{12}\cdot 1936\\ \beta_{02} + \beta_{12}\cdot 1960 &= \beta_{03} + \beta_{13}\cdot 1960\\ \beta_{03} + \beta_{13}\cdot 1978 &= \beta_{04} + \beta_{14}\cdot 1978 \end{align}\]
\[\text{df} = K + d + 1\]
So what are the degrees of freedom of a cubic spline?
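One way to see the answer is to count the columns of the basis returned by splines::bs(); a sketch with illustrative knots:

```r
# A cubic spline basis with K = 3 knots and degree d = 3.
library(splines)
x <- seq(0, 10, length.out = 200)
B <- bs(x, knots = c(2.5, 5, 7.5), degree = 3)
ncol(B)       # K + d = 6 basis functions (intercept excluded by default)
ncol(B) + 1   # adding the intercept gives df = K + d + 1 = 7
```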
Natural splines generally produce more stable estimates at the boundaries.
Assuming linearity near the boundary is reasonable since there is less information available there.
A Natural Cubic Spline (NCS) forces the fit to be linear at the boundary (before the first knot and beyond the last knot), i.e., the 2nd and 3rd derivatives are zero there.
These constraints free up 4 degrees of freedom.
The df of NCS is \(K\).
bs() (cubic spline) vs. ns() (natural cubic spline)
Our goal is inference or curve fitting rather than out-of-range prediction or forecasting.
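A sketch of how both fits are obtained with lm() (the simulated data and knots are illustrative):

```r
# Cubic regression spline (bs) vs. natural cubic spline (ns).
library(splines)
set.seed(6250)
x <- runif(300, 0, 10)
y <- sin(x) + rnorm(300, sd = 0.3)

fit_bs <- lm(y ~ bs(x, knots = c(2.5, 5, 7.5)))  # cubic spline
fit_ns <- lm(y ~ ns(x, knots = c(2.5, 5, 7.5)))  # natural cubic spline:
                                                 # linear beyond the boundary knots
```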
Is there a method that selects the number and location of knots automatically?
Consider this criterion for fitting a smooth function \(g(x)\) to some data:
\[\begin{equation} \min_{g} \sum_{i=1}^n \left( y_i - g(x_i) \right)^2 + \lambda \int g''(t)^2 \, dt \end{equation}\]
The first term is \(SS_{res}\); it tries to make \(g(x)\) match the data at each \(x_i\). (Goodness-of-fit)
The second term with \(\lambda \ge 0\) is a roughness penalty and controls how wiggly \(g(x)\) is.
Loss + Penalty
\[\begin{equation} \min_{g} \sum_{i=1}^n \left( y_i - g(x_i) \right)^2 + \lambda \int g''(t)^2 \, dt \end{equation}\]
How does \(\lambda\) affect the shape of \(g(x)\)?
The smaller \(\lambda\) is, the more wiggly \(g(x)\) is, eventually interpolating \(y_i\) when \(\lambda = 0\).
As \(\lambda \rightarrow \infty\), \(g(x)\) becomes linear.
The function \(g\) that minimizes the objective is known as a smoothing spline.
Note
Linear regression
\(\hat{\mathbf{y}} = \bf H \mathbf{y}\) where \(\bf H\) is the hat matrix
The degrees of freedom \((\text{# coef})\) is \[p = \sum_{i=1}^n {\bf H}_{ii}\]
PRESS for model selection
\[\begin{align} PRESS &= \sum_{i = 1}^n \left( y_i - \hat{y}_i^{(-i)}\right)^2 \\ &= \sum_{i = 1}^n \left[ \frac{y_i - \hat{y}_i}{1 - {\bf H}_{ii}}\right]^2\end{align}\]
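Both quantities are easy to compute from a fitted lm object; a minimal sketch with simulated data:

```r
# Degrees of freedom and PRESS from the diagonal of the hat matrix H.
set.seed(6250)
x <- runif(100)
y <- 1 + 2 * x + rnorm(100, sd = 0.2)
fit <- lm(y ~ x)

h <- hatvalues(fit)                              # H_ii
sum(h)                                           # df = p (2 here: intercept + slope)
press <- sum(((y - fitted(fit)) / (1 - h))^2)    # PRESS via leave-one-out residuals
```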
Smoothing splines
\(\hat{\mathbf{g}}_{\lambda} = {\bf S}_{\lambda} \mathbf{y}\), where \({\bf S}_{\lambda}\) is the smoother matrix.
The effective degrees of freedom \((\text{# coef})\) is \[df_{\lambda} = \sum_{i=1}^n \{{\bf S}_{\lambda}\}_{ii} \in [2, n]\]
LOOCV for choosing \(\lambda\)
\[\begin{align} (SS_{res})_{CV}(\lambda) &= \sum_{i = 1}^n \left( y_i - \hat{g}_{\lambda}^{(-i)}(x_i)\right)^2 \\ &= \sum_{i = 1}^n \left[ \frac{y_i - \hat{g}_{\lambda}(x_i)}{1 - \{{\bf S}_{\lambda}\}_{ii}}\right]^2 \end{align}\]
smooth.spline(): cv = TRUE uses LOOCV; cv = FALSE uses GCV.
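A minimal sketch with simulated data:

```r
# Smoothing spline with lambda chosen by LOOCV (cv = TRUE).
set.seed(6250)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)

fit_ss <- smooth.spline(x, y, cv = TRUE)
fit_ss$df       # effective degrees of freedom, sum_i {S_lambda}_ii
fit_ss$lambda   # selected smoothing parameter
```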
All methods discussed so far are extensions of simple linear regression.
How do we flexibly predict \(y\), or estimate a nonlinear regression function, based on several predictors \(x_1, \dots, x_p\)?
Maintaining the additive structure of linear models, but allowing nonlinear functions of each of the variables.
\[y = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_p(x_p) + \epsilon\]
Wage data in ISLR
Rows: 3,000
Columns: 11
$ year <int> 2006, 2004, 2003, 2003, 2005, 2008, 2009, 2008, 2006, 2004,…
$ age <int> 18, 24, 45, 43, 50, 54, 44, 30, 41, 52, 45, 34, 35, 39, 54,…
$ maritl <fct> 1. Never Married, 1. Never Married, 2. Married, 2. Married,…
$ race <fct> 1. White, 1. White, 1. White, 3. Asian, 1. White, 1. White,…
$ education <fct> 1. < HS Grad, 4. College Grad, 3. Some College, 4. College …
$ region <fct> 2. Middle Atlantic, 2. Middle Atlantic, 2. Middle Atlantic,…
$ jobclass <fct> 1. Industrial, 2. Information, 1. Industrial, 2. Informatio…
$ health <fct> 1. <=Good, 2. >=Very Good, 1. <=Good, 2. >=Very Good, 1. <=…
$ health_ins <fct> 2. No, 2. No, 1. Yes, 1. Yes, 1. Yes, 1. Yes, 1. Yes, 1. Ye…
$ logwage <dbl> 4.32, 4.26, 4.88, 5.04, 4.32, 4.85, 5.13, 4.72, 4.78, 4.86,…
$ wage <dbl> 75.0, 70.5, 131.0, 154.7, 75.0, 127.1, 169.5, 111.7, 118.9,…
gam::gam() with s() for smoothing splines
Coefficients not that interesting; fitted functions are.
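A sketch of such a fit on the Wage data, following the additive model above; the df values for the s() terms are illustrative choices, not prescribed by the lecture:

```r
# Additive model with smoothing-spline terms for year and age,
# and a factor term for education.
library(gam)
library(ISLR)

fit_gam <- gam(wage ~ s(year, df = 4) + s(age, df = 5) + education,
               data = Wage)

# Plot the fitted functions (the interesting part), with standard error bands.
par(mfrow = c(1, 3))
plot(fit_gam, se = TRUE, col = "blue")
```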