MSSC 6250 Statistical Machine Learning
Linear models can describe non-linear relationships.
Feature engineering: When \(y\) and \(x\) are not linearly associated, we transform \(x\) (sometimes also \(y\)) so that \(y\) and the transformed \(x\) become linearly related.
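As a toy illustration (the simulated data and variable names below are my own, not from the lecture), a power-law relationship between \(y\) and \(x\) becomes linear after log-transforming both variables:

```r
# A minimal sketch, assuming the power-law relationship y = 2 * x^1.5 * error.
set.seed(6250)
x <- runif(100, 1, 10)
y <- 2 * x^1.5 * exp(rnorm(100, sd = 0.1))

fit_raw <- lm(y ~ x)            # linear in x: misspecified
fit_log <- lm(log(y) ~ log(x))  # linear after transforming both y and x
```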
Is polynomial regression a basis function approach?
Important
\([b_1(x), b_2(x), \dots , b_K(x)]\) are fixed and known.
Polynomial regression imposes a global structure on the non-linear function.
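For instance, a degree-3 polynomial fit in R is just least squares on the basis \(b_j(x) = x^j\). A minimal sketch with simulated data (variable names are illustrative):

```r
# Polynomial regression as a basis function approach: b_j(x) = x^j, j = 1, 2, 3.
set.seed(6250)
x <- runif(200, -2, 2)
y <- sin(2 * x) + rnorm(200, sd = 0.3)

# raw = TRUE uses the monomial basis x, x^2, x^3; the default orthogonal
# polynomial basis gives the same fitted values.
fit_poly <- lm(y ~ poly(x, degree = 3, raw = TRUE))
```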
Piecewise regression allows structural changes in different parts of the range of \(x\).
\[y=\begin{cases} \beta_{01} + \beta_{11}x+ \beta_{21}x^2 +\epsilon & \quad \text{if } x < \xi\\ \beta_{02} + \beta_{12}x+ \beta_{22}x^2+\beta_{32}x^3+\epsilon & \quad \text{if } x \ge \xi \end{cases}\]
The points where the pieces join are called knots.
Using more knots leads to a more flexible piecewise polynomial.
With \(K\) different knots, how many different polynomials do we have?
\[y=\begin{cases} \beta_{01} + \beta_{11}x+ \epsilon & \text{if } x < 1936\\ \beta_{02} + \beta_{12}x+\epsilon & \text{if } 1936 \le x < 1960 \\ \beta_{03} + \beta_{13}x+\epsilon & \text{if } 1960 \le x < 1978 \\ \beta_{04} + \beta_{14}x+\epsilon & \text{if } 1978 \le x \end{cases}\]
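One way to fit such a piecewise model in R is to interact \(x\) with the region it falls in; a sketch with simulated data (only the knots 1936, 1960, 1978 come from the example above):

```r
# Piecewise degree-1 regression: a separate intercept and slope per region,
# with no continuity imposed at the knots.
set.seed(6250)
x <- runif(300, 1900, 2000)
y <- ifelse(x < 1950, 0.05 * x, -0.03 * x + 160) + rnorm(300, sd = 1)

region <- cut(x, breaks = c(-Inf, 1936, 1960, 1978, Inf))
fit_piecewise <- lm(y ~ region * x)
```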
Any issues with piecewise polynomials?
Piecewise polynomials + Continuity/Differentiability at knots
Splines of degree \(d\) are smooth piecewise polynomials of degree \(d\) with continuity in derivatives (smoothness) up to degree \(d-1\) at each knot.
How do we turn the piecewise regression of degree 1 into a regression spline?
\[y=\begin{cases} \beta_{01} + \beta_{11}x+ \epsilon & \quad \text{if } x < 1936\\ \beta_{02} + \beta_{12}x+\epsilon & \quad \text{if } 1936 \le x < 1960 \\ \beta_{03} + \beta_{13}x+\epsilon & \quad \text{if } 1960 \le x < 1978 \\ \beta_{04} + \beta_{14}x+\epsilon & \quad \text{if } 1978 \le x \end{cases}\]
\[\begin{align} \beta_{01} + \beta_{11}\cdot 1936 &= \beta_{02} + \beta_{12}\cdot 1936\\ \beta_{02} + \beta_{12}\cdot 1960 &= \beta_{03} + \beta_{13}\cdot 1960\\ \beta_{03} + \beta_{13}\cdot 1978 &= \beta_{04} + \beta_{14}\cdot 1978 \end{align}\]
splines::bs()
The function is continuous everywhere, including at the knots \(\xi_1, \xi_2,\) and \(\xi_3\), i.e., \(f_{k}(\xi_k^-) = f_{k+1}(\xi_k^+)\).
Linear everywhere except at \(\xi_1, \xi_2,\) and \(\xi_3\).
Has a different slope for each region.
For linear splines with 3 knots,
\[\begin{align} b_1(x) &= x\\ b_2(x) &= (x - \xi_1)_+\\ b_3(x) &= (x - \xi_2)_+\\ b_4(x) &= (x - \xi_3)_+ \end{align}\] where
\[(x - \xi_k)_+=\begin{cases} x - \xi_k & \quad \text{if }x > \xi_k\\ 0 & \quad \text{otherwise} \end{cases}\]
\[\begin{align} y &= \beta_0 + \beta_1 b_1(x) + \beta_2 b_2(x) + \beta_3 b_3(x) + \beta_4 b_4(x) + \epsilon\\ &=\beta_0 + \beta_1 x + \beta_2 (x - \xi_1)_+ + \beta_3 (x - \xi_2)_+ + \beta_4 (x - \xi_3)_+ + \epsilon \end{align}\]
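This model can be fit directly with lm() after constructing the truncated power basis by hand; a minimal sketch with simulated data and illustrative knots:

```r
# Linear spline via the truncated power basis: x, (x - xi_1)_+, (x - xi_2)_+, (x - xi_3)_+.
set.seed(6250)
x <- runif(300, 0, 10)
y <- pmin(x, 4) + 0.5 * pmax(x - 7, 0) + rnorm(300, sd = 0.3)

pos <- function(u) pmax(u, 0)   # (u)_+ = u if u > 0, and 0 otherwise
xi  <- c(2.5, 5, 7.5)           # three illustrative knots

fit_ls <- lm(y ~ x + pos(x - xi[1]) + pos(x - xi[2]) + pos(x - xi[3]))
```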
With degree \(d\) and \(K\) knots, the regression spline with first \(d-1\) derivatives being continuous at the knots can be represented as
\[\begin{align} y &= \beta_0 + \beta_1 b_1(x) + \beta_2 b_2(x) + \dots + \beta_d b_d(x) + \beta_{d+1}b_{d+1}(x) + \dots + \beta_{d+K}b_{d+K}(x) + \epsilon \end{align}\] where
\(b_j(x) = x^j, j = 1, 2, \dots, d\)
\(b_{d+k}(x) = (x - \xi_k)^{d}_+, k = 1, \dots, K\) where \[(x - \xi_k)^d_+=\begin{cases} (x - \xi_k)^d & \quad \text{if }x > \xi_k\\ 0 & \quad \text{otherwise} \end{cases}\]
Can you write the model with \(d = 2\) and \(K = 3\)?
For example, the cubic spline (\(d = 3\)) with \(K\) knots is
\[\begin{align} y &= \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \sum_{k = 1}^K \beta_{k+3} (x - \xi_k)_+^3 + \epsilon \end{align}\]
In regression, the (effective) degrees of freedom (df) can be viewed as the number of regression coefficients that are free to vary.
It measures the model's complexity/flexibility: the larger the df, the more flexible the model.
\[y=\begin{cases} \beta_{01} + \beta_{11}x+ \epsilon & \quad \text{if } x < 1936\\ \beta_{02} + \beta_{12}x+\epsilon & \quad \text{if } 1936 \le x < 1960 \\ \beta_{03} + \beta_{13}x+\epsilon & \quad \text{if } 1960 \le x < 1978 \\ \beta_{04} + \beta_{14}x+\epsilon & \quad \text{if } 1978 \le x \end{cases}\]
\[\begin{align} \beta_{01} + \beta_{11}\cdot 1936 &= \beta_{02} + \beta_{12}\cdot 1936\\ \beta_{02} + \beta_{12}\cdot 1960 &= \beta_{03} + \beta_{13}\cdot 1960\\ \beta_{03} + \beta_{13}\cdot 1978 &= \beta_{04} + \beta_{14}\cdot 1978 \end{align}\]
\[\text{df} = K + d + 1\]
So what are the degrees of freedom of a cubic spline?
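One way to see the answer is to count the columns of the basis returned by splines::bs(); a sketch with illustrative knots:

```r
# A cubic spline basis with K = 3 knots and degree d = 3.
library(splines)
x <- seq(0, 10, length.out = 200)
B <- bs(x, knots = c(2.5, 5, 7.5), degree = 3)
ncol(B)       # K + d = 6 basis functions (intercept excluded by default)
ncol(B) + 1   # adding the intercept gives df = K + d + 1 = 7
```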
Natural splines generally produce more stable estimates at the boundaries.
Assuming linearity near the boundary is reasonable since there is less information available there.
A Natural Cubic Spline (NCS) forces the fit to be linear at the boundary (before the first knot and beyond the last knot), i.e., the 2nd and 3rd derivatives are zero there.
These constraints free up 4 degrees of freedom.
The df of NCS is \(K\).
bs() (cubic spline) vs. ns() (natural cubic spline)
Our goal is inference or curve fitting rather than out-of-range prediction or forecasting.
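A sketch of how both fits are obtained with lm() (the simulated data and knots are illustrative):

```r
# Cubic regression spline (bs) vs. natural cubic spline (ns).
library(splines)
set.seed(6250)
x <- runif(300, 0, 10)
y <- sin(x) + rnorm(300, sd = 0.3)

fit_bs <- lm(y ~ bs(x, knots = c(2.5, 5, 7.5)))  # cubic spline
fit_ns <- lm(y ~ ns(x, knots = c(2.5, 5, 7.5)))  # natural cubic spline:
                                                 # linear beyond the boundary knots
```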
Is there a method that selects the number and location of knots automatically?
Consider this criterion for fitting a smooth function \(g(x)\) to some data:
\[\begin{equation} \min_{g} \sum_{i=1}^n \left( y_i - g(x_i) \right)^2 + \lambda \int g''(t)^2 \, dt \end{equation}\]
The first term is \(SS_{res}\); it tries to make \(g(x)\) match the data at each \(x_i\). (Goodness-of-fit)
The second term with \(\lambda \ge 0\) is a roughness penalty and controls how wiggly \(g(x)\) is.
Loss + Penalty
\[\begin{equation} \min_{g} \sum_{i=1}^n \left( y_i - g(x_i) \right)^2 + \lambda \int g''(t)^2 \, dt \end{equation}\]
How does \(\lambda\) affect the shape of \(g(x)\)?
The smaller \(\lambda\) is, the more wiggly \(g(x)\) is, eventually interpolating \(y_i\) when \(\lambda = 0\).
As \(\lambda \rightarrow \infty\), \(g(x)\) becomes linear.
The function \(g\) that minimizes the objective is known as a smoothing spline.
Note
Linear regression
\(\hat{\mathbf{y}} = \bf H \mathbf{y}\) where \(\bf H\) is the hat matrix
The degrees of freedom \((\text{# coef})\) is \[p = \sum_{i=1}^n {\bf H}_{ii}\]
PRESS for model selection
\[\begin{align} PRESS &= \sum_{i = 1}^n \left( y_i - \hat{y}_i^{(-i)}\right)^2 \\ &= \sum_{i = 1}^n \left[ \frac{y_i - \hat{y}_i}{1 - {\bf H}_{ii}}\right]^2\end{align}\]
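Both quantities are easy to compute from a fitted lm object; a minimal sketch with simulated data:

```r
# Degrees of freedom and PRESS from the diagonal of the hat matrix H.
set.seed(6250)
x <- runif(100)
y <- 1 + 2 * x + rnorm(100, sd = 0.2)
fit <- lm(y ~ x)

h <- hatvalues(fit)                              # H_ii
sum(h)                                           # df = p (2 here: intercept + slope)
press <- sum(((y - fitted(fit)) / (1 - h))^2)    # PRESS via leave-one-out residuals
```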
Smoothing splines
\(\hat{\mathbf{g}}_{\lambda} = {\bf S}_{\lambda} \mathbf{y}\), where \({\bf S}_{\lambda}\) is the smoother matrix.
The effective degrees of freedom \((\text{# coef})\) is \[df_{\lambda} = \sum_{i=1}^n \{{\bf S}_{\lambda}\}_{ii} \in [2, n]\]
LOOCV for choosing \(\lambda\)
\[\begin{align} (SS_{res})_{CV}(\lambda) &= \sum_{i = 1}^n \left( y_i - \hat{g}_{\lambda}^{(-i)}(x_i)\right)^2 \\ &= \sum_{i = 1}^n \left[ \frac{y_i - \hat{g}_{\lambda}(x_i)}{1 - \{{\bf S}_{\lambda}\}_{ii}}\right]^2 \end{align}\]
smooth.spline(): cv = TRUE uses LOOCV; cv = FALSE uses GCV.
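A minimal sketch with simulated data:

```r
# Smoothing spline with lambda chosen by LOOCV (cv = TRUE).
set.seed(6250)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)

fit_ss <- smooth.spline(x, y, cv = TRUE)
fit_ss$df       # effective degrees of freedom, sum_i {S_lambda}_ii
fit_ss$lambda   # selected smoothing parameter
```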
All methods discussed so far are extensions of simple linear regression.
How do we flexibly predict \(y\), or estimate a nonlinear regression function, based on several predictors \(x_1, \dots, x_p\)?
Maintaining the additive structure of linear models, but allowing nonlinear functions of each of the variables.
\[y = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_p(x_p) + \epsilon\]
Wage data in ISLR
Rows: 3,000
Columns: 11
$ year <int> 2006, 2004, 2003, 2003, 2005, 2008, 2009, 2008, 2006, 2004,…
$ age <int> 18, 24, 45, 43, 50, 54, 44, 30, 41, 52, 45, 34, 35, 39, 54,…
$ maritl <fct> 1. Never Married, 1. Never Married, 2. Married, 2. Married,…
$ race <fct> 1. White, 1. White, 1. White, 3. Asian, 1. White, 1. White,…
$ education <fct> 1. < HS Grad, 4. College Grad, 3. Some College, 4. College …
$ region <fct> 2. Middle Atlantic, 2. Middle Atlantic, 2. Middle Atlantic,…
$ jobclass <fct> 1. Industrial, 2. Information, 1. Industrial, 2. Informatio…
$ health <fct> 1. <=Good, 2. >=Very Good, 1. <=Good, 2. >=Very Good, 1. <=…
$ health_ins <fct> 2. No, 2. No, 1. Yes, 1. Yes, 1. Yes, 1. Yes, 1. Yes, 1. Ye…
$ logwage <dbl> 4.32, 4.26, 4.88, 5.04, 4.32, 4.85, 5.13, 4.72, 4.78, 4.86,…
$ wage <dbl> 75.0, 70.5, 131.0, 154.7, 75.0, 127.1, 169.5, 111.7, 118.9,…
gam::gam() with s() for smoothing splines
Coefficients not that interesting; fitted functions are.
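A sketch of such a fit on the Wage data, following the additive model above; the df values for the s() terms are illustrative choices, not prescribed by the lecture:

```r
# Additive model with smoothing-spline terms for year and age,
# and a factor term for education.
library(gam)
library(ISLR)

fit_gam <- gam(wage ~ s(year, df = 4) + s(age, df = 5) + education,
               data = Wage)

# Plot the fitted functions (the interesting part), with standard error bands.
par(mfrow = c(1, 3))
plot(fit_gam, se = TRUE, col = "blue")
```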