Dimension Reduction

MSSC 6250 Statistical Machine Learning

Dr. Cheng-Han Yu
Department of Mathematical and Statistical Sciences
Marquette University

Unsupervised Learning

Unsupervised Learning

  • Supervised Learning: response \(Y\) and features \(X_1, X_2, \dots, X_p\) measured on \(n\) observations.

  • Unsupervised Learning: only features \(X_1, X_2, \dots, X_p\) measured on \(n\) observations.

    • Not interested in prediction (no response to be predicted)
    • Discover any interesting pattern or relationships among these features.
  • Estimate the density, covariance, graph (network), etc. of \(\mathbf{X}\)
  • Dimension reduction for effective data visualization or for extracting the most important information the features contain.
    • plot points \(\boldsymbol{x} = (x_1, x_2, \dots, x_p)\) in a 2-D scatter plot (a low-dimensional manifold), reducing the dimension from \(p\) to 2
    • use 2 variables to explain most of the variation, or to represent the regions of high data density, in the \(p\) variables
  • Clustering discovers unknown subgroups/clusters in data
    • e.g., find 3 subgroups of people based on income, occupation, age, etc.

Background: Dimensions

One-Dimension (1D) Number line

# A tibble: 50 × 1
   English
     <int>
 1      41
 2      65
 3      55
 4      94
 5      66
 6      85
 7      44
 8      44
 9      67
10      73
# ℹ 40 more rows

One-Dimension (1D) Number line: Uniform students

# A tibble: 50 × 1
   English
     <int>
 1      41
 2      65
 3      55
 4      94
 5      66
 6      85
 7      44
 8      44
 9      67
10      73
# ℹ 40 more rows

1D Number line: Non-uniform students

# A tibble: 50 × 1
   English
     <int>
 1      77
 2      78
 3      81
 4      78
 5      52
 6      62
 7      47
 8      58
 9      43
10      59
# ℹ 40 more rows

Two-Dimensions (2D) X-Y Scatter plot: Highly Correlated

English and Math both measure overall academic performance.

# A tibble: 50 × 2
   English  Math
     <int> <dbl>
 1      41  33.2
 2      65  63.6
 3      55  44.6
 4      94  95  
 5      66  65.6
 6      85  73.1
 7      44  46.6
 8      44  51  
 9      67  69.4
10      73  66.5
# ℹ 40 more rows

Two-Dimensions (2D) X-Y Scatter plot: Uncorrelated

English and Math measure different abilities.

# A tibble: 50 × 2
   English  Math
     <int> <int>
 1      41    27
 2      65    47
 3      55    32
 4      94    44
 5      66    20
 6      85    30
 7      44    16
 8      44    72
 9      67    86
10      73    82
# ℹ 40 more rows

Three-Dimensions (3D) X-Y-Z Scatter plot

# A tibble: 50 × 3
   English  Math Biology
     <int> <dbl>   <dbl>
 1      41  33.2    39.2
 2      65  63.6    61.6
 3      55  44.6    41.6
 4      94  95      92  
 5      66  65.6    73.6
 6      85  73.1    71.1
 7      44  46.6    56.6
 8      44  51      56  
 9      67  69.4    79.4
10      73  66.5    59.5
# ℹ 40 more rows

Four-Dimension (4D) X-Y-Z-? Scatter plot

# A tibble: 50 × 4
   English  Math Biology History
     <int> <dbl>   <dbl>   <dbl>
 1      41  33.2    39.2      51
 2      65  63.6    61.6      53
 3      55  44.6    41.6      63
 4      94  95      92        83
 5      66  65.6    73.6      51
 6      85  73.1    71.1      74
 7      44  46.6    56.6      34
 8      44  51      56        33
 9      67  69.4    79.4      76
10      73  66.5    59.5      74
# ℹ 40 more rows

How about Pair Plots?

Tooooo Many Pair Plots!

  • If we have \(p\) variables, there are \({p \choose 2} = p(p-1)/2\) pairs.
  • If \(p = 10\), we have 45 such scatter plots to look at!
  • In practice, we may encounter over 100 variables!!

Dimension Reduction

  • One variable represents one dimension.

  • With many variables in the data, we live in a high dimensional world.

GOAL:

  • Find a low-dimensional (usually 2D) representation of the data that captures as much as possible of the information those variables provide.

  • Use two created variables to represent all \(p\) variables, and make a scatter plot of the two created variables to learn what our observations look like in the high-dimensional space.

Why and when can we omit dimensions?

Variation mostly from One Variable

  • Almost all of the variation in the data is from left to right.

Variation mostly from One Variable

  • If we flattened the data, the graph would not look much different.

Variation mostly from One Variable

  • If we flattened the data, we could graph it with a 1D number line!

Variation mostly from One Variable

  • Both graphs say “the important variation is left to right.”

Principal Component Analysis (PCA)

Idea of PCA

  • PCA is a dimension reduction tool that finds a low-dimensional representation of a data set that contains as much of the variation as possible.

  • Each observation lives in a high-dimensional space (lots of variables), but not all of these dimensions (variables) are equally interesting/important.

  • The concept of interesting/important is measured by the amount that the observations vary along each dimension.

PCA Illustration: 2 Variable Example

# A tibble: 50 × 2
   English  Math
     <int> <dbl>
 1      41  33.2
 2      65  63.6
 3      55  44.6
 4      94  95  
 5      66  65.6
 6      85  73.1
 7      44  46.6
 8      44  51  
 9      67  69.4
10      73  66.5
11      47  58.9
12      66  66.5
13      57  51  
14      57  33.2
15      77  83.6
16      83  78.8
# ℹ 34 more rows

Step 1: Shift (or standardize) the Data

  • So that both variables have mean 0. If the variables are measured in different units, consider standardizing: \(\frac{x_i - \bar{x}}{s_x}\). (See the sketch below.)
  • Shifting does not change how the data points are positioned relative to each other.
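
A minimal sketch of this step in R, assuming the English and Math scores sit in a data frame named scores (a hypothetical name):

## hypothetical data frame `scores` with columns English and Math
scores_centered <- scale(scores, center = TRUE, scale = FALSE)  ## shift only: both means become 0
scores_std <- scale(scores, center = TRUE, scale = TRUE)        ## standardize when units differ
colMeans(scores_std)        ## (numerically) 0 for both variables
apply(scores_std, 2, sd)    ## 1 for both variables after standardizing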

Step 2: Find a Line that Fits the Data the Best

  • Start with a line going through the origin.
  • Rotate the line until it fits the data as well as it can, given that it goes through the origin.

The Meaning of the Best line

  • Principal Component 1 (PC1): maximizes the variance of the projected points.

  • PC1 is the line in the English-Math space that is closest to the \(n\) observations

    • PC1 minimizes the sum of squared distances between the data points and the line.
  • PC1 is the best 1D representation of the 2D data

The Meaning of the Best line

Source: https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues

PC1 and PC2

  • The data points are also spread out a little above and below PC1.
  • There is some variation that is not explained by PC1.
  • Find the second PC, PC2, that
    • explains the remaining variation
    • is the line through the origin and perpendicular to PC1.

Linear Combinations

  • PC1 = 0.68 \(\times\) English \(+\) 0.74 \(\times\) Math
  • PC2 = 0.74 \(\times\) English \(-\) 0.68 \(\times\) Math
  • PC1 is like an overall intelligence index as it is a weighted average combining verbal and quantitative abilities.
  • PC2 accounts for individual difference in English and Math scores.
  • \(0.68^2 + 0.74^2 \approx 1\) (Pythagorean theorem): the loading vector has unit length; the weights shown are rounded.
  • The combination weights 0.68, 0.74, etc. are called PC loadings. (See the sketch below.)
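
These loadings can be reproduced (up to sign and rounding) by running prcomp() on the same toy data; a sketch, again assuming the hypothetical scores data frame and that no scaling was applied:

pc_toy <- prcomp(scores, center = TRUE)  ## PCA of the English-Math data
pc_toy$rotation                          ## columns are the PC1 and PC2 loading vectors
colSums(pc_toy$rotation^2)               ## each loading vector has unit length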

Rotate Everything so that PC1 is Horizontal

1D representation

  • PC1 is our 1D number line: it explains the most variation in the 2D data that any single line can.
  • The points on PC1 are the data points projected onto PC1.

2D representation

  • The new coordinates PC1 and PC2 are ordered by the amount of variation in the English and Math scores that they explain

Variation

  • If the variation for PC1 is \(17\) and the variation for PC2 is \(2\), the total variation present in the data is \(17+2=19\).

  • PC1 accounts for \(17/19 = 89\%\) of the total variation, and PC2 accounts for \(2/19 = 11\%\) of the total variation.
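
The arithmetic above, as a quick R check:

pc_var <- c(PC1 = 17, PC2 = 2)   ## variation along each PC
pc_var / sum(pc_var)             ## 0.89 and 0.11 of the total variation 19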

How about 3 or More Variables?

  • PC1 spans the direction of the most variation
  • PC2 spans the direction of the 2nd most variation
  • PC3 spans the direction of the 3rd most variation
  • PC4 spans the direction of the 4th most variation
  • If we have \(n\) observations and \(p\) variables (dimensions), there are at most \(\min(n - 1, p)\) PCs.

US Arrest Data in 1973

dim(USArrests)
[1] 50  4
head(USArrests, 16)
            Murder Assault UrbanPop Rape
Alabama       13.2     236       58   21
Alaska        10.0     263       48   44
Arizona        8.1     294       80   31
Arkansas       8.8     190       50   20
California     9.0     276       91   41
Colorado       7.9     204       78   39
Connecticut    3.3     110       77   11
Delaware       5.9     238       72   16
Florida       15.4     335       80   32
Georgia       17.4     211       60   26
Hawaii         5.3      46       83   20
Idaho          2.6     120       54   14
Illinois      10.4     249       83   24
Indiana        7.2     113       65   21
Iowa           2.2      56       57   11
Kansas         6.0     115       66   18

PC Loading Vectors on USArrests

pca_output <- prcomp(USArrests, scale = TRUE)

## rotation matrix provides PC loadings
(pca_output$rotation <- -pca_output$rotation)
          PC1   PC2   PC3    PC4
Murder   0.54  0.42 -0.34 -0.649
Assault  0.58  0.19 -0.27  0.743
UrbanPop 0.28 -0.87 -0.38 -0.134
Rape     0.54 -0.17  0.82 -0.089
  • PCs are unique up to a sign flip, so -pca_output$rotation gives us the same PCs as pca_output$rotation does. Flipping the sign reverses the direction along the axis but does not change the axis itself.

\(\text{PC1} = 0.54 \times \text{Murder} + 0.58 \times \text{Assault} + 0.28 \times \text{UrbanPop} + 0.54 \times \text{Rape}\)


\(\text{PC2} = 0.42 \times \text{Murder} + 0.19 \times \text{Assault} - 0.87 \times \text{UrbanPop} - 0.17 \times \text{Rape}\)

  • We have 4 PCs because \(\min(n-1, p) = \min(50-1, 4) = 4\).

PC Scores

  • The rotated data, i.e., the PC scores of each observation on each PC, are stored in pca_output$x
head(pca_output$x <- -pca_output$x, 16) |> round(2)
              PC1   PC2   PC3   PC4
Alabama      0.98  1.12 -0.44 -0.15
Alaska       1.93  1.06  2.02  0.43
Arizona      1.75 -0.74  0.05  0.83
Arkansas    -0.14  1.11  0.11  0.18
California   2.50 -1.53  0.59  0.34
Colorado     1.50 -0.98  1.08  0.00
Connecticut -1.34 -1.08 -0.64  0.12
Delaware     0.05 -0.32 -0.71  0.87
Florida      2.98  0.04 -0.57  0.10
Georgia      1.62  1.27 -0.34 -1.07
Hawaii      -0.90 -1.55  0.05 -0.89
Idaho       -1.62  0.21  0.26  0.49
Illinois     1.37 -0.67 -0.67  0.12
Indiana     -0.50 -0.15  0.23 -0.42
Iowa        -2.23 -0.10  0.16 -0.02
Kansas      -0.79 -0.27  0.03 -0.20

Interpretation of PCs

pca_output$rotation
          PC1   PC2   PC3    PC4
Murder   0.54  0.42 -0.34 -0.649
Assault  0.58  0.19 -0.27  0.743
UrbanPop 0.28 -0.87 -0.38 -0.134
Rape     0.54 -0.17  0.82 -0.089
  • PCs are less interpretable than the original features.
  • The first loading vector places approximately equal weight on Assault, Murder, and Rape, with much less weight on UrbanPop.
  • PC1 roughly corresponds to an overall serious-crime rate.
  • The second loading vector places most of its weight on UrbanPop and much less weight on the other 3 features.
  • PC2 roughly corresponds to the level of urbanization.

2D Representation of the 4D data

            PC1   PC2   PC3   PC4
Wisconsin -2.06 -0.61 -0.14 -0.18
Wyoming   -0.62  0.32 -0.24  0.16
  • Higher value of PC1 means higher crime rates (roughly).

  • Higher value of PC2 means higher level of urbanization (roughly).

2D Representation of the 4D data: biplot

biplot(pca_output, xlabs = state.abb, 
       scale = 0)
  • Top axis: PC1 loadings
  • Right axis: PC2 loadings
  • Red arrows: PC1 and PC2 loading vectors, e.g., (0.28, -0.87) for UrbanPop.
  • Crime-related variables (Assault, Murder, and Rape) are located close to each other.
  • UrbanPop is far from the other three.
  • Assault, Murder, and Rape are more correlated with each other, and UrbanPop is less correlated with the other three.

Proportion of Variance Explained

(pc_var <- pca_output$sdev ^ 2)
[1] 2.48 0.99 0.36 0.17
(pc_var_prop <- pc_var / sum(pc_var))
[1] 0.620 0.247 0.089 0.043
  • PC1 explains \(62\%\) of the variance in the data, and PC2 explains \(24.7\%\) of the variance.
  • PC1 and PC2 together explain about \(87\%\) of the variance, and the last two PCs explain only \(13\%\).
  • A 2D plot therefore provides a fairly accurate summary of the data.

Scree Plot

Look for the point (the elbow) at which the proportion of variance explained by each subsequent PC drops off, as sketched below.
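
One way to draw this scree plot from the proportions computed earlier (pc_var_prop):

plot(pc_var_prop, type = "b", xlab = "Principal Component",
     ylab = "Proportion of Variance Explained", ylim = c(0, 1))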

Mathematics of PCA

Principal Components

  • The \(k\)th PC is

\[Z_k = \phi_{1k}X_1 + \phi_{2k}X_2 + \dots + \phi_{pk}X_p,\] where \(\sum_{j=1}^p\phi_{jk}^2=1\).

  • \((\phi_{1k}, \phi_{2k}, \dots, \phi_{pk})'\) is the PC loading vector.

  • The PC1 loading vector solves

\[\max_{\phi_{11}, \phi_{21}, \dots, \phi_{p1}} \left\{ \frac{1}{n}\sum_{i=1}^n z_{i1}^2\right\} = \left\{ \frac{1}{n}\sum_{i=1}^n \left( \sum_{j=1}^p \phi_{j1} x_{ij}\right)^2\right\} \quad \text{s.t.} \quad \sum_{j=1}^p\phi_{j1}^2 = 1\]

  • Maximize the sample variance of the projected points, or the scores \(z_{11}, z_{21}, \dots, z_{n1}\).

  • The PC loading vector defines a direction in feature space along which the data vary the most.

Principal Components

For \(k\)th PC, \(k > 1\),

\[\max_{\phi_{1k}, \phi_{2k}, \dots, \phi_{pk}} \left\{ \frac{1}{n}\sum_{i=1}^n z_{ik}^2\right\} = \left\{ \frac{1}{n}\sum_{i=1}^n \left( \sum_{j=1}^p \phi_{jk} x_{ij}\right)^2\right\} \quad \text{s.t.} \quad \sum_{j=1}^p\phi_{jk}^2 = 1, \text{ and } {\mathbf{z}_m}'\mathbf{z}_k = 0, \, m = 1, \dots, k-1\] where

\(\mathbf{z}_m = (z_{1m}, z_{2m}, \dots, z_{nm})'\)
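
A quick numerical check of the unit-norm and orthogonality constraints using the USArrests PCA fitted earlier:

colSums(pca_output$rotation ^ 2)    ## each loading vector has squared norm 1
round(crossprod(pca_output$x), 2)   ## Z'Z is diagonal: scores of different PCs are orthogonal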

Low-Rank Approximation

  • PCs provide low-dimensional planes that are closest to the observations.

  • \(x_{ij} \approx \sum_{m=1}^M z_{im}\phi_{jm}\) with equality when \(M = \min(n-1, p)\)

\[(z_{im}, \phi_{jm}) = \mathop{\mathrm{arg\,min}}_{a_{im}, b_{jm}} \left\{ \sum_{j=1}^p\sum_{i=1}^n\left( x_{ij} - \sum_{m=1}^Ma_{im}b_{jm}\right)^2\right\}\]

Source: ISL Fig 12.2
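
A sketch of this approximation for the scaled USArrests data using the PCA output from earlier (the sign flips applied to both the scores and the loadings cancel out):

M <- 2                                   ## rank of the approximation
X_scaled <- scale(USArrests)             ## centered and standardized data
X_approx <- pca_output$x[, 1:M] %*% t(pca_output$rotation[, 1:M])
max(abs(X_scaled - pca_output$x %*% t(pca_output$rotation)))  ## ~0: exact when all PCs are used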

Scaling the Variables

apply(USArrests, 2, var)
  Murder  Assault UrbanPop     Rape 
      19     6945      210       88 
  • If we perform PCA on the unscaled variables, the PC1 loading vector will place almost all of its weight on Assault, the variable with by far the largest variance (see the comparison below).

  • When all the variables are of the same type (measured in the same units), there is no need to scale them.
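
A quick comparison of the PC1 loadings with and without scaling (a sketch):

prcomp(USArrests, scale = FALSE)$rotation[, 1]  ## PC1 almost entirely Assault (largest variance)
prcomp(USArrests, scale = TRUE)$rotation[, 1]   ## weights are much more balanced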

PCA and SVD

PCA is equivalent to the singular value decomposition (SVD) of the (column-centered) data matrix \(\mathbf{X}\)

\[\mathbf{X}_{n\times p} = \mathbf{U}_{n \times n} \mathbf{D}_{n \times p} \mathbf{V}'_{p \times p}\] where \(\mathbf{U}\) and \(\mathbf{V}\) are orthogonal matrices, and \(\mathbf{D}\) is diagonal with diagonal elements the singular values \(d_1 \ge d_2 \ge \cdots \ge d_p \ge 0\).

PC Scores and SVD

  • \({\bf Z}_{n \times p} = [{\bf z}_1 \, {\bf z}_2 \, \cdots \, {\bf z}_p] = \mathbf{X}\mathbf{V}\)

  • The \(j\)th PC is the \(j\)th column of \({\bf Z}\) given by \({\bf z}_j = (z_{1j}, z_{2j}, \dots, z_{nj})' = {\bf Xv}_j\).

  • Project \(\mathbf{X}\) onto the space spanned by \(\mathbf{v}_j\)s

  • \(\mathbf{v}_j\)s are loading vectors.

  • \({\bf Z}_{n \times p} = \mathbf{U}\mathbf{D}\)

  • \({\bf z}_j = d_j\mathbf{u}_j\).

  • \(\mathbf{u}_j\)s are the unit PC vectors, and \(d_j\) controls the variation along the \(\mathbf{u}_j\) direction.
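
These identities can be checked numerically; signs may differ between svd() and prcomp(), so absolute values are compared in this sketch:

sv <- svd(scale(USArrests))                              ## SVD of the centered, scaled data
max(abs(abs(sv$v) - abs(pca_output$rotation)))           ## V matches the loading vectors
max(abs(abs(sv$u %*% diag(sv$d)) - abs(pca_output$x)))   ## UD matches the PC scores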

Low-Rank Approximation and SVD

\({\mathbf{X}} = {\mathbf{U}}{\mathbf{D}}{\mathbf{V}}'\)

\(\mathbf{A}= \mathbf{U}_{n\times M}\mathbf{D}_{M \times M}\) and \(\mathbf{B}' = \mathbf{V}_{M\times p}'\) are the minimizers of

\[ \min_{\mathbf{A}\in \mathbf{R}^{n \times M}, \mathbf{B}\in \mathbf{R}^{p \times M}} \|\mathbf{X}- \mathbf{A}\mathbf{B}' \|_F\] where

  • \(\mathbf{U}_{n\times M}\) is \(\mathbf{U}\) with the first \(M\) columns
  • \(\mathbf{D}_{M \times M}\) is \(\mathbf{D}\) of the first \(M\) rows and columns
  • \(\mathbf{V}_{M\times p}'\) is \(\mathbf{V}'\) with the first \(M\) rows
  • \(x_{ij} = \sum_{m = 1}^p d_mu_{im}v_{jm}\)

  • \(x_{ij} \approx \sum_{m = 1}^M d_mu_{im}v_{jm}\)

PCA and Eigendecomposition

PCA is equivalent to the eigendecomposition of \(\mathbf{X}'\mathbf{X}\) or of \(\boldsymbol \Sigma= \text{Cov}(\mathbf{X}) = \dfrac{\mathbf{X}'\mathbf{X}}{n-1}\) (assuming the columns of \(\mathbf{X}\) are centered), the covariance matrix of \(\mathbf{X}\).

\[\mathbf{X}'\mathbf{X}= \mathbf{V}\mathbf{D}^2\mathbf{V}' = d_1^2\mathbf{v}_1\mathbf{v}_1' + \dots + d_p^2\mathbf{v}_p\mathbf{v}_p'\]

  • Total variation: \(\sum_{j=1}^p \text{Var}(\mathbf{x}_j) = \frac{1}{n-1}\sum_{j=1}^pd_j^2\), which equals \(p\) when the variables are standardized

  • Variation of \(m\)th PC: \(\text{Var}(\mathbf{z}_m) = \frac{d_m^2}{n-1}\)
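
A numerical check with the USArrests example (the variables were standardized, so the covariance matrix here is the correlation matrix):

ev <- eigen(cov(scale(USArrests)))
ev$values          ## 2.48 0.99 0.36 0.17, equal to pca_output$sdev^2
sum(ev$values)     ## total variation = p = 4 for standardized variables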

Transformed OLS Regression

Transform \({\bf y = X\boldsymbol \beta+ \boldsymbol \epsilon}\) into \[{\bf y = XVV'\boldsymbol \beta+ \boldsymbol \epsilon= Z\boldsymbol \alpha} + \boldsymbol \epsilon\] where \(\mathbf{Z}= \mathbf{X}\mathbf{V}\) and \(\boldsymbol \alpha= \mathbf{V}'\boldsymbol \beta\), or \(\boldsymbol \beta= \mathbf{V}\boldsymbol \alpha\).

  • The least-squares estimator \(\hat{\boldsymbol \alpha} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{y}= \mathbf{D}^{-2}\mathbf{Z}'\mathbf{y}\)

  • \(\text{Var}\left(\hat{\boldsymbol \alpha} \right) = \sigma^2 (\mathbf{Z}'\mathbf{Z})^{-1} = \sigma^2 \mathbf{D}^{-2}\)

  • A small \(d_j\) means that the variance of \(\hat{\alpha}_j\) will be large.

  • PC regression combats multicollinearity by using fewer PCs \((m \ll p)\) in the model.

    • The PCs corresponding to tiny \(d_j\)s (huge variance) are removed, and least squares is applied to the remaining PCs.

Principal Component Regression (PCR)

  • Keep \(m < p\) PCs with the largest \(m\) singular values \(d_1, d_2, \dots, d_m\).
  • The least-squares estimator is \[\hat{\boldsymbol \alpha}_m = ({\bf Z} _m ' {\bf Z} _m) ^ {-1} {\bf Z}_m ' {\bf y} = \mathbf{D}_m ^ {-2} {\bf Z}_m '{\bf y}\] where \({\bf Z}_m = [{\bf z}_1 \, {\bf z}_2 \, \cdots \, {\bf z}_m]\) and \(\mathbf{D}_m = \text{diag}(d_1, d_2, \dots, d_m)\). (A hand-rolled sketch follows this list.)
  • \({\bf b}_{pc} = {\bf V}_m\hat{\boldsymbol \alpha}_m = [{\bf v}_1 \, {\bf v}_2 \, \cdots \, {\bf v}_m] \begin{bmatrix} \hat{\alpha}_1 \\ \hat{\alpha}_2 \\ \vdots \\ \hat{\alpha}_m \end{bmatrix}\)
  • \({\bf b}_{pc}\) is biased but has smaller variance than \({\bf b}\).
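
A minimal hand-rolled sketch of this estimator via the SVD, assuming a centered design matrix X and response vector y (hypothetical inputs; compute_pcr is not a package function):

compute_pcr <- function(X, y, m) {
  sv <- svd(X)
  Z_m <- sv$u[, 1:m, drop = FALSE] %*% diag(sv$d[1:m], m)   ## first m PC scores: Z_m = U_m D_m
  alpha_hat <- solve(crossprod(Z_m), crossprod(Z_m, y))     ## (Z_m'Z_m)^{-1} Z_m'y
  sv$v[, 1:m, drop = FALSE] %*% alpha_hat                   ## b_pc = V_m alpha_hat
}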

Principal Component Regression

  • PCR performs well when the directions in which \(X_1, \dots ,X_p\) show the most variation (the first few PCs) are the directions that are associated with \(Y\).
  • Deleting PCs does not imply deleting any of the original regressors. The selected \(m\) PCs contain information provided by all \(p\) original regressors. (Not a feature selection method)
  • PCR and ridge regression are closely related. Ridge regression is a continuous version of PCR.

pls::pcr()
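
The calls below assume the usual ISLR-style Hitters preparation (the train indices and the x, y, y.test objects). A minimal sketch of that setup, under the assumption that it mirrors the ISLR2 lab:

## hypothetical setup assumed by the pcr() code below
library(ISLR2)
Hitters <- na.omit(Hitters)                     ## drop players with missing Salary
x <- model.matrix(Salary ~ ., Hitters)[, -1]    ## predictor matrix without the intercept
y <- Hitters$Salary
set.seed(1)
train <- sample(1:nrow(x), nrow(x) / 2)         ## training indices
test <- (-train)
y.test <- y[test]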

library(pls)
set.seed(1)
pcr.fit <- pcr(Salary ~ ., data = Hitters, 
               subset = train,
               scale = TRUE, validation = "CV")
validationplot(pcr.fit, val.type = "MSEP")

pls::pcr()

pcr.pred <- predict(pcr.fit, x[test, ], ncomp = 5)
mean((pcr.pred - y.test)^2)
[1] 142812
pcr.fit <- pcr(y ~ x, scale = TRUE, ncomp = 5)
summary(pcr.fit)
Data:   X dimension: 263 19 
    Y dimension: 263 1
Fit method: svdpc
Number of components considered: 5
TRAINING: % variance explained
   1 comps  2 comps  3 comps  4 comps  5 comps
X    38.31    60.16    70.84    79.03    84.29
y    40.63    41.58    42.17    43.22    44.90

Other Dimension Reduction (Latent Variable) Methods