MSSC 6250 Statistical Machine Learning
Supervised Learning: response \(Y\) and features \(X_1, X_2, \dots, X_p\) measured on \(n\) observations.
Unsupervised Learning: only features \(X_1, X_2, \dots, X_p\) measured on \(n\) observations.
English and Math measure overall academic performance.
English and Math measure different abilities.
One variable represents one dimension.
With many variables in the data, we live in a high dimensional world.
GOAL:
Find a low-dimensional (usually 2D) representation of the data that captures as much as possible of the information the variables provide.
Use two constructed variables to represent all \(p\) variables, and make a scatter plot of these two variables to get a sense of what the observations look like in the high-dimensional space.
Why and when can we omit dimensions?
PCA is a dimension-reduction tool that finds a low-dimensional representation of a data set containing as much of the variation as possible.
Each observation lives in a high-dimensional space (lots of variables), but not all of these dimensions (variables) are equally interesting/important.
The concept of interesting/important is measured by the amount that the observations vary along each dimension.
Principal Component 1 (PC1): maximizes the variance of the projected points.
PC1 is the line in the Eng-Math space that is closest to the \(n\) observations
PC1 is the best 1D representation of the 2D data
[Figures: 1D representation and 2D representation of the data]
If the variation for PC1 is \(17\) and the variation for PC2 is \(2\), the total variation presented in the data is \(17+2=19\).
PC1 accounts for \(17/19 = 89\%\) of the total variation, and PC2 accounts for \(2/19 = 11\%\) of the total variation.
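This arithmetic is easy to verify in R (the values 17 and 2 are the hypothetical variances above):
pc_var <- c(PC1 = 17, PC2 = 2)   # hypothetical variances along PC1 and PC2
pc_var / sum(pc_var)             # proportions of total variation: about 0.89 and 0.11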
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21
Alaska 10.0 263 48 44
Arizona 8.1 294 80 31
Arkansas 8.8 190 50 20
California 9.0 276 91 41
Colorado 7.9 204 78 39
Connecticut 3.3 110 77 11
Delaware 5.9 238 72 16
Florida 15.4 335 80 32
Georgia 17.4 211 60 26
Hawaii 5.3 46 83 20
Idaho 2.6 120 54 14
Illinois 10.4 249 83 24
Indiana 7.2 113 65 21
Iowa 2.2 56 57 11
Kansas 6.0 115 66 18
USArrests
pca_output <- prcomp(USArrests, scale = TRUE)
## rotation matrix provides PC loadings
(pca_output$rotation <- -pca_output$rotation)
PC1 PC2 PC3 PC4
Murder 0.54 0.42 -0.34 -0.649
Assault 0.58 0.19 -0.27 0.743
UrbanPop 0.28 -0.87 -0.38 -0.134
Rape 0.54 -0.17 0.82 -0.089
-pca_output$rotation gives us the same PCs as pca_output$rotation does; flipping the sign only reverses the direction along each PC, not the angle (orientation) of the PC axis.
\(\text{PC1} = 0.54 \times \text{Murder} + 0.58 \times \text{Assault} + 0.28 \times \text{UrbanPop} + 0.54 \times \text{Rape}\)
\(\text{PC2} = 0.42 \times \text{Murder} + 0.19 \times \text{Assault} - 0.87 \times \text{UrbanPop} - 0.17 \times \text{Rape}\)
(Here Murder, Assault, UrbanPop, and Rape denote the standardized variables, since scale = TRUE.)
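As a minimal check (assuming the pca_output object created above), the scores are just the standardized data multiplied by the loading matrix; the result matches pca_output$x below up to column-wise sign flips.
# PC scores = standardized data %*% loadings
head(scale(USArrests) %*% pca_output$rotation)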
pca_output$x
PC1 PC2 PC3 PC4
Alabama 0.98 1.12 -0.44 -0.15
Alaska 1.93 1.06 2.02 0.43
Arizona 1.75 -0.74 0.05 0.83
Arkansas -0.14 1.11 0.11 0.18
California 2.50 -1.53 0.59 0.34
Colorado 1.50 -0.98 1.08 0.00
Connecticut -1.34 -1.08 -0.64 0.12
Delaware 0.05 -0.32 -0.71 0.87
Florida 2.98 0.04 -0.57 0.10
Georgia 1.62 1.27 -0.34 -1.07
Hawaii -0.90 -1.55 0.05 -0.89
Idaho -1.62 0.21 0.26 0.49
Illinois 1.37 -0.67 -0.67 0.12
Indiana -0.50 -0.15 0.23 -0.42
Iowa -2.23 -0.10 0.16 -0.02
Kansas -0.79 -0.27 0.03 -0.20
PC1 PC2 PC3 PC4
Murder 0.54 0.42 -0.34 -0.649
Assault 0.58 0.19 -0.27 0.743
UrbanPop 0.28 -0.87 -0.38 -0.134
Rape 0.54 -0.17 0.82 -0.089
The PC1 loading vector places roughly equal weight on Assault, Murder, and Rape, with much less weight on UrbanPop.
The PC2 loading vector places most of its weight on UrbanPop, and much less weight on the other three features.
The last rows of the score matrix pca_output$x (output truncated):
PC1 PC2 PC3 PC4
Wisconsin -2.06 -0.61 -0.14 -0.18
Wyoming -0.62 0.32 -0.24 0.16
Higher value of PC1 means higher crime rates (roughly).
Higher value of PC2 means higher level of urbanization (roughly).
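These interpretations are usually read off a PCA biplot, which overlays the scores and the loading vectors. A minimal sketch with base R (signs may differ from the tables above):
# points are states (scores); arrows are variables (loadings)
biplot(prcomp(USArrests, scale = TRUE), scale = 0)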
In the biplot, the three crime-related variables (Assault, Murder, and Rape) are located close to each other, while UrbanPop is far from the other three: Assault, Murder, and Rape are more correlated with one another, and UrbanPop is less correlated with them.
Variance of each PC:
[1] 2.48 0.99 0.36 0.17
Proportion of variance explained (PVE):
[1] 0.620 0.247 0.089 0.043
Look for the point at which the proportion of variance explained by each subsequent PC drops off (the elbow of the scree plot).
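Assuming the pca_output object fitted earlier, the two outputs above can be reproduced from the PC standard deviations returned by prcomp():
pc_var <- pca_output$sdev^2    # variance of each PC: 2.48 0.99 0.36 0.17
pve <- pc_var / sum(pc_var)    # proportion of variance explained
pve                            # 0.620 0.247 0.089 0.043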
\[Z_k = \phi_{1k}X_1 + \phi_{2k}X_2 + \dots + \phi_{pk}X_p,\] where \(\sum_{j=1}^p\phi_{jk}^2=1\).
\((\phi_{1k}, \phi_{2k}, \dots, \phi_{pk})'\) is the PC loading vector.
The PC1 loading vector solves
\[\max_{\phi_{11}, \phi_{21}, \dots, \phi_{p1}} \left\{ \frac{1}{n}\sum_{i=1}^n z_{i1}^2\right\} = \left\{ \frac{1}{n}\sum_{i=1}^n \left( \sum_{j=1}^p \phi_{j1} x_{ij}\right)^2\right\} \quad \text{s.t.} \quad \sum_{j=1}^p\phi_{j1}^2 = 1\]
Maximize the sample variance of the projected points, i.e., the scores \(z_{11}, z_{21}, \dots, z_{n1}\) (the variables are assumed to be centered, so the scores have mean zero and \(\frac{1}{n}\sum_{i=1}^n z_{i1}^2\) is their sample variance).
The PC loading vector defines a direction in feature space along which the data vary the most.
For the \(k\)th PC, \(k > 1\),
\[\max_{\phi_{1k}, \phi_{2k}, \dots, \phi_{pk}} \left\{ \frac{1}{n}\sum_{i=1}^n z_{ik}^2\right\} = \left\{ \frac{1}{n}\sum_{i=1}^n \left( \sum_{j=1}^p \phi_{jk} x_{ij}\right)^2\right\} \quad \text{s.t.} \quad \sum_{j=1}^p\phi_{jk}^2 = 1, \text{ and } {\mathbf{z}_m}'\mathbf{z}_k = 0, \, m = 1, \dots, k-1\] where
\(\mathbf{z}_l = (z_{1l}, z_{2l}, \dots, z_{nl})'\)
PCs provide low-dimensional planes that are closest to the observations.
\(x_{ij} \approx \sum_{m=1}^M z_{im}\phi_{jm}\) with equality when \(M = \min(n-1, p)\)
\[(z_{im}, \phi_{jm}) = \mathop{\mathrm{arg\,min}}_{a_{im}, b_{jm}} \left\{ \sum_{j=1}^p\sum_{i=1}^n\left( x_{ij} - \sum_{m=1}^Ma_{im}b_{jm}\right)^2\right\}\]
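A minimal numerical check of this approximation on USArrests (refitting prcomp() so the scores and loadings carry consistent signs):
pr <- prcomp(USArrests, scale = TRUE)
M <- 2
X_approx <- pr$x[, 1:M] %*% t(pr$rotation[, 1:M])      # rank-M approximation of the standardized data
max(abs(scale(USArrests) - pr$x %*% t(pr$rotation)))   # essentially 0 when all PCs are used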
If we perform PCA on the unscaled variables, the PC1 loading vector places a very large loading on Assault, simply because Assault has by far the largest variance.
When all the variables are measured in the same units, there is no need to scale them.
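A quick illustration with USArrests of why scaling matters here (refitting PCA without scaling):
round(prcomp(USArrests)$rotation[, 1], 3)   # unscaled PC1 loading: dominated by Assault
apply(USArrests, 2, var)                    # Assault's variance dwarfs the other variables'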
PCA is equivalent to the singular value decomposition (SVD) of the column-centered data matrix \(\mathbf{X}\):
\[\mathbf{X}_{n\times p} = \mathbf{U}_{n \times n} \mathbf{D}_{n \times p} \mathbf{V}'_{p \times p}\] where \(\mathbf{U}\) and \(\mathbf{V}\) are orthogonal matrices, and \(\mathbf{D}\) is diagonal with diagonal elements singular values \(d_1 \ge d_2 \ge \cdots \ge d_p\).
\({\bf Z}_{n \times p} = [{\bf z}_1 \, {\bf z}_2 \, \cdots \, {\bf z}_p] = \mathbf{X}\mathbf{V}\)
The \(j\)th PC is the \(j\)th column of \({\bf Z}\) given by \({\bf z}_j = (z_{1j}, z_{2j}, \dots, z_{nj})' = {\bf Xv}_j\).
Project \(\mathbf{X}\) onto the space spanned by \(\mathbf{v}_j\)s
\(\mathbf{v}_j\)s are loading vectors.
\({\bf Z}_{n \times p} = \mathbf{U}\mathbf{D}\)
\({\bf z}_j = d_j\mathbf{u}_j\).
\(\mathbf{u}_j\)s are the unit PC vectors, and \(d_j\) controls the variation along the \(\mathbf{u}_j\) direction.
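A minimal sketch verifying the equivalence on USArrests (centering and scaling first, to match prcomp(scale = TRUE)):
X_std <- scale(USArrests)    # center and scale the columns
sv <- svd(X_std)
Z <- sv$u %*% diag(sv$d)     # Z = UD = XV: the PC score matrix
head(Z)                      # matches prcomp(USArrests, scale = TRUE)$x up to column signs
sv$v                         # loading vectors, up to column signs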
\({\mathbf{X}} = {\mathbf{U}}{\mathbf{D}}{\mathbf{V}}'\)
\(\mathbf{A}= \mathbf{U}_{n\times M}\mathbf{D}_{M \times M}\) and \(\mathbf{B}' = \mathbf{V}_{M\times p}'\) solve
\[ \min_{\mathbf{A}\in \mathbb{R}^{n \times M},\, \mathbf{B}\in \mathbb{R}^{p \times M}} \|\mathbf{X}- \mathbf{A}\mathbf{B}' \|_F,\] where
\(x_{ij} = \sum_{m = 1}^p d_mu_{im}v_{jm}\)
\(x_{ij} \approx \sum_{m = 1}^M d_mu_{im}v_{jm}\)
PCA is equivalent to the eigendecomposition of \(\mathbf{X}'\mathbf{X}\) or of \(\boldsymbol \Sigma= \text{Cov}(\mathbf{X}) = \dfrac{\mathbf{X}'\mathbf{X}}{n-1}\), the covariance matrix of \(\mathbf{X}\) (assuming the columns of \(\mathbf{X}\) are centered).
\[\mathbf{X}'\mathbf{X}= \mathbf{V}\mathbf{D}^2\mathbf{V}' = d_1^2\mathbf{v}_1\mathbf{v}_1' + \dots + d_p^2\mathbf{v}_p\mathbf{v}_p'\]
Total variation: \(\sum_{j=1}^p \text{Var}(\mathbf{x}_j) = \frac{1}{n-1}\sum_{j=1}^p d_j^2\), which equals \(p\) when the variables are standardized.
Variation of \(m\)th PC: \(\text{Var}(\mathbf{z}_m) = \frac{d_m^2}{n-1}\)
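A minimal check on USArrests (the covariance matrix of the standardized data is the correlation matrix):
eig <- eigen(cov(scale(USArrests)))
eig$values    # equal to pca_output$sdev^2: 2.48 0.99 0.36 0.17
eig$vectors   # loading vectors, up to column signs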
Transform \({\bf y = X\boldsymbol \beta+ \boldsymbol \epsilon}\) into \[{\bf y = XVV'\boldsymbol \beta+ \boldsymbol \epsilon= Z\boldsymbol \alpha} + \boldsymbol \epsilon\] where \(\mathbf{Z}= \mathbf{X}\mathbf{V}\) and \(\boldsymbol \alpha= \mathbf{V}'\boldsymbol \beta\), or \(\boldsymbol \beta= \mathbf{V}\boldsymbol \alpha\).
The least-squares estimator is \(\hat{\boldsymbol \alpha} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{y}= \mathbf{D}^{-2}\mathbf{Z}'\mathbf{y}\).
\(\text{Var}\left(\hat{\boldsymbol \alpha} \right) = \sigma^2 (\mathbf{Z}'\mathbf{Z})^{-1} = \sigma^2 \mathbf{D}^{-2}\)
A small \(d_j\) means that the variance of \(\hat{\alpha}_j\) will be large.
Principal component regression (PCR) combats multicollinearity by using only the first few PCs \((M \ll p)\) in the model, as sketched below.
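A toy sketch of this idea on simulated data (the data here are made up purely for illustration): PCR is just least squares on the first M PC scores.
set.seed(1)
X <- matrix(rnorm(100 * 5), 100, 5)
X <- cbind(X, X[, 5] + rnorm(100, sd = 0.01))       # add a nearly collinear column
y <- drop(X %*% c(1, 0, 2, 0, 1, 1)) + rnorm(100)   # simulated response
pr <- prcomp(X, scale = TRUE)
M <- 3
pcr_hand <- lm(y ~ pr$x[, 1:M])   # regress y on the first M PC scores
summary(pcr_hand)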
pls::pcr()
[1] 142812
Data: X dimension: 263 19
Y dimension: 263 1
Fit method: svdpc
Number of components considered: 5
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps
X 38.31 60.16 70.84 79.03 84.29
y 40.63 41.58 42.17 43.22 44.90
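The summary above comes from a fit of this kind. The sketch below is an assumption about the setup: the data set (ISLR's Hitters, whose 263 complete cases and 19 predictors match the dimensions shown) and the response are not stated in the slides.
library(pls)
hitters <- na.omit(ISLR::Hitters)   # assumed data set: 263 rows, 19 predictors plus Salary
pcr_fit <- pcr(Salary ~ ., data = hitters, scale = TRUE, ncomp = 5)
summary(pcr_fit)   # reports % variance explained in X and in the response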
Kernel Principal Component Analysis https://ml-explained.com/blog/kernel-pca-explained
Probabilistic PCA
Factor Analysis
Autoencoders
t-SNE