MSSC 6250 Statistical Machine Learning
In many applications, the response or the error term is non-normal.
Besides normal, the response in a generalized linear model (GLM) can be Bernoulli, binomial, Poisson, etc.
There is no assumption that \(\mathrm{Var}(y_i)\) is homogeneous: Both \(E(y_i)\) and \(\mathrm{Var}(y_i)\) may vary with the regressors from data point to data point.
Ordinary least squares does not apply when the response is not normal.
Examples: normal vs. COVID vs. smoking; fake news vs. true news.
There are two popular approaches to modeling binary data.
Given the training data \(\{(x_i, y_i)\}_{i=1}^n\), we build a classifier to predict whether people will default on their credit card payment (\(Y\)), yes or no, based on monthly credit card balance (\(X\)).
\[Y =\begin{cases} 0 & \quad \text{if not default}\\ 1 & \quad \text{if default} \end{cases}\]
What are potential issues with this dummy variable approach?
\(\hat{Y} = b_0 + b_1X\) estimates \(P(Y = 1 \mid X) = P(\text{default} = \text{yes} \mid \text{balance})\), and this dummy-variable regression is equivalent to linear discriminant analysis (LDA), discussed later.
However, for a response with more than two classes, e.g.,
\[Y =\begin{cases} 1 & \quad \text{stroke}\\ 2 & \quad \text{drug overdose} \\ 3 & \quad \text{epileptic seizure} \end{cases}\]
the coding implies an ordering, epileptic seizure \(>\) drug overdose \(>\) stroke, and insists on equal gaps between categories: stroke \(-\) drug overdose \(=\) drug overdose \(-\) epileptic seizure.
Instead, model the probability of `default` using an S-shaped curve.

We use the normal distribution for a numerical \(y\). What distribution can we use for a binary \(y\) that takes values 0 or 1?
Default (\(y = 1\)) vs. not default (\(y = 0\)) is a Bernoulli variable. But the success probability varies with the predictor:
\[ y_i \mid x_i \stackrel{indep}{\sim} \text{Bernoulli}(\pi_i = \pi(x_i)) = \text{binomial}(m=1,\pi = \pi_i) \]
The chance of default depends on `balance`: \(x_1 = 2000\) has a larger \(\pi_1 = \pi(2000)\) than \(\pi_2 = \pi(500)\) with \(x_2 = 500\).
Instead of predicting \(y_i\) directly, we use the predictors to model its probability of success, \(\pi_i\).
But how?
We use the logit transformation, which maps \(\pi \in (0, 1)\) onto the whole real line:
\[\eta = \text{logit}(\pi) = \ln\left(\frac{\pi}{1-\pi}\right)\]
For \(i = 1, \dots, n\) and with \(p\) predictors: \[Y_i \mid \pi_i({\bf x}_i) \stackrel{indep}{\sim} \text{Bernoulli}(\pi_i), \quad {\bf x}_i' = (x_{i1}, \dots, x_{ip})\] \[\text{logit}(\pi_i) = \ln \left( \frac{\pi_i}{1 - \pi_i} \right) = \eta_i = \beta_0+\beta_1 x_{i1} + \cdots + \beta_p x_{ip} = {\bf x}_i'\boldsymbol \beta\]
\[\small E(Y_i) = \pi_i = \frac{1}{1 + \exp(-{\bf x}_i'\boldsymbol \beta)}\] \[\small \hat{\pi}_i = \frac{1}{1 + \exp(-{\bf x}_i'\hat{\boldsymbol \beta} )}\]
Fit the model to the `Default` data in the `ISLR2` package using `glm()`.
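A minimal sketch of the fit (assuming the `ISLR2` package is installed; `family = binomial` gives the logit link by default):

```r
library(ISLR2)  # provides the Default data

# Logistic regression of default (No/Yes) on balance
logit_fit <- glm(default ~ balance, data = Default, family = binomial)
summary(logit_fit)$coefficients
```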
Estimate Std. Error z value Pr(>|z|)
(Intercept) -10.6513 0.36116 -29 3.6e-191
balance 0.0055 0.00022 25 2.0e-137
\(\hat{\eta} = \text{logit}(\hat{\pi}) = \ln \left( \frac{\hat{\pi}}{1 - \hat{\pi}}\right) = -10.65 + 0.0055 \times \text{balance}\)
The ratio \(\frac{\pi}{1-\pi} \in (0, \infty)\) is called the odds of some event.
\[\ln \left( \frac{\pi(x)}{1 - \pi(x)} \right)= \beta_0 + \beta_1x\]
Estimate Std. Error z value Pr(>|z|)
(Intercept) -10.651 0.36 -29 0
balance 0.005 0.00 25 0
A one-unit increase in `balance` increases the log odds of `default` by 0.0055 units; equivalently, the odds of `default` increase by about 0.55% (\(e^{0.0055} \approx 1.0055\)) with each additional unit of credit card balance.

For a balance of \(\$2000\),

\[\log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = -10.65+0.0055\times 2000\]
\[ \hat{\pi} = \frac{1}{1+\exp(-(-10.65+0.0055 \times 2000))} = 0.586\]
What is the probability of default when the balance is 500? What about balance 2500?
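A quick check with `predict()` (a sketch, assuming the `logit_fit` object from earlier):

```r
# P(default) at balance = 500 and 2500
predict(logit_fit, newdata = data.frame(balance = c(500, 2500)),
        type = "response")
# roughly 0.0004 and 0.96, using the fitted coefficients above
```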
`glm()` uses maximum likelihood to estimate the parameters \(\boldsymbol \beta\). The log-likelihood is given by
\[\ell(\boldsymbol \beta) = \sum_{i=1}^n \log \, p(y_i \mid x_i, \boldsymbol \beta).\]
Using Bernoulli probabilities, we have
\[\begin{align} \ell(\boldsymbol \beta) =& \sum_{i=1}^n \log \left\{ \pi(\mathbf{x}_i)^{y_i} [1-\pi(\mathbf{x}_i)]^{1-y_i} \right\}\\ =& \sum_{i=1}^n y_i \log \frac{\pi(\mathbf{x}_i)}{1-\pi(\mathbf{x}_i)} + \log [1-\pi(\mathbf{x}_i)] \\ =& \sum_{i=1}^n y_i \mathbf{x}_i' \boldsymbol \beta- \log [ 1 + \exp(\mathbf{x}_i' \boldsymbol \beta)] \end{align}\]
There is no closed-form solution for \(\hat{\boldsymbol \beta}\); the Newton-Raphson method iterates
\[\boldsymbol \beta^{\text{new}} = \boldsymbol \beta^{\text{old}} - \left[ \frac{\partial ^2 \ell(\boldsymbol \beta)}{\partial \boldsymbol \beta\partial \boldsymbol \beta'}\right] ^{-1}\frac{\partial \ell(\boldsymbol \beta)}{\partial \boldsymbol \beta}\]
until convergence.
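A minimal Newton-Raphson sketch for this log-likelihood (the function name `newton_logistic` is hypothetical; here the score is \(\mathbf{X}'(\mathbf{y} - \boldsymbol \pi)\) and the Hessian is \(-\mathbf{X}'\mathbf{W}\mathbf{X}\) with \(W_{ii} = \pi_i(1-\pi_i)\)):

```r
# Newton-Raphson for logistic regression (sketch)
newton_logistic <- function(x, y, tol = 1e-8, maxit = 50) {
  beta <- rep(0, ncol(x))
  for (it in seq_len(maxit)) {
    pi_hat <- 1 / (1 + exp(-drop(x %*% beta)))
    grad <- drop(t(x) %*% (y - pi_hat))              # score: X'(y - pi)
    hess <- -t(x) %*% (x * (pi_hat * (1 - pi_hat)))  # Hessian: -X'WX
    step <- solve(hess, grad)
    beta <- beta - step                              # Newton update
    if (max(abs(step)) < tol) break
  }
  beta
}
```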
Alternatively, we can maximize the log-likelihood numerically with `optim()`:
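The objective `my_loglik` and its gradient `my_gradient` are not shown in the slides; a minimal sketch consistent with the log-likelihood derived above:

```r
# Log-likelihood: sum_i [ y_i * x_i'beta - log(1 + exp(x_i'beta)) ]
my_loglik <- function(beta, x, y) {
  eta <- drop(x %*% beta)
  sum(y * eta - log(1 + exp(eta)))
}

# Gradient (score): X'(y - pi), where pi = 1 / (1 + exp(-eta))
my_gradient <- function(beta, x, y) {
  eta <- drop(x %*% beta)
  drop(t(x) %*% (y - 1 / (1 + exp(-eta))))
}
```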
```r
# Design matrix with an intercept column, and 0/1 response
x <- as.matrix(cbind("intercept" = 1, Default$balance))
y <- as.matrix(as.numeric(Default$default) - 1)  # "No"/"Yes" -> 0/1
beta <- rep(0, ncol(x))  # start with 0

optim(beta, fn = my_loglik, gr = my_gradient, method = "BFGS",
      x = x, y = y,
      control = list("fnscale" = -1))$par  # "fnscale" = -1 for maximization
```
[1] -10.6454 0.0055
| | True 0 | True 1 |
|---|---|---|
| Predict 0 | True Negative (TN) | False Negative (FN) |
| Predict 1 | False Positive (FP) | True Positive (TP) |
Sensitivity (True Positive Rate) \(= P( \text{predict 1} \mid \text{true 1}) = \frac{TP}{TP+FN}\)
Specificity (True Negative Rate) \(= P( \text{predict 0} \mid \text{true 0}) = \frac{TN}{FP+TN}\)
Accuracy \(= \frac{TP + TN}{TP+FN+FP+TN} = \frac{1}{m}\sum_{j=1}^m I(y_j = \hat{y}_j)\), where the \(y_j\)s are the true test labels and the \(\hat{y}_j\)s their corresponding predicted labels.

A good classifier is one whose test accuracy is highest, or equivalently whose test error rate \(\frac{1}{m}\sum_{j=1}^m I(y_j \ne \hat{y}_j)\) is smallest.
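The confusion matrix below can be reproduced with a 0.5 cut-off on the fitted probabilities (a sketch, assuming `logit_fit` from earlier):

```r
# In-sample predicted probabilities, then classify with a 0.5 cut-off
pred_prob <- predict(logit_fit, type = "response")
table(pred_prob > 0.5, Default$default)
```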
No Yes
FALSE 9625 233
TRUE 42 100
Helper packages: `caret` (Classification And REgression Training) and `yardstick` from `tidymodels`.
The ROC (Receiver Operating Characteristic) curve plots the True Positive Rate (Sensitivity) against the False Positive Rate (1 - Specificity) as the classification cut-off varies.
Packages: ROCR, pROC, yardstick::roc_curve()
The area under the ROC curve (AUC) summarizes performance over all cut-offs: the closer to 1, the better the classifier.
From the training set, choose the cut-off that maximizes

- G-Mean \(= \sqrt{\text{TPR} \times \text{TNR}}\) (the geometric mean of TPR and TNR), or
- Youden's J index \(= \text{TPR} + \text{TNR} - 1\) (the vertical distance above the 45-degree identity line),

or that minimizes

- \(\sqrt{(1 - \text{TPR})^2 + (1 - \text{TNR})^2}\) (the distance to the optimal point \((\text{FPR}, \text{TPR}) = (0, 1)\)).
With Youden’s J index, the optimal threshold is \(3.18\%\).
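One possible computation with `pROC` (a sketch; `pred_prob` are the fitted probabilities from the confusion-matrix code above):

```r
library(pROC)

roc_obj <- roc(Default$default, pred_prob)  # ROC curve object
auc(roc_obj)                                # area under the curve
# Cut-off maximizing Youden's J = TPR + TNR - 1
coords(roc_obj, x = "best", best.method = "youden")
```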
When classifying \(K > 2\) categories, we can consider multinomial logistic regression.
The response should be nominal. If \(Y_i\) is ordinal, consider ordinal regression instead.
\[Pr(Y = k \mid {\bf x}) = \dfrac{e^{\beta_{k0}+\beta_{k1}x_1 + \cdots + \beta_{kp}x_p}}{ \sum_{l=1}^{K}e^{\beta_{l0}+\beta_{l1}x_1 + \cdots + \beta_{lp}x_p}}\] for \(k = 1, 2, \dots, K-1\), and \[Pr(Y = K \mid {\bf x}) = \dfrac{1}{ \sum_{l=1}^{K}e^{\beta_{l0}+\beta_{l1}x_1 + \cdots + \beta_{lp}x_p}}.\]
For \(k = 1, \dots, K-1\), \[\log\left( \frac{Pr(Y = k \mid \mathbf{x})}{Pr(Y = K \mid \mathbf{x})} \right) = \beta_{k0}+\beta_{k1}x_1 + \cdots + \beta_{kp}x_p.\]
First select a single class to serve as the baseline, the \(K\)th class for example.
\(\pi_{ik} = Pr(Y_i = k \mid {\bf x}_i)\): the probability that the \(i\)-th response falls in the \(k\)-th category.
\(\sum_{k = 1}^K \pi_{ik} = 1\) for each observation \(i\).
What is the value of \(\eta_{iK}\)?
For \(k = 1, \dots, K\),
\[\pi_{ik} = \dfrac{e^{\eta_{ik}}}{\sum_{l=1}^{K}e^{\eta_{il}}}; \left(\pi_{iK} = \dfrac{1}{\sum_{l=1}^{K}e^{\eta_{il}}} \right)\]
Rows: 200
Columns: 3
$ prog <fct> vocation, general, vocation, vocation, vocation, general, vocati…
$ ses <fct> low, middle, high, low, middle, high, middle, middle, middle, mi…
$ write <dbl> 35, 33, 39, 37, 31, 36, 36, 31, 41, 37, 44, 33, 31, 44, 35, 44, …
The data contain three variables:

- `prog` (program type): `general`, `academic`, `vocation`
- `ses` (socioeconomic status): `low`, `middle`, `high`
- `write` (writing score)

Program type appears to be associated with socioeconomic status
| ses | general | academic | vocation |
|---|---|---|---|
| low | 16 | 19 | 12 |
| middle | 20 | 44 | 31 |
| high | 9 | 42 | 7 |
and with writing score:
| prog | M | SD |
|---|---|---|
| general | 51 | 9.4 |
| academic | 56 | 7.9 |
| vocation | 47 | 9.3 |
The level (coding order) of `prog` is

[1] "general"  "academic" "vocation"
[1] "academic" "general" "vocation"
With this coding (`academic` = 1, `general` = 2, `vocation` = 3; and for `ses`, `low` = 1, `middle` = 2, `high` = 3), the model is

\[\log \left( \frac{Pr(prog = general)}{Pr(prog = academic)}\right) = b_{10} + b_{11}(ses = 2) + b_{12}(ses = 3) + b_{13} write\]
\[\log \left( \frac{Pr(prog = vocation)}{Pr(prog = academic)}\right) = b_{20} + b_{21}(ses = 2) + b_{22}(ses = 3) + b_{23}write\]
\(b_{13}\): all others held constant, the change in the log odds of being in the `general` vs. `academic` program as `write` increases by one unit.
\(b_{21}\): all others held constant, the change in the log odds of being in the `vocation` vs. `academic` program when moving from `ses = "low"` to `ses = "middle"`.
Fit the model with `nnet::multinom()`:
```r
# Multinomial logistic regression with academic as the baseline class
multino_fit <- nnet::multinom(prog2 ~ ses + write, data = multino_data, trace = FALSE)
summ <- summary(multino_fit)
summ$coefficients
```
(Intercept) sesmiddle seshigh write
general 2.9 -0.53 -1.16 -0.058
vocation 5.2 0.29 -0.98 -0.114
`glmnet()` instead uses the softmax coding, which treats all \(K\) classes symmetrically, so its coefficients have a different meaning. (ISL eq. (4.13), (4.14))
The fitted values, predictions, log odds between any pair of classes, and other key model outputs remain the same.
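The class probabilities shown below can be obtained from the fitted model (a sketch):

```r
# Predicted class probabilities for the first six students
head(predict(multino_fit, type = "probs"))
```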
academic general vocation
1 0.15 0.34 0.51
2 0.12 0.18 0.70
3 0.42 0.24 0.34
4 0.17 0.35 0.48
5 0.10 0.17 0.73
6 0.35 0.24 0.41
Hold `write` at its mean and examine the predicted probabilities for each level of `ses`; see the sketch below.
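A minimal sketch of this computation (`new_obs` is a hypothetical name):

```r
# Hold write at its mean; vary ses over its three levels
new_obs <- data.frame(ses = c("low", "middle", "high"),
                      write = mean(multino_data$write))
predict(multino_fit, newdata = new_obs, type = "probs")
```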
Logistic regression uses the logit function \(\eta = g(\pi) = \ln\left( \dfrac{\pi}{1-\pi}\right)\) as the link function.
\(\pi = \dfrac{1}{1 + e^{-\eta}} \in (0, 1)\) is the CDF of the logistic distribution, whose PDF is of the form \[f(\eta) = \frac{e^{-\eta}}{(1 + e^{-\eta})^2}, \quad \eta \in(-\infty, \infty)\]
We could use a different link function, and hence a different CDF, to describe the S-shaped probability curve.
| Distribution | Link Function \(\eta = g(\pi)\) | CDF \(\pi = g^{-1}(\eta)\) | PDF \(f(\eta)\) |
|---|---|---|---|
| Logistic | logit: \(\ln\left( \dfrac{\pi}{1-\pi}\right)\) | \(\dfrac{1}{1 + e^{-\eta}}\) | \(\dfrac{e^{-\eta}}{(1 + e^{-\eta})^2}\) |
| Normal | probit: \(\Phi^{-1}(\pi)\) | \(\Phi(\eta)\) | \(\frac{1}{\sqrt{2\pi}} e^{-\frac{\eta^2}{2}}\) |
| Gumbel | cloglog: \(\log\left( -\log(1-\pi)\right)\) | \(1 - \exp(-\exp(\eta))\) | \(e^{\eta - e^{\eta}}\) |
Any one-to-one, continuous, and differentiable transformation that maps probabilities onto the real line can be used to produce a GLM for binary classification. If \(F\) denotes the corresponding CDF, then

\[\pi = F(\eta), \qquad \eta = F^{-1}(\pi).\]
| Name | Link Function \(\eta = g(\pi)\) |
|---|---|
| logit | \(\ln\left( \dfrac{\pi}{1-\pi}\right)\) |
| probit | \(\Phi^{-1}(\pi)\) |
| cloglog | \(\log\left( -\log(1-\pi)\right)\) |
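In R, these links are available through `binomial()`; a sketch on the `Default` data:

```r
# Probit and complementary log-log fits of the same binary model
probit_fit  <- glm(default ~ balance, data = Default,
                   family = binomial(link = "probit"))
cloglog_fit <- glm(default ~ balance, data = Default,
                   family = binomial(link = "cloglog"))
```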
- Repeated measures and binomial logistic regression
- Regression diagnostics for binary data
- Penalized logistic regression
- Exponential family
- Poisson regression