MSSC 6250 Statistical Machine Learning
When flipping a fair coin, we say that “the probability of flipping Heads is 0.5.” How do you interpret this probability?
An election is coming up and a pollster claims that candidate Yu has a 0.9 probability of winning. How do you interpret this probability?
Two claims.
(1) Ben claims he can predict the coin flip outcome. To test his claim, you flip a fair coin 8 times and he correctly predicts all.
(2) Emma claims she can distinguish natural and artificial sweeteners. To test her claim, you give her 8 samples and she correctly identifies each.
In light of these experiments, what do you conclude?
Suppose that during a doctor’s visit, you tested positive for COVID. If you only get to ask the doctor one question, which would it be?
The probability that some outcome of a process will be obtained is defined as
the relative frequency with which that outcome would be obtained if the process were repeated a large number of times independently under similar conditions.
After 10 flips:

| Outcome | Frequency | Relative Frequency |
|---|---|---|
| Heads | 4 | 0.4 |
| Tails | 6 | 0.6 |
| Total | 10 | 1.0 |

After 1,000 flips:

| Outcome | Frequency | Relative Frequency |
|---|---|---|
| Heads | 512 | 0.512 |
| Tails | 488 | 0.488 |
| Total | 1000 | 1.000 |
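To see this definition in action, here is a minimal R sketch (the seed is arbitrary and the counts are simulated, not those in the tables above):

```r
# Simulate fair-coin flips; the relative frequency of Heads settles near 0.5
set.seed(1)  # arbitrary seed for reproducibility
flips <- sample(c("Heads", "Tails"), size = 1000, replace = TRUE)
n <- c(10, 100, 1000)
cumsum(flips == "Heads")[n] / n  # relative frequency after 10, 100, 1000 flips
```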
How can we live if we don’t change? - Beyoncé. Lyric from “Satellites.”
Using data and prior beliefs to update our knowledge (posterior), and repeating.
We continuously update our knowledge about the world as we accumulate lived experiences, or collect data.
In Question 4, suppose the test results of 100 people are summarized below:

|  | Test Positive | Test Negative | Total |
|---|---|---|---|
| COVID | 3 | 1 | 4 |
| No COVID | 9 | 87 | 96 |
| Total | 12 | 88 | 100 |
\(H_0\): Do not have COVID vs. \(H_1\): Have COVID
A frequentist assesses the uncertainty of the observed data in light of an assumed hypothesis: \(P(\text{Data} \mid H_0) = 9/96\).
A Bayesian assesses the uncertainty of the hypothesis in light of the observed data: \(P(H_0 \mid \text{Data}) = 9/12\).
| type | n | percent |
|---|---|---|
| fake | 60 | 0.4 |
| real | 90 | 0.6 |
| Total | 150 | 1.0 |
An exclamation mark (!) in the title is more consistent with fake news:

| title_has_excl | fake | real |
|---|---|---|
| FALSE | 44 | 88 |
| TRUE | 16 | 2 |
| Total | 60 | 90 |
\(F\): an article is fake.
The prior probability model
Event | \(F\) | \(F^c\) | Total |
---|---|---|---|
Probability \(P(\cdot)\) | 0.4 | 0.6 | 1 |
\(D\): an article title has an exclamation mark.
Conditional probability: \(P(D \mid F) = 16/60 \approx 0.27\); \(P(D \mid F^c) = 2/90 \approx 0.02\).
The likelihood evaluates the compatibility of each event with the observed data \(D\): \[L(F \mid D) = P(D \mid F) \quad \text{and} \quad L(F^c \mid D) = P(D \mid F^c)\]
Event | \(F\) | \(F^c\) | Total |
---|---|---|---|
Probability \(P(\cdot)\) | 0.4 | 0.6 | 1 |
Likelihood \(L(\cdot \mid D)\) | 0.27 | 0.02 | 0.29 |
\[\begin{align*} P(F \mid D) &= \frac{P(F \cap D)}{P(D)}\\ &= \frac{L(F \mid D)P(F)}{P(D)}\end{align*}\]
\[\text{posterior = } \frac{\text{likelihood} \cdot \text{prior }}{ \text{normalizing constant}} \]
The title “The president has a funny secret!” has an exclamation mark, a feature that is more common among fake news.
Event | \(F\) | \(F^c\) | Total |
---|---|---|---|
Prior prob \(P(\cdot)\) | 0.4 | 0.6 | 1 |
Posterior prob \(P(\cdot \mid D)\) | 0.89 | 0.11 | 1 |
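These posterior probabilities follow directly from Bayes' rule; a minimal R sketch using the counts above:

```r
# posterior = likelihood * prior / normalizing constant
prior      <- c(F = 0.4, Fc = 0.6)       # P(F), P(F^c)
likelihood <- c(F = 16/60, Fc = 2/90)    # L(F | D), L(F^c | D)
posterior  <- prior * likelihood / sum(prior * likelihood)
round(posterior, 2)  # F: 0.89, Fc: 0.11
```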
For any parameter \(\theta\) (a random variable in the Bayesian framework) and data \(\mathbf{Y} = (Y_1, \dots, Y_n)\),
\(\pi(\theta)\): the prior pmf/pdf of \(\theta\)
\(L(\theta \mid y_1,\dots, y_n)\): the likelihood of \(\theta\) given observed data \(\mathbf{y}= \{y_i \}_{i = 1}^n\).
The posterior distribution of \(\theta\) given \(\mathbf{y}\) is
\[\pi(\theta \mid \mathbf{y}) = \frac{L(\theta \mid \mathbf{y})\pi(\theta)}{p(\mathbf{y})}\] where \[p(\mathbf{y}) = \begin{cases} \int_{\Theta} L(\theta \mid \mathbf{y})\pi(\theta) ~ d\theta & \text{if } \theta \text{ is continuous }\\ \sum_{\theta \in \Theta} L(\theta \mid \mathbf{y})\pi(\theta) & \text{if } \theta \text{ is discrete } \end{cases}\]
\[\pi(\theta \mid \mathbf{y}) = \frac{L(\theta \mid \mathbf{y})\pi(\theta)}{p(\mathbf{y})} \propto_{\theta} L(\theta \mid \mathbf{y})\pi(\theta)\]
\[\text{posterior } \propto \text{ likelihood } \cdot \text{ prior } \]
Key: Describe prior and data information using probabilistic models.
\[\pi(\theta \mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha - 1}(1-\theta)^{\beta-1}, \quad \theta \in (0, 1)\]
\(\theta \sim \text{beta}(\alpha, \beta)\)
In the prior model, \(\alpha\) and \(\beta\) are hyperparameters to be chosen to reflect our prior information.
Michelle’s support is centered around 45%; she polled at around 35% on the dreariest days and around 55% on the best days.
Choose \(\alpha\) and \(\beta\) so that the prior mean is about 0.45 and the range is from 0.35 to 0.55.
\(\mathrm{E}(\theta) = \frac{\alpha}{\alpha + \beta}\)
\(\mathrm{Var}(\theta) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}\)
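One choice consistent with these constraints is the beta(45, 55) prior used below; a quick R check:

```r
# Verify beta(45, 55) against the prior beliefs
a <- 45; b <- 55
a / (a + b)                              # prior mean: 0.45
sqrt(a * b / ((a + b)^2 * (a + b + 1)))  # prior sd: about 0.05
qbeta(c(0.025, 0.975), a, b)             # central 95% of mass: about (0.35, 0.55)
```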
You plan to conduct a new poll of \(n = 50\) Cheeseheads and record \(Y\), the number that support Michelle.
What distribution can be used to model the likelihood connecting the data \(y\) and the parameter of interest, \(\theta\)?
\[Y \mid \theta \sim \text{binomial}(n=50, \theta)\]
\[L(\theta \mid y = 30) = {50 \choose 30}\theta^{30}(1-\theta)^{20}, \quad \theta \in (0, 1)\]
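A quick sketch of this likelihood as a function of \(\theta\):

```r
# Plot the binomial likelihood L(theta | y = 30) over theta in (0, 1)
theta <- seq(0, 1, length.out = 501)
plot(theta, dbinom(30, size = 50, prob = theta), type = "l",
     xlab = expression(theta), ylab = "L(theta | y = 30)")
```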
\[\begin{align}Y \mid \theta &\sim \text{binomial}(n=50, \theta)\\ \theta &\sim \text{beta}(45, 55) \end{align}\]
Goal: Obtain the posterior distribution \(\pi(\theta \mid y)\).
\[ \begin{align} \pi(\theta \mid y) &\propto_{\theta} L(\theta \mid y)\pi(\theta) \\ &= {50 \choose 30}\theta^{30}(1-\theta)^{20} \times \frac{\Gamma(100)}{\Gamma(45)\Gamma(55)}\theta^{44}(1-\theta)^{54}\\ &\propto_{\theta} \theta^{74}(1-\theta)^{74} \end{align} \]
This is the kernel of a beta(75, 75) density, so using the fact that \(\int_{\mathcal{X}} f(x) \, dx = 1\) for any pdf \(f(x)\), the normalizing constant must be \(\frac{\Gamma(150)}{\Gamma(75)\Gamma(75)}\), and \[\theta \mid y = 30 \sim \text{beta}(75, 75).\]
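A sketch comparing the prior and posterior densities (the bayesrules package also offers plot_beta_binomial() for a similar picture):

```r
# Prior beta(45, 55) vs. posterior beta(75, 75) after observing y = 30 of n = 50
theta <- seq(0, 1, length.out = 501)
plot(theta, dbeta(theta, 75, 75), type = "l",
     xlab = expression(theta), ylab = "density")   # posterior
lines(theta, dbeta(theta, 45, 55), lty = 2)        # prior
legend("topright", c("posterior beta(75, 75)", "prior beta(45, 55)"), lty = 1:2)
```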
The parameter is a random variable in the Bayesian world.
It is fixed and constant in the frequentist world.
In simple linear regression,
\[Y_i \mid \beta_0, \beta_1, \sigma \stackrel{ind}{\sim} N\left(\mu_i, \sigma^2 \right) \quad \text{with} \quad \mu_i = \beta_0 + \beta_1 X_i\]
This normal data model is our likelihood \(L(\boldsymbol \theta \mid \mathcal{D})\) with \(\boldsymbol \theta = (\beta_0, \beta_1, \sigma)\) and \(\mathcal{D} = \{\mathbf{y}, \mathbf{x}\}\), as it evaluates how compatible the data are with different possible parameter values.
To be a Bayesian, what do we do next?
\[\begin{align} \beta_0 \sim N(m_0, s_0^2) \\ \beta_1 \sim N(m_1, s_1^2) \end{align}\]
\[\begin{align} \sigma \sim \text{Exp}(\lambda) \end{align}\] \(\pi(\sigma) = \lambda e^{-\lambda \sigma}\) for \(\sigma > 0\), with rate \(\lambda > 0\), \(\mathrm{E}(\sigma) = 1/\lambda\), and \(\mathrm{Var}(\sigma) = 1/\lambda^2\)
\[\begin{align} Y_i \mid \beta_0, \beta_1, \sigma &\stackrel{ind}{\sim} N\left(\mu_i, \sigma^2 \right) \quad \text{with} \quad \mu_i = \beta_0 + \beta_1 X_i \\ \beta_0 &\sim N(m_0, s_0^2) \\ \beta_1 &\sim N(m_1, s_1^2) \\ \sigma &\sim \text{Exp}(\lambda) \end{align}\]
The bayesrules::bikes data record bike-share ridership in Washington, D.C.:

Rows: 500
Columns: 2
$ rides     <int> 654, 1229, 1454, 1518, 1362, 891, 1280, 1220, 1137, 1368, 13…
$ temp_feel <dbl> 64.7, 49.0, 51.1, 52.6, 50.8, 46.6, 45.6, 49.2, 46.4, 45.6, …
\(\beta_{0c} \sim N(5000, 1000^2)\), where \(\beta_{0c}\) is the centered intercept, i.e., the expected ridership at the average temperature
\(\beta_{1} \sim N(100, 40^2)\)
\(\sigma \sim \text{Exp}(0.0008)\) because \(\mathrm{E}(\sigma) = 1/\lambda = 1/0.0008 = 1250\)
Update our prior understanding of the relationship between ridership and temperature using the data information carried by the likelihood.
The joint posterior distribution is
\[\pi(\beta_0, \beta_1, \sigma \mid \mathbf{y}) = \frac{L(\beta_0, \beta_1, \sigma \mid \mathbf{y})\pi(\beta_0, \beta_1, \sigma)}{p(\mathbf{y})}\] where
\(L(\beta_0, \beta_1, \sigma \mid \mathbf{y}) = p(\mathbf{y}\mid \beta_0, \beta_1, \sigma) = \prod_{i=1}^np(y_i \mid \beta_0, \beta_1, \sigma)\)
\(\pi(\beta_0, \beta_1, \sigma) = \pi(\beta_0)\pi(\beta_1)\pi(\sigma)\)
\(p(\mathbf{y}) = \int \int \int \left[\prod_{i=1}^np(y_i \mid \beta_0, \beta_1, \sigma)\right]\pi(\beta_0)\pi(\beta_1)\pi(\sigma) ~ d\beta_0d\beta_1d\sigma\)
There are lots of ways to generate/approximate the posterior distribution of parameters. One method is Markov chain Monte Carlo (MCMC).
rstanarm uses RStan syntax to do Bayesian inference for applied regression models (arm). We fit the model with rstanarm::stan_glm(), generating 4 Markov chains (chains = 4), each with 10000 iterations (iter = 5000*2).
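A minimal sketch of the fitting call, assuming the priors above (the object name bike_model matches the summaries below):

```r
library(rstanarm)

# Bayesian simple linear regression of ridership on felt temperature,
# with the priors specified above
bike_model <- stan_glm(
  rides ~ temp_feel, data = bayesrules::bikes,
  family = gaussian,
  prior_intercept = normal(5000, 1000),  # centered intercept: N(5000, 1000^2)
  prior = normal(100, 40),               # slope: N(100, 40^2)
  prior_aux = exponential(0.0008),       # sigma: Exp(0.0008)
  chains = 4, iter = 5000 * 2
)
```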
Each iteration draws a posterior sample of \(\beta_0\), \(\beta_1\), and \(\sigma\).
After tossing out the first half of the Markov chain values from the warm-up (burn-in) phase, the stan_glm() simulation produces four parallel chains of length 5000 for each model parameter:
\(\{ \beta_0^{(1)}, \beta_0^{(2)}, \dots, \beta_0^{(5000)} \}\), \(\{ \beta_1^{(1)}, \beta_1^{(2)}, \dots, \beta_1^{(5000)} \}\), \(\{ \sigma^{(1)}, \sigma^{(2)}, \dots, \sigma^{(5000)} \}\)
Numerical MCMC diagnostics for each parameter are all close to 1, suggesting the simulation is trustworthy:

(Intercept)   temp_feel       sigma 
       1.01        1.02        1.01 
(Intercept)   temp_feel       sigma 
          1           1           1 
# Posterior summary statistics
tidy(bike_model, effects = c("fixed", "aux"),
conf.int = TRUE, conf.level = 0.80)
# A tibble: 4 × 5
term estimate std.error conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -2194. 351. -2648. -1743.
2 temp_feel 82.2 5.04 75.7 88.7
3 sigma 1282. 40.9 1231. 1336.
4 mean_PPD 3487. 81.1 3384. 3592.
The estimated (posterior median) model is \[-2194 + 82.2 X\]
The 80% credible interval for \(\beta_1\) is \((75.7, 88.7)\).
Given the data, the probability that \(\beta_1\) is between 75.7 and 88.7 is 80%, i.e., \(P(\beta_1 \in (75.7, 88.7) \mid \mathbf{y}, \mathbf{x}) = 80\%\).
# Store the 4 chains for each parameter in 1 data frame
bike_model_df <- as.data.frame(bike_model)
nrow(bike_model_df)
[1] 20000

The first six posterior draws (e.g., via head()):

  (Intercept) temp_feel sigma
1       -2303      83.7  1294
2       -2483      86.5  1332
3       -1998      78.0  1236
4       -2240      81.2  1281
5       -2252      82.3  1309
6       -2272      83.9  1272
How to obtain \(P(\beta_1 > 0 \mid \mathbf{y}, \mathbf{x})\)?
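With the posterior draws in hand, a minimal sketch: approximate the probability by the proportion of slope draws that are positive.

```r
# Monte Carlo approximation of P(beta_1 > 0 | y, x)
mean(bike_model_df$temp_feel > 0)
```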
For each posterior draw \(\{\beta_0^{(t)}, \beta_1^{(t)}, \sigma^{(t)} \}_{t = 1}^{20000}\), we have the posterior predictive distribution \[Y_i^{(t)} \sim N\left(\beta_0^{(t)} + \beta_1^{(t)}X_i, \, (\sigma^{(t)})^2\right)\]
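A sketch of this simulation at a hypothetical new temperature, temp_feel = 75 (the value 75 is illustrative, not from the lecture):

```r
# One posterior predictive draw of Y per posterior sample of (beta_0, beta_1, sigma)
mu    <- bike_model_df$`(Intercept)` + bike_model_df$temp_feel * 75
y_new <- rnorm(nrow(bike_model_df), mean = mu, sd = bike_model_df$sigma)
quantile(y_new, c(0.10, 0.90))  # an 80% posterior predictive interval
```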
Lasso and Ridge regression can each be interpreted as a Bayesian regression model.
Both use the same data-level likelihood \[Y_i \mid \boldsymbol \beta, \sigma \stackrel{ind}{\sim} N\left(\mu_i, \sigma^2 \right) \quad \text{with} \quad \mu_i = \mathbf{x}_i'\boldsymbol \beta\]
We use prior distributions to regularize how parameters behave.
Lasso: \(\beta_j \stackrel{iid}{\sim} \text{Laplace}\left(0, \tau(\lambda)\right)\)
Ridge: \(\beta_j \stackrel{iid}{\sim} N\left(0, \tau(\lambda)\right)\)
Lasso
\[\beta_j \stackrel{iid}{\sim} \text{Laplace}\left(0, \tau(\lambda)\right)\]
Lasso solution is the posterior mode of \(\boldsymbol \beta\)
\[\boldsymbol \beta^{(l)} = \mathop{\mathrm{arg\,max}}_{\boldsymbol \beta} \,\,\, \pi(\boldsymbol \beta\mid \mathbf{y}, \mathbf{x})\]
Ridge
\[\beta_j \stackrel{iid}{\sim} N\left(0, \tau(\lambda)\right)\]
Ridge solution is the posterior mode/mean of \(\boldsymbol \beta\)
\[\boldsymbol \beta^{(r)} = \mathop{\mathrm{arg\,max}}_{\boldsymbol \beta} \,\, \, \pi(\boldsymbol \beta\mid \mathbf{y}, \mathbf{x})\]
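A sketch contrasting the two priors on an arbitrary unit scale: the Laplace density's sharp peak at zero is what allows the posterior mode to land exactly at zero (sparsity), while the smooth normal prior only shrinks coefficients toward zero.

```r
# Laplace density is not in base R, so define it (location 0, scale b)
dlaplace <- function(x, b = 1) exp(-abs(x) / b) / (2 * b)

x <- seq(-4, 4, length.out = 401)
plot(x, dlaplace(x), type = "l", ylab = "prior density")  # sharp peak at 0
lines(x, dnorm(x), lty = 2)                               # smooth at 0
legend("topright", c("Laplace (lasso)", "normal (ridge)"), lty = 1:2)
```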