MSSC 6250 Statistical Machine Learning
Supervised learning investigates and models the relationships between responses and inputs.
Can you come up with any real-world examples where variables are related deterministically?
Can you provide some real examples where variables are related to each other, but not perfectly related?
💵 In general, one with more years of education earns more.
💵 Any two with the same years of education may have different annual income.
Where does the unexplained variation come from?
What other factors (variables) may affect a person’s income?
your income = f(years of education, major, GPA, college, parent's income, ...)
In Intro Stats, what is the form of f, and what assumptions did you make on the random error \epsilon?
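One common Intro Stats answer (an assumption for illustration, not the only choice): f is linear in a single predictor, and the errors are i.i.d. normal,

y = f(x) + \epsilon = \beta_0 + \beta_1 x + \epsilon, \quad \epsilon \overset{iid}{\sim} N(0, \sigma^2)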
[Figure: income and years of education]
Big problem: f(x) is unknown and needs to be estimated.
Parametric (Linear regression)
Nonparametric (LOESS)
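Below is a minimal sketch contrasting the two estimation styles using Python with `numpy` and `statsmodels`; the data-generating function and all settings are illustrative assumptions, not from the notes.

```python
# Sketch: fit a parametric model (simple linear regression) and a
# nonparametric one (LOESS) to the same simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6250)            # seed chosen arbitrarily
x = np.sort(rng.uniform(0, 3, 100))
y = np.sin(2 * x) + rng.normal(0, 0.3, 100)  # a hypothetical nonlinear truth

# Parametric: y = b0 + b1*x, estimated by least squares
X = sm.add_constant(x)
linear_fit = sm.OLS(y, X).fit()
yhat_linear = linear_fit.predict(X)

# Nonparametric: LOESS (locally weighted regression); frac controls smoothness
yhat_loess = sm.nonparametric.lowess(y, x, frac=0.3, return_sorted=False)
```

The linear fit is a straight line no matter how curved the truth is; LOESS lets the data pick the shape locally, at the cost of more flexibility to chase noise.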
All models are wrong, but some are useful. – George Box (1919-2013)
For any given training data, decide which method produces the best results.
Selecting the best approach is one of the most challenging parts of machine learning.
We need some way to measure how well a model's predictions actually match the training/test data.
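For regression, the standard measure is the mean squared error: computed on the n training points it gives \text{MSE}_{\texttt{Tr}}, and on m held-out test points \text{MSE}_{\texttt{Te}}:

\text{MSE}_{\texttt{Tr}} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{f}(x_i)\right)^2, \qquad \text{MSE}_{\texttt{Te}} = \frac{1}{m}\sum_{j=1}^{m}\left(y^{\texttt{Te}}_j - \hat{f}(x^{\texttt{Te}}_j)\right)^2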
Are \text{MSE}_{\texttt{Tr}} and \text{MSE}_{\texttt{Te}} the same? When to use which?
Which is smaller, \text{MSE}_{\texttt{Tr}} or \text{MSE}_{\texttt{Te}}?
A more complex model produces a more flexible or wiggly regression curve \hat{f}(x) that matches the training data better.
y = \beta_0+ \beta_1x + \beta_2x^2 + \cdots + \beta_{10}x^{10} + \epsilon is more complex than y = \beta_0+ \beta_1x + \epsilon
Overfitting: A model that is too complex fits the training data extremely well and too hard, picking up patterns and variation caused purely by random noise; these are not properties of the true f and do not exist in any unseen test data.
Underfitting: A model that is too simple to capture the complex patterns or shape of the true f(x). The estimate \hat{f}(x) is rigid and stays far from the data.
How do \text{MSE}_{\texttt{Tr}} and \text{MSE}_{\texttt{Te}} change with model complexity?
It’s common that no test data are available. Can we select the model that minimizes \text{MSE}_{\texttt{Tr}}, since the training data and test data appear to be closely related?
| MSE | Overfit | Underfit |
|---|---|---|
| Train | tiny | big |
| Test | big | big |
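A small simulation sketch makes the table concrete (the true function, noise level, and sample sizes below are illustrative assumptions, not from the notes): training MSE keeps falling as the polynomial degree grows, while test MSE falls and then rises.

```python
# Sketch: train/test MSE as polynomial degree (model complexity) increases.
import numpy as np

rng = np.random.default_rng(1)
def truth(x):                        # hypothetical true f: a quadratic
    return 1 + 2 * x - x**2

x_tr = rng.uniform(-1, 2, 50);  y_tr = truth(x_tr) + rng.normal(0, 0.5, 50)
x_te = rng.uniform(-1, 2, 500); y_te = truth(x_te) + rng.normal(0, 0.5, 500)

for degree in [1, 2, 5, 9]:
    coefs = np.polyfit(x_tr, y_tr, degree)   # least-squares polynomial fit
    mse_tr = np.mean((y_tr - np.polyval(coefs, x_tr))**2)
    mse_te = np.mean((y_te - np.polyval(coefs, x_te))**2)
    print(f"degree {degree}: MSE_Tr = {mse_tr:.3f}, MSE_Te = {mse_te:.3f}")
```

Training MSE is monotone decreasing in the degree, but test MSE traces the familiar U shape: the degree-9 fit beats everything on the training set and loses on the test set.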
Given any new input x_0,
\text{MSE}_{\hat{f}} = E\left[\left(\hat{f}(x_0) - f(x_0)\right)^2\right] = \left[\text{Bias}\left(\hat{f}(x_0) \right)\right]^2 + \text{Var}\left(\hat{f}(x_0)\right)
where \text{Bias}\left(\hat{f}(x_0) \right) = E\left[ \hat{f}(x_0)\right] - f(x_0).
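The decomposition follows by adding and subtracting E[\hat{f}(x_0)] inside the square; the cross term vanishes in expectation because E[\hat{f}(x_0)] - f(x_0) is a constant:

E\left[\left(\hat{f}(x_0) - f(x_0)\right)^2\right] = E\left[\left(\hat{f}(x_0) - E[\hat{f}(x_0)] + E[\hat{f}(x_0)] - f(x_0)\right)^2\right] = \text{Var}\left(\hat{f}(x_0)\right) + \left[\text{Bias}\left(\hat{f}(x_0)\right)\right]^2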
The expected test MSE of y_0 at x_0 is \text{MSE}_{y_0} = E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] = \text{MSE}_{\hat{f}} + \text{Var}(\epsilon)
Note
We never know the true expected test MSE, so we prefer the model with the smallest estimate of the expected test MSE.
Overfitting: Low bias and High variance
Underfitting: High bias and Low variance
Model 1: Under-fitting y = \beta_0+\beta_1x+\epsilon
Model 2: Right-fitting y = \beta_0+\beta_1x+ \beta_2x^2 + \epsilon
Model 3: Over-fitting y = \beta_0+\beta_1x+ \beta_2x^2 + \cdots + \beta_9x^9 + \epsilon
To see expectation/bias and variance, we need replicates of training data.
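Here is a sketch of that idea in Python (all settings are illustrative assumptions): draw many training sets from an assumed quadratic truth, fit Models 1–3 to each, and use the replicated values of \hat{f}(x_0) to estimate the bias and variance at a fixed x_0.

```python
# Sketch: Monte Carlo estimates of bias^2 and variance of f_hat(x0) for
# Model 1 (degree 1), Model 2 (degree 2), and Model 3 (degree 9).
import numpy as np

rng = np.random.default_rng(42)
def truth(x):                          # assumed true f: a quadratic
    return 1 + 2 * x - x**2

x0 = 1.0                               # evaluation point
n, reps = 50, 1000                     # training size, number of replicates

for degree in [1, 2, 9]:
    fhat_x0 = np.empty(reps)
    for r in range(reps):
        x = rng.uniform(-1, 2, n)      # a fresh replicate of the training data
        y = truth(x) + rng.normal(0, 0.5, n)
        fhat_x0[r] = np.polyval(np.polyfit(x, y, degree), x0)
    bias = fhat_x0.mean() - truth(x0)
    print(f"degree {degree}: bias^2 = {bias**2:.4f}, var = {fhat_x0.var():.4f}")
```

The degree-1 fit shows large squared bias with small variance (underfitting); the degree-9 fit shows small bias with large variance (overfitting); the degree-2 fit, which matches the assumed truth, keeps both small.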