MSSC 6250 Statistical Machine Learning
“Multivariate” \(=\) two or more random variables
\(\mathbf{Y}\in \mathbb{R}^d \sim N_d\left(\boldsymbol \mu, \boldsymbol \Sigma\right)\)
\[\mathbf{Y}= \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}\] \[\boldsymbol \mu= \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}\] \[\boldsymbol \Sigma= \begin{pmatrix} \sigma_1^2 & \rho_{12}\sigma_1\sigma_2\\ \rho_{12}\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}\]
\[\boldsymbol \mu= \begin{pmatrix} 0 \\ 0 \end{pmatrix}\] \[\boldsymbol \Sigma= \begin{pmatrix}1 & 0\\ 0 & 1 \end{pmatrix}\]
\[\boldsymbol \mu= \begin{pmatrix} 0 \\ 1 \end{pmatrix}\] \[\boldsymbol \Sigma= \begin{pmatrix}1 & 0\\ 0 & 0.2 \end{pmatrix}\]
\[\boldsymbol \mu= \begin{pmatrix} 0 \\ 0 \end{pmatrix}\] \[\boldsymbol \Sigma= \begin{pmatrix}1 & 0.9\\ 0.9 & 1 \end{pmatrix}\]
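As a quick numerical check, here is a Python sketch (assuming `numpy` is available) that draws samples from the three bivariate normal settings above; the sample means and correlations should roughly match \(\boldsymbol \mu\) and \(\rho_{12}\). The sample size of 500 is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# The three (mu, Sigma) settings from the slides
settings = [
    (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]])),
    (np.array([0.0, 1.0]), np.array([[1.0, 0.0], [0.0, 0.2]])),
    (np.array([0.0, 0.0]), np.array([[1.0, 0.9], [0.9, 1.0]])),
]

for mu, Sigma in settings:
    samples = rng.multivariate_normal(mu, Sigma, size=500)  # 500 draws of (Y1, Y2)
    print("sample mean:", np.round(samples.mean(axis=0), 2),
          " sample corr:", np.round(np.corrcoef(samples.T)[0, 1], 2))
```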
\(d = 5\)
\[\boldsymbol \mu= \begin{pmatrix} 0 \\ 0 \\ 0\\0\\0\end{pmatrix} \quad \quad \boldsymbol \Sigma= \begin{pmatrix}1 & 0.99 & 0.98 & 0.97 & 0.96\\ 0.99 & 1 & 0.99 & 0.98 & 0.97 \\ 0.98 & 0.99 & 1 & 0.99 & 0.98 \\ 0.97 & 0.98 & 0.99 & 1 & 0.99\\0.96 & 0.97 & 0.98 & 0.99 & 1 \end{pmatrix}\]
\(d = 50\)
\[\boldsymbol \mu= \begin{pmatrix} 0 \\ 0 \\ \vdots \\0\\0\end{pmatrix} \quad \quad \boldsymbol \Sigma= \begin{pmatrix} 1 & 0.99 & 0.98 & 0.97 & 0.96 & \cdots \\ 0.99 & 1 & 0.99 & 0.98 & 0.97 & \cdots\\ 0.98 & 0.99 & 1 & 0.99 & 0.98 & \cdots\\ 0.97 & 0.98 & 0.99 & 1 & 0.99 & \cdots\\ 0.96 & 0.97 & 0.98 & 0.99 & 1 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \vdots & \ddots \end{pmatrix}\]
Each line is one sample (path).
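To make the "each line is one sample (path)" picture concrete, here is a Python sketch (using `numpy` and `matplotlib`, both assumed available) that draws a few samples from the \(d = 50\) normal with the covariance pattern \(\Sigma_{ij} = 1 - 0.01\,|i - j|\) shown above and plots each draw against the component index.

```python
import numpy as np
import matplotlib.pyplot as plt

d = 50
idx = np.arange(d)
# Covariance following the slide's pattern: Sigma_ij = 1 - 0.01 * |i - j|
Sigma = 1.0 - 0.01 * np.abs(idx[:, None] - idx[None, :])
mu = np.zeros(d)

rng = np.random.default_rng(1)
samples = rng.multivariate_normal(mu, Sigma, size=5)  # five sample "paths"

for y in samples:
    plt.plot(idx + 1, y)          # each line is one sample (path)
plt.xlabel("component index")
plt.ylabel("sampled value")
plt.show()
```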
Think of a Gaussian process as an infinite-dimensional distribution over functions.
A stochastic process \(f(x), x \in \mathcal{X} \subset \mathbb{R}^D\), is a collection of random variables: for every value of \(x\), \(f(x)\) is a random variable.
Usually for \(D = 1\), the process is a temporal process, and for \(D > 1\), it is referred to as a spatial process.
For any finite collection of inputs \(x_1, \dots, x_n\), \[f(x_1), \dots, f(x_n) \sim N_n\left(\boldsymbol \mu, \boldsymbol \Sigma\right)\]
\[\text{E}(f(x)) = m(x)\] \[\text{Cov}(f(x), f(x')) = k(x, x')\]
Popular choices of \(m(\cdot)\) are \(m(x) = 0\) or \(m(x) = \text{const}\) for all \(x\), or \(m(x) = \beta'x\)
We care more about the covariance function (kernel function), as it governs how the process looks by defining the similarity between data points.
\[\text{Cov}(f(x), f(x')) = k(x, x')\]
\[k(x, x' \mid \tau, h) = \tau^2 \exp\left(-\frac{(x - x')^2}{2h^2} \right)\]
\[k(x, x') = \exp\left(-\frac{1}{2}(x - x')^2 \right)\]
\[k(x, x') = \exp\left(-\frac{1}{2}\frac{(x - x')^2}{ {\color{red}{0.25}}^2} \right)\]
\[k(x, x') = \exp\left(-\frac{1}{2}\frac{(x - x')^2}{ {\color{red}{4}}^2} \right)\]
\[k(x, x') = {\color{red}{10^2}}\exp\left(-\frac{1}{2}(x - x')^2 \right)\]
\[k(x, x' \mid \tau, h) = \tau^2 \exp\left(-\frac{|x - x'|}{h} \right)\]
\[k(x, x') = \min(x, x')\]
\[k(x, x') = \begin{cases} 1 & \quad \text{if } x = x' \\ 0 & \quad \text{otherwise} \end{cases}\]
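For reference, the kernels above can be written directly as Python functions; this is a sketch with illustrative parameter defaults (`tau=1`, `h=1`), not a particular library's API.

```python
import numpy as np

def se_kernel(x, xp, tau=1.0, h=1.0):
    """Squared exponential kernel: tau^2 * exp(-(x - xp)^2 / (2 h^2))."""
    return tau**2 * np.exp(-(x - xp) ** 2 / (2.0 * h**2))

def exp_kernel(x, xp, tau=1.0, h=1.0):
    """Exponential (Ornstein-Uhlenbeck) kernel: tau^2 * exp(-|x - xp| / h)."""
    return tau**2 * np.exp(-np.abs(x - xp) / h)

def brownian_kernel(x, xp):
    """Brownian motion kernel: min(x, xp)."""
    return np.minimum(x, xp)

def white_noise_kernel(x, xp):
    """White noise kernel: 1 if x == xp, 0 otherwise."""
    return np.where(x == xp, 1.0, 0.0)

# Build a covariance matrix over a grid by evaluating k on all pairs
x = np.linspace(0, 5, 6)
K = se_kernel(x[:, None], x[None, :], tau=1.0, h=0.25)
print(np.round(K, 3))
```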
\[\mathbf{Y}= \begin{pmatrix} \mathbf{Y}_1 \\ \mathbf{Y}_2 \end{pmatrix} \sim N\left(\boldsymbol \mu, \boldsymbol \Sigma\right), \quad \boldsymbol \mu= \begin{pmatrix} \boldsymbol \mu_1 \\ \boldsymbol \mu_2 \end{pmatrix}, \quad \boldsymbol \Sigma= \begin{pmatrix} \Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}\] \[(\mathbf{Y}_2 \mid \mathbf{Y}_1 = \mathbf{y}_1) \sim N\left(\boldsymbol \mu_2 + \Sigma_{21} \Sigma_{11}^{-1} (\mathbf{y}_1 - \boldsymbol \mu_1), \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}\right)\]
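The conditioning formula is straightforward to implement. Below is a Python sketch of a hypothetical helper `conditional_mvn` that returns the conditional mean and covariance, checked on a bivariate example with correlation 0.9.

```python
import numpy as np

def conditional_mvn(mu1, mu2, S11, S12, S21, S22, y1):
    """Mean and covariance of Y2 | Y1 = y1 for a partitioned multivariate normal."""
    A = S21 @ np.linalg.inv(S11)                # Sigma_21 Sigma_11^{-1}
    cond_mean = mu2 + A @ (y1 - mu1)
    cond_cov = S22 - A @ S12
    return cond_mean, cond_cov

# Bivariate example: correlation 0.9, condition on Y1 = 1
mu1, mu2 = np.array([0.0]), np.array([0.0])
S11, S12 = np.array([[1.0]]), np.array([[0.9]])
S21, S22 = np.array([[0.9]]), np.array([[1.0]])
m, V = conditional_mvn(mu1, mu2, S11, S12, S21, S22, np.array([1.0]))
print(m, V)   # conditional mean 0.9, conditional variance 1 - 0.81 = 0.19
```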
\[f(x_1), \dots, f(x_n), f(x) \sim N(\boldsymbol \mu, \boldsymbol \Sigma)\]
\[f(x) \mid f(x_1), \dots, f(x_n) \sim N(\boldsymbol \mu^*, \boldsymbol \Sigma^*)\] where \(\boldsymbol \mu^*\) and \(\boldsymbol \Sigma^*\) are as on the previous slide.
\[f(\cdot) \sim GP(0, k(\cdot, \cdot))\]
\[f(x_1), \dots, f(x_n), f(x^*) \sim N\left(0, \boldsymbol \Sigma\right)\]
\[\boldsymbol \Sigma= \left(\begin{array}{ccc|c} k(x_1, x_1) & \cdots & k(x_1, x_n) & k(x_1, x^*) \\ \vdots & & \vdots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) & k(x_n, x^*) \\ \hline k(x^*, x_1) & \cdots & k(x^*, x_n) & k(x^*, x^*) \end{array} \right) = \left(\begin{array}{ccc|c} & & & \\ & K & & K_* \\ & & & \\ \hline & K_*^T & & K_{**} \end{array} \right) \]
\[f(x^*) \mid f(x_1), \dots, f(x_n) \sim N\left( \boldsymbol \mu^*, \boldsymbol \Sigma^*\right)\] where
\[\boldsymbol \mu^* = K_{*}^TK^{-1}{\bf f}\] with \({\bf f} = \left(f(x_1), \dots, f(x_n)\right)^T\)
\[\boldsymbol \Sigma^* = K_{**} - K_{*}^TK^{-1}K_{*}\]
Figure legend: posterior mean \(\boldsymbol \mu^*\); 95% posterior predictive interval \(\boldsymbol \mu^* \pm 1.96\,\sqrt{\operatorname{diag}(\boldsymbol \Sigma^*)}\)
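Putting the pieces together, here is a Python sketch of noise-free GP interpolation with the squared exponential kernel: it builds \(K\), \(K_*\), \(K_{**}\), computes \(\boldsymbol \mu^*\) and \(\boldsymbol \Sigma^*\), and plots the posterior mean with a 95% band. The training inputs and the sine function used to generate \(\mathbf f\) are purely illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

def se_kernel(x1, x2, tau=1.0, h=1.0):
    # Squared exponential kernel evaluated on all pairs of x1 and x2
    return tau**2 * np.exp(-(x1[:, None] - x2[None, :]) ** 2 / (2.0 * h**2))

# Noise-free training "data": observed function values f at a few inputs
x_train = np.array([-3.0, -1.0, 0.0, 2.0])
f_train = np.sin(x_train)                      # illustrative true function
x_star = np.linspace(-5, 5, 200)

K = se_kernel(x_train, x_train) + 1e-10 * np.eye(len(x_train))  # jitter for stability
K_s = se_kernel(x_train, x_star)
K_ss = se_kernel(x_star, x_star)

K_inv = np.linalg.inv(K)
mu_star = K_s.T @ K_inv @ f_train              # posterior mean K_*^T K^{-1} f
Sigma_star = K_ss - K_s.T @ K_inv @ K_s        # posterior covariance
sd_star = np.sqrt(np.clip(np.diag(Sigma_star), 0, None))

plt.plot(x_star, mu_star, label="posterior mean")
plt.fill_between(x_star, mu_star - 1.96 * sd_star, mu_star + 1.96 * sd_star,
                 alpha=0.3, label="95% interval")
plt.scatter(x_train, f_train, color="black")   # posterior curves interpolate these
plt.legend()
plt.show()
```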
So far we have treated the function values \(f(x_i)\) themselves as the data points.
There is no noise, so every posterior curve interpolates the data points.
\[\begin{align} y_i &= f(x_i) + \epsilon_i, \quad \epsilon_i\sim N(0, \sigma^2) \\ f(\cdot) &\sim GP(0, k(\cdot, \cdot; \theta))\end{align}\]
\[y_1, \dots, y_n, f(x^*) \sim N\left(0, \boldsymbol \Sigma\right), \quad \boldsymbol \Sigma= \left(\begin{array}{ccc|c} & & & \\ & K + \sigma^2I & & K_* \\ & & & \\ \hline & K_*^T & & K_{**} \end{array} \right) \]
\[f(x^*) \mid y_1, \dots, y_n \sim N\left( \boldsymbol \mu^*, \boldsymbol \Sigma^*\right), \quad \boldsymbol \mu^* = K_{*}^T{\color{red}{(K + \sigma^2I)^{-1}}}{\bf y}, \quad \boldsymbol \Sigma^* = K_{**} - K_{*}^T{\color{red}{(K + \sigma^2I)^{-1}}}K_{*}\] with \({\bf y} = \left(y_1, \dots, y_n\right)^T\)
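The only change from the noise-free sketch is replacing \(K^{-1}\) with \((K + \sigma^2 I)^{-1}\) and conditioning on the noisy \(\mathbf y\) instead of \(\mathbf f\). A minimal Python sketch, with an assumed noise level \(\sigma = 0.1\) and illustrative data:

```python
import numpy as np

def se_kernel(x1, x2, tau=1.0, h=1.0):
    return tau**2 * np.exp(-(x1[:, None] - x2[None, :]) ** 2 / (2.0 * h**2))

rng = np.random.default_rng(2)
sigma2 = 0.1**2                                  # assumed noise variance

x_train = np.linspace(-4, 4, 20)
y_train = np.sin(x_train) + rng.normal(0, np.sqrt(sigma2), size=20)  # noisy y_i
x_star = np.linspace(-5, 5, 100)

K = se_kernel(x_train, x_train)
K_s = se_kernel(x_train, x_star)
K_ss = se_kernel(x_star, x_star)

# Only change from the noise-free case: K^{-1} becomes (K + sigma^2 I)^{-1}
A = np.linalg.inv(K + sigma2 * np.eye(len(x_train)))
mu_star = K_s.T @ A @ y_train                   # posterior mean
Sigma_star = K_ss - K_s.T @ A @ K_s             # posterior covariance
print(np.round(mu_star[:5], 3))
```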
Use the observed data to estimate the parameters \(\theta = (\sigma^2, \tau^2, h)\).
Find an empirical Bayes estimate for \(\theta\) from the marginal likelihood
\[ p\left( \mathbf{y}\mid \theta \right) = \int p\left( \mathbf{y}\mid {\bf f} \right) p\left( {\bf f} \mid \theta \right) \, d {\bf f} = N\left(\mathbf{0}, K(\tau, h)+\sigma^2I \right)\]
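A Python sketch of the empirical Bayes step: minimize the negative log marginal likelihood \(-\log N(\mathbf y \mid \mathbf 0, K(\tau, h) + \sigma^2 I)\) over \((\sigma^2, \tau, h)\) on the log scale. The optimizer, starting values, and simulated data are illustrative, not prescribed by the slides.

```python
import numpy as np
from scipy.optimize import minimize

def se_kernel(x1, x2, tau, h):
    return tau**2 * np.exp(-(x1[:, None] - x2[None, :]) ** 2 / (2.0 * h**2))

def neg_log_marginal(log_params, x, y):
    # Work on the log scale so sigma^2, tau, h stay positive during optimization
    sigma2, tau, h = np.exp(log_params)
    C = se_kernel(x, x, tau, h) + (sigma2 + 1e-8) * np.eye(len(x))  # K + sigma^2 I
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))             # C^{-1} y
    # -log N(y | 0, C), dropping the additive constant (n/2) log(2 pi)
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L)))

rng = np.random.default_rng(3)
x = np.linspace(-4, 4, 40)
y = np.sin(x) + rng.normal(0, 0.1, size=40)      # illustrative data

res = minimize(neg_log_marginal, x0=np.log([0.1, 1.0, 1.0]), args=(x, y))
sigma2_hat, tau_hat, h_hat = np.exp(res.x)
print(sigma2_hat, tau_hat, h_hat)                # empirical Bayes estimates
```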
\[\begin{align} y_i &= f(x_i) + \epsilon_i, \, \, \epsilon_i \stackrel{\rm iid}{\sim} N(0, \sigma^2), \quad i = 1, \dots, n,\\ f(\cdot) &\sim GP\left(\mu, k(\cdot, \cdot)\right),\,\, \text{Cov}(f(x_i), f(x_j)) = k(x_i, x_j)\\ \sigma^2 &\sim IG(a_{\sigma}, b_{\sigma})\\ \tau^2 &\sim IG(a_{\tau}, b_{\tau})\\ h &\sim Ga(a_{h}, b_{h})\\ \mu & \sim N(0, b_{\mu}) \end{align}\]
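In the fully Bayesian version, an MCMC sampler would target the (unnormalized) log posterior of \((\sigma^2, \tau^2, h, \mu)\) with \(\mathbf f\) integrated out. A Python sketch of that log posterior, with arbitrary illustrative hyperparameter values for \((a_\sigma, b_\sigma, a_\tau, b_\tau, a_h, b_h, b_\mu)\):

```python
import numpy as np
from scipy.stats import invgamma, gamma, norm, multivariate_normal

def se_kernel(x1, x2, tau, h):
    return tau**2 * np.exp(-(x1[:, None] - x2[None, :]) ** 2 / (2.0 * h**2))

def log_posterior(params, x, y, a_s=2, b_s=1, a_t=2, b_t=1, a_h=2, b_h=1, b_mu=10):
    sigma2, tau2, h, mu = params
    if sigma2 <= 0 or tau2 <= 0 or h <= 0:
        return -np.inf
    # Marginal likelihood with f integrated out: y ~ N(mu 1, K + sigma^2 I)
    C = se_kernel(x, x, np.sqrt(tau2), h) + sigma2 * np.eye(len(x))
    loglik = multivariate_normal(mean=np.full(len(x), mu), cov=C).logpdf(y)
    # Priors: IG on sigma^2 and tau^2, Gamma on h, Normal on mu (illustrative values)
    logprior = (invgamma(a_s, scale=b_s).logpdf(sigma2)
                + invgamma(a_t, scale=b_t).logpdf(tau2)
                + gamma(a_h, scale=1.0 / b_h).logpdf(h)
                + norm(0, np.sqrt(b_mu)).logpdf(mu))
    return loglik + logprior

rng = np.random.default_rng(4)
x = np.linspace(-4, 4, 30)
y = np.sin(x) + rng.normal(0, 0.1, size=30)      # illustrative data
print(log_posterior(np.array([0.01, 1.0, 1.0, 0.0]), x, y))
```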