MSSC 6250 Statistical Machine Learning
A neural network takes an input vector of \(p\) variables \(X = (X_1, X_2, \dots, X_p)\) and builds a nonlinear function \(f(X)\) to predict the response \(Y\).
A neural network with several hidden layers is called a deep neural network, or deep learning.
Starting from the inputs \(X\), each hidden neuron \(A_k\), \(k = 1, \dots, K\), computes
\(A_k(X) = g(w_{k0} + w_{k1}X_1 + \cdots + w_{kp}X_p)\)
\(f(X) = g_f(\beta_0 + \beta_1A_1(X) + \cdots + \beta_KA_K(X))\)
\(g(z)\) and \(g_f(z)\) are (non)linear activation functions that are specified in advance.
\(\beta_0, \dots, \beta_K\) and \(w_{10}, \dots, w_{1p}, \dots, w_{K0}, \dots, w_{Kp}\) are parameters to be estimated.
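To make the model concrete, here is a minimal NumPy sketch of the forward pass described above. The ReLU hidden activation, the identity output activation, and the array shapes are illustrative assumptions, not choices prescribed by these slides.

```python
import numpy as np

def relu(z):
    # One possible hidden-layer activation g(z)
    return np.maximum(0, z)

def forward(x, W, beta, g=relu, g_f=lambda z: z):
    """Forward pass of a single-hidden-layer network.

    x    : (p,) input vector
    W    : (K, p + 1) hidden-layer weights; row k is (w_k0, w_k1, ..., w_kp)
    beta : (K + 1,) output-layer weights (beta_0, beta_1, ..., beta_K)
    """
    # Hidden neurons A_k(X) = g(w_k0 + w_k1 X_1 + ... + w_kp X_p)
    A = g(W[:, 0] + W[:, 1:] @ x)            # shape (K,)
    # Output f(X) = g_f(beta_0 + beta_1 A_1(X) + ... + beta_K A_K(X))
    return g_f(beta[0] + beta[1:] @ A)

# Toy example with p = 3 inputs and K = 4 hidden neurons
rng = np.random.default_rng(0)
x = rng.normal(size=3)
W = rng.normal(size=(4, 4))      # (K, p + 1)
beta = rng.normal(size=5)        # (K + 1,)
print(forward(x, W, beta))
```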
Can we represent a linear regression model as a neural network?
YES! Linear regression is a single-layer neural network with identity activation functions, \(g(z) = g_f(z) = z\), so that \(f(X)\) is linear in \(X\).
Can we represent a binary logistic regression model as a neural network?
YES! Binary logistic regression is a single-layer neural network with an identity hidden activation and the sigmoid output activation \(g_f(z) = e^z/(1+e^z)\).
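As an illustration of the two answers above: with identity hidden activations, the hidden layer collapses into a single linear predictor \(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p\) (coefficients relabeled), so only the output activation distinguishes the two models. A minimal sketch, with function names of my own choosing:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def single_layer(x, beta, g_f):
    # With identity hidden activations the network reduces to
    # f(X) = g_f(beta_0 + beta_1 X_1 + ... + beta_p X_p)
    return g_f(beta[0] + beta[1:] @ x)

x = np.array([1.5, -0.3, 2.0])
beta = np.array([0.5, 1.0, -2.0, 0.25])

# Identity output activation: linear regression, f(X) = E(Y | X)
print(single_layer(x, beta, g_f=lambda z: z))

# Sigmoid output activation: binary logistic regression, f(X) = P(Y = 1 | X)
print(single_layer(x, beta, g_f=sigmoid))
```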
Activation functions are usually continuous for optimization purposes.
The activation function used in hidden layers is typically chosen based on the type of neural network architecture.
Never use the softmax or identity functions in the hidden layers.
For the output activation function, the choice depends on the type of response: the identity function for regression, the sigmoid function for binary classification, and the softmax function for multiclass classification.
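For reference, here is a sketch of commonly used activation functions; the particular set shown (ReLU, sigmoid, tanh for hidden layers; identity, sigmoid, softmax for the output) is a standard menu, not a list taken from these slides.

```python
import numpy as np

# Common hidden-layer activations
relu    = lambda z: np.maximum(0, z)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
tanh    = np.tanh

# Common output activations, matched to the response type
identity = lambda z: z                 # regression
# sigmoid (above)                      # binary classification
def softmax(z):                        # multiclass classification
    e = np.exp(z - np.max(z))          # subtract the max for numerical stability
    return e / e.sum()

z = np.array([-1.0, 0.5, 2.0])
print(relu(z), sigmoid(z), softmax(z))
```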
Neural networks tend to perform well when:
The signal-to-noise ratio is high.
The sample size is huge.
Interpretability of the model is not a priority.
When possible, try the simpler models as well, and then make a choice based on the performance/complexity tradeoff.
Occam’s razor principle: when faced with several methods that give roughly equivalent performance, pick the simplest.
\[\min_{\boldsymbol \beta, \{\mathbf{w}_k\}_{k=1}^K} \frac{1}{2}\sum_{i=1}^n\left(y_i - f(x_i) \right)^2,\] where \[f(x_i) = \beta_0 + \sum_{k=1}^K \beta_k g\left( w_{k0} + \sum_{j=1}^p w_{kj} x_{ij}\right).\]
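The objective above can be evaluated directly. In the sketch below, the array shapes, the ReLU hidden activation, and the function name `nn_objective` are my own illustrative choices.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def nn_objective(y, X, W, beta, g=relu):
    """Least-squares objective (1/2) * sum_i (y_i - f(x_i))^2 for a
    single-hidden-layer network.

    y    : (n,) responses
    X    : (n, p) inputs
    W    : (K, p + 1) hidden-layer weights, intercepts in column 0
    beta : (K + 1,) output-layer weights
    """
    A = g(W[:, [0]].T + X @ W[:, 1:].T)   # (n, K) hidden activations
    f = beta[0] + A @ beta[1:]            # (n,) fitted values
    return 0.5 * np.sum((y - f) ** 2)

# Toy usage with random data
rng = np.random.default_rng(1)
n, p, K = 50, 3, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
W = rng.normal(size=(K, p + 1))
beta = rng.normal(size=K + 1)
print(nn_objective(y, X, W, beta))
```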
For the objective function \(R(\theta)\) and the parameter vector \(\theta\) to be estimated, gradient descent repeatedly moves to a new parameter value that reduces the objective, stopping when the objective fails to decrease.
\(\theta^{(t+1)} = \theta^{(t)} - \rho\nabla R(\theta^{(t)})\), where \(\rho\) is the learning rate, typically a small value such as 0.001.
\(\nabla R(\theta^{(t)}) = \left. \frac{\partial R(\theta)}{\partial \theta} \right \rvert_{\theta = \theta^{(t)}}\)
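A generic sketch of this update rule; the function `grad_R`, the stopping criterion, and the default learning rate are illustrative assumptions rather than part of the slides.

```python
import numpy as np

def gradient_descent(grad_R, theta0, rho=0.001, max_iter=10000, tol=1e-8):
    """theta_{t+1} = theta_t - rho * grad_R(theta_t), stopping when the
    step size becomes negligible."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        step = rho * grad_R(theta)
        theta = theta - step
        if np.linalg.norm(step) < tol:
            break
    return theta

# Toy usage: minimize R(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
print(gradient_descent(lambda th: 2 * (th - 3), theta0=[0.0], rho=0.1))
```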
\[\begin{align} R(\theta) \overset{\triangle}{=} \sum_{i=1}^n R_i(\theta) &= \frac{1}{2}\sum_{i=1}^n \big(y_i - f_{\theta}(x_i)\big)^2\\ &= \frac{1}{2} \sum_{i=1}^n \big(y_i - \beta_0 - \beta_1 g(w_1' x_i) - \cdots - \beta_K g(w_K' x_i) \big)^2 \end{align}\]
With \(z_{ik} = w_k' x_i\),
\[\frac{\partial R_i(\theta)}{\partial \beta_{k}} = \frac{\partial R_i(\theta)}{\partial f_{\theta}(x_i)} \cdot \frac{\partial f_{\theta}(x_i)}{\partial \beta_{k}} = {\color{red}{-\big( y_i - f_{\theta}(x_i)\big)}} \cdot g(z_{ik})\]
\[\frac{\partial R_i(\theta)}{\partial w_{kj}} = \frac{\partial R_i(\theta)}{\partial f_{\theta}(x_i)} \cdot \frac{\partial f_{\theta}(x_i)}{\partial g(z_{ik})} \cdot \frac{\partial g(z_{ik})}{\partial z_{ik}} \cdot \frac{\partial z_{ik}}{\partial w_{kj}} = {\color{red}{-\big( y_i - f_{\theta}(x_i)\big)}} \cdot \beta_k \cdot g'(z_{ik}) \cdot x_{ij}\]
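These per-observation gradients translate directly into code. The sketch below uses a sigmoid hidden activation so that \(g'(z)\) has a simple closed form; the shapes and names are assumptions for illustration.

```python
import numpy as np

sigmoid  = lambda z: 1.0 / (1.0 + np.exp(-z))
dsigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))    # g'(z)

def gradients_i(y_i, x_i, W, beta, g=sigmoid, gprime=dsigmoid):
    """Gradient of R_i(theta) = (1/2)(y_i - f_theta(x_i))^2 with respect to
    beta and W, following the chain-rule expressions above."""
    x_aug = np.concatenate(([1.0], x_i))      # prepend 1 for the intercepts
    z = W @ x_aug                             # z_{ik} = w_k' x_i, shape (K,)
    A = g(z)                                  # hidden activations g(z_{ik})
    f = beta[0] + beta[1:] @ A                # f_theta(x_i)
    resid = -(y_i - f)                        # the common (red) factor

    grad_beta = resid * np.concatenate(([1.0], A))             # dR_i / dbeta_k
    grad_W = (resid * beta[1:] * gprime(z))[:, None] * x_aug   # dR_i / dw_{kj}
    return grad_beta, grad_W

# Toy usage with p = 3 inputs and K = 4 hidden neurons
rng = np.random.default_rng(2)
x_i = rng.normal(size=3)
W = rng.normal(size=(4, 4))
beta = rng.normal(size=5)
gb, gW = gradients_i(y_i=1.0, x_i=x_i, W=W, beta=beta)
print(gb.shape, gW.shape)   # (5,) and (4, 4)
```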
Stochastic Gradient Descent
Dropout Learning
Convolutional Neural Network (Spatial modeling)
Recurrent Neural Network (Temporal modeling)
Bayesian Deep Learning