MSSC 6250 Statistical Machine Learning
SVMs have been shown to perform well in a variety of settings, and are often considered one of the best “out of the box” classifiers.
Start with the maximal margin classifier (1960s), then the support vector classifier (1990s), and then the support vector machine.
\({\cal D}_n = \{\mathbf{x}_i, y_i\}_{i=1}^n\)
In SVM, we code the binary outcome \(y\) as 1 or -1, representing one class and the other.
The goal is to find a linear classifier \(f(\mathbf{x}) = \beta_0 + \mathbf{x}' \boldsymbol \beta\) so that the classification rule is the sign of \(f(\mathbf{x})\):
\[ \hat{y} = \begin{cases} +1, \quad \text{if} \quad f(\mathbf{x}) > 0\\ -1, \quad \text{if} \quad f(\mathbf{x}) < 0 \end{cases} \]
The equation \(f(\mathbf{x}) = \beta_0 + \mathbf{x}' \boldsymbol \beta= 0\) defines a hyperplane, which is a subspace of dimension \(p-1\) in the \(p\)-dimensional feature space.
For example, when \(p = 2\), \(f(\mathbf{x}) = \beta_0 + \beta_1X_1+\beta_2X_2 = 0\) is a straight line (a hyperplane of dimension one) in 2-dimensional space.
A point is correctly classified when \(y_i f(\mathbf{x}_i) >0\).
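A tiny numeric sketch of this decision rule in R; the coefficients and the observation below are made up purely for illustration:

```r
# Classify one hypothetical observation by the sign of f(x) = beta0 + x' beta
beta0 <- -1
beta  <- c(2, -3)                    # made-up coefficients
x_new <- c(0.5, -0.2)                # a made-up observation
f_val <- beta0 + sum(x_new * beta)   # f(x_new) = -1 + 1 + 0.6 = 0.6
ifelse(f_val > 0, 1, -1)             # f(x_new) > 0, so yhat = +1
```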
If our data can be perfectly separated using a hyperplane, there exists an infinite number of such hyperplanes. But which one is the best?
A natural choice is the maximal margin hyperplane (optimal separating hyperplane), which is the separating hyperplane that is farthest from the training points.
The training points that lie on the dashed lines (the margin boundaries) are the support vectors.
The maximal margin classifier can lead to overfitting when \(p\) is large.
We hope that a classifier with a large margin on the training data will also have a large margin on the test data.
In the linear SVM, \(f(\mathbf{x}) = \beta_0 + \mathbf{x}' \boldsymbol \beta\), and the set of points with \(f(\mathbf{x}) = 0\) is a hyperplane that separates the two classes:
\[\{ \mathbf{x}: \beta_0 + \mathbf{x}'\boldsymbol \beta = 0 \}\]
For this separable case, all observations with \(y_i = 1\) lie on the side where \(f(\mathbf{x}) > 0\), and all observations with \(y_i = -1\) lie on the side where \(f(\mathbf{x}) < 0\).
The distance from any point \(\mathbf{x}_0\) to the hyperplane is
\[\frac{|f(\mathbf{x}_0)|}{\lVert \boldsymbol \beta\rVert}\] For \(p = 2\) and the line \(\beta_0 + \beta_1 X_1 + \beta_2X_2 = 0\), the distance is \[ \frac{ |\beta_0 + \beta_1 x_{01} + \beta_2x_{02}|}{\sqrt{\beta_1^2 + \beta^2_2}}\]
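A quick numeric check of this formula in R; the coefficients and the point \(\mathbf{x}_0\) are hypothetical:

```r
# Distance from a hypothetical point x0 to the line beta0 + beta1*X1 + beta2*X2 = 0
beta0 <- 1
beta  <- c(2, -1)
x0    <- c(1, 1)
f_x0  <- beta0 + sum(x0 * beta)      # f(x0) = 1 + 2 - 1 = 2
abs(f_x0) / sqrt(sum(beta^2))        # 2 / sqrt(5), about 0.894
```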
\[\begin{align} \underset{\boldsymbol \beta, \beta_0}{\text{max}} \quad & M \\ \text{s.t.} \quad & \frac{1}{\lVert \boldsymbol \beta\rVert} y_i(\mathbf{x}_i' \boldsymbol \beta+ \beta_0) \geq M, \,\, i = 1, \ldots, n. \end{align}\]
The constraint requires that each point be on the correct side of the hyperplane, with some cushion.
The scale of \(\boldsymbol \beta\) is arbitrary, so we can fix \(\lVert \boldsymbol \beta\rVert = 1\):
\[\begin{align} \underset{\boldsymbol \beta, \beta_0}{\text{max}} \quad & M \\ \text{s.t.} \quad & \lVert \boldsymbol \beta\rVert = 1, \\ \quad & y_i(\mathbf{x}_i' \boldsymbol \beta+ \beta_0) \geq M, \,\, i = 1, \ldots, n. \end{align}\]
Often, no separating hyperplane exists, so there is no maximal margin classifier.
In that case, the previous optimization problem has no solution with \(M > 0\).
Idea: find a hyperplane that almost separates the classes, using a so-called soft margin. The result is the soft margin classifier (support vector classifier).
Even if a separating hyperplane does exist, the maximum-margin classifier might not be desirable.
The maximal margin hyperplane is extremely sensitive to a change in a single observation: it may overfit the training data (low bias, high variance).
Consider a classifier based on a hyperplane that does NOT perfectly separate the two classes, but offers greater robustness to individual observations and better classification of most of the training points.
It could be worthwhile to misclassify a few training points in order to do a better job in classifying the remaining observations.
Allow some points to be on the incorrect side of the margin, or even the incorrect side of the hyperplane (training points misclassified by the classifier).
\[\begin{align} \underset{\boldsymbol \beta, \beta_0, \epsilon_1, \dots, \epsilon_n}{\text{max}} \quad & M \\ \text{s.t.} \quad & \lVert \boldsymbol \beta\rVert = 1, \\ \quad & y_i(\mathbf{x}_i' \boldsymbol \beta+ \beta_0) \geq M(1 - \epsilon_i), \\ \quad & \epsilon_i \ge 0, \sum_{i=1}^n\epsilon_i \le B, \,\, i = 1, \ldots, n, \end{align}\] where \(B > 0\) is a tuning parameter.
\(\epsilon_1, \dots, \epsilon_n\) are slack variables that allow individual points to be on the wrong side of the margin or the hyperplane.
The \(i\)th point is on the correct side of the margin if \(\epsilon_i = 0\), on the wrong side of the margin if \(\epsilon_i > 0\), and on the wrong side of the hyperplane if \(\epsilon_i > 1\).
Warning
The argument cost in e1071::svm() is the \(C\) defined in the primal form \[\begin{align}
\underset{\boldsymbol \beta, \beta_0}{\text{min}} \quad & \frac{1}{2}\lVert \boldsymbol \beta\rVert^2 + C \sum_{i=1}^n \epsilon_i \\
\text{s.t} \quad & y_i (\mathbf{x}_i' \boldsymbol \beta+ \beta_0) \geq (1 - \epsilon_i), \\
\text{} \quad & \epsilon_i \geq 0, \,\, i = 1, \ldots, n,
\end{align}\]
so a small cost \(C\) corresponds to a large budget \(B\).
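A minimal sketch of how cost behaves in e1071::svm(), using simulated data (the data set, seed, and cost values below are chosen only for illustration):

```r
library(e1071)

# Simulated two-class data (hypothetical; for illustration only)
set.seed(6250)
x <- matrix(rnorm(40 * 2), ncol = 2)
y <- rep(c(-1, 1), each = 20)
x[y == 1, ] <- x[y == 1, ] + 1.5               # shift one class so overlap is small
dat <- data.frame(x1 = x[, 1], x2 = x[, 2], y = factor(y))

# Small cost C (large budget B): wide margin, many support vectors
fit_small_cost <- svm(y ~ ., data = dat, kernel = "linear", cost = 0.01, scale = FALSE)
# Large cost C (small budget B): narrow margin, fewer support vectors
fit_large_cost <- svm(y ~ ., data = dat, kernel = "linear", cost = 100, scale = FALSE)

c(small_cost = length(fit_small_cost$index),   # number of support vectors
  large_cost = length(fit_large_cost$index))
```

With a small cost the budget for violations is large, so the margin is wide and many points become support vectors; with a large cost the margin narrows and fewer points determine the fit.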
Note
The SVM decision rule is based only on a subset of the training data (the support vectors), so it is robust to the behavior of observations far away from the hyperplane.
LDA, in contrast, depends on the mean of all of the observations within each class, as well as the within-class covariance matrix computed using all of the data.
Logistic regression, unlike LDA, is also insensitive to observations far from the decision boundary.
The soft margin classifier is a natural approach for classification in the two-class setting, if the boundary between the two classes is linear.
In practice we are often faced with non-linear class boundaries.
In regression, we enlarge the feature space using functions of the predictors to address this non-linearity.
In SVM (logistic regression too!), we could address non-linear boundaries by enlarging the feature space.
For example, rather than fitting a support vector classifier using \(p\) features, \(X_1, \dots, X_p\), we could instead fit a support vector classifier using \(2p\) features \(X_1,X_1^2,X_2,X_2^2, \dots , X_p, X_p^2\).
\[\begin{align} \underset{\beta_0, \beta_{11}, \beta_{12}, \dots, \beta_{p1}, \beta_{p2}, \epsilon_1, \dots, \epsilon_n}{\text{max}} \quad & M \\ \text{s.t.} \quad & y_i\left(\beta_0 + \sum_{j = 1}^p \beta_{j1}x_{ij} + \sum_{j = 1}^p \beta_{j2}x_{ij}^2\right) \geq M(1 - \epsilon_i), \\ \quad & \epsilon_i \ge 0, \sum_{i=1}^n\epsilon_i \le B, \,\, i = 1, \ldots, n,\\ \quad & \sum_{j=1}^p\sum_{k=1}^2\beta_{jk}^2 = 1. \end{align}\]
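A minimal sketch of this idea in R: create the squared features by hand, then fit an ordinary linear support vector classifier with e1071::svm() on the enlarged feature set (the data and parameter values are hypothetical):

```r
library(e1071)

# Simulated data with a circular (non-linear) class boundary; purely illustrative
set.seed(6250)
x1 <- rnorm(100); x2 <- rnorm(100)
y  <- factor(ifelse(x1^2 + x2^2 > 1.5, 1, -1))
dat <- data.frame(x1, x2, y)

# Enlarge the feature space by hand, then fit a *linear* support vector classifier
dat2 <- transform(dat, x1sq = x1^2, x2sq = x2^2)
fit_quad <- svm(y ~ x1 + x1sq + x2 + x2sq, data = dat2, kernel = "linear", cost = 1)
table(predicted = predict(fit_quad, dat2), truth = dat2$y)
```

The fitted boundary is linear in the enlarged space \((X_1, X_1^2, X_2, X_2^2)\) but quadratic in the original \((X_1, X_2)\) space.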
The solution to the support vector classifier optimization involves only the inner products of the observations: \(\langle \mathbf{x}_i, \mathbf{x}_{i'} \rangle = \sum_{j=1}^px_{ij}x_{i'j}\)
The linear support vector classifier can be represented as \[f(\mathbf{x}) = \beta_0 + \sum_{i\in \mathcal{S}}\alpha_i\langle \mathbf{x}, \mathbf{x}_{i} \rangle\] where \(\mathcal{S}\) is the collection of indices of the support points.
The coefficient \(\alpha_i\) is nonzero only for the support vectors in the solution.
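A sketch of inspecting this representation with e1071::svm() on simulated data. It assumes a linear kernel with scale = FALSE so the coefficients stay on the original scale; in e1071 (as I read its conventions) the fitted object stores the support vectors in SV, the corresponding coefficients in coefs, and the negative intercept in rho:

```r
library(e1071)

# Simulated, nearly separable data; scale = FALSE keeps everything on the original scale
set.seed(6250)
x <- matrix(rnorm(40 * 2), ncol = 2)
y <- rep(c(-1, 1), each = 20)
x[y == 1, ] <- x[y == 1, ] + 2
dat <- data.frame(x1 = x[, 1], x2 = x[, 2], y = factor(y))
fit <- svm(y ~ ., data = dat, kernel = "linear", cost = 10, scale = FALSE)

fit$index                                # indices of the support vectors (the set S)
w <- t(fit$SV) %*% fit$coefs             # beta: weighted sum over the support vectors only
b <- -fit$rho                            # beta0
f_manual <- as.matrix(dat[, c("x1", "x2")]) %*% w + b
f_e1071  <- attributes(predict(fit, dat, decision.values = TRUE))$decision.values
head(cbind(f_manual, f_e1071))           # the two columns should agree (possibly up to a sign flip)
```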
The support vector machine (SVM) is an extension of the support vector classifier that results from enlarging the feature space in a specific way, using kernels.
The kernel approach is a computationally efficient way to enlarge the feature space and accommodate non-linear boundaries.
Kernel Trick: \[f(\mathbf{x}) = \beta_0 + \sum_{i\in \mathcal{S}}\alpha_i K\left(\mathbf{x}, \mathbf{x}_{i}\right) \]
Linear kernel: \(K\left(\mathbf{x}_0, \mathbf{x}_{i}\right) = \langle \mathbf{x}_0, \mathbf{x}_{i} \rangle = \sum_{j=1}^px_{0j}x_{ij}\)
Polynomial kernel: \(K\left(\mathbf{x}_0, \mathbf{x}_{i}\right) = \left(1 + \sum_{j=1}^px_{0j}x_{ij}\right)^d\)
Radial kernel: \(K\left(\mathbf{x}_0, \mathbf{x}_{i}\right) = \exp \left(-\gamma\sum_{j=1}^p (x_{0j}-x_{ij})^2 \right)\)
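A brief sketch of fitting these kernels with e1071::svm() on simulated data. Note that gamma = 1 and coef0 = 1 are set for the polynomial kernel so that it matches the \(\left(1 + \langle \mathbf{x}_0, \mathbf{x}_i\rangle\right)^d\) form above, since e1071 parameterizes the polynomial kernel slightly differently by default:

```r
library(e1071)

# Simulated data with a circular boundary, where a linear kernel struggles
set.seed(6250)
x1 <- rnorm(200); x2 <- rnorm(200)
y  <- factor(ifelse(x1^2 + x2^2 > 1.5, 1, -1))
dat <- data.frame(x1, x2, y)

# Polynomial kernel; gamma = 1 and coef0 = 1 give (1 + <x0, xi>)^d as above
fit_poly   <- svm(y ~ ., data = dat, kernel = "polynomial", degree = 2,
                  gamma = 1, coef0 = 1, cost = 1)
# Radial kernel with K(x0, xi) = exp(-gamma * ||x0 - xi||^2)
fit_radial <- svm(y ~ ., data = dat, kernel = "radial", gamma = 1, cost = 1)

table(predicted = predict(fit_radial, dat), truth = dat$y)
```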
The soft margin classifier problem \[\begin{align} \underset{\boldsymbol \beta, \beta_0, \epsilon_1, \dots, \epsilon_n}{\text{max}} \quad & M \\ \text{s.t.} \quad & \lVert \boldsymbol \beta\rVert = 1, \\ \quad & y_i(\mathbf{x}_i' \boldsymbol \beta+ \beta_0) \geq M(1 - \epsilon_i), \\ \quad & \epsilon_i \ge 0, \sum_{i=1}^n\epsilon_i \le B, \,\, i = 1, \ldots, n, \end{align}\] can be rewritten equivalently as
\[\begin{align} \underset{\boldsymbol \beta, \beta_0}{\text{min}} \left\{ \sum_{i=1}^n \max \left[ 0, 1 - y_i (\mathbf{x}_i' \boldsymbol \beta+ \beta_0) \right] + \lambda \lVert \boldsymbol \beta\rVert ^ 2 \right\} \end{align}\] where \(\sum_{i=1}^n \max \left[ 0, 1 - y_i (\mathbf{x}_i' \boldsymbol \beta+ \beta_0) \right]\) is known as the hinge loss.
Large \(\lambda\) (large \(B\)): small \(\beta_j\)'s, high bias and low variance.
Small \(\lambda\) (small \(B\)): low bias and high variance.
The hinge loss is zero for observations for which \(y_i (\mathbf{x}_i' \boldsymbol \beta+ \beta_0) \ge 1\) (correct side of the margin).
The logistic loss, in contrast, is never exactly zero.
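A small sketch comparing the two losses as functions of the margin \(y f(\mathbf{x})\):

```r
# Hinge loss vs. logistic loss as functions of the margin u = y * f(x)
u        <- seq(-3, 3, by = 0.01)
hinge    <- pmax(0, 1 - u)           # exactly zero whenever u >= 1
logistic <- log(1 + exp(-u))         # strictly positive everywhere
plot(u, hinge, type = "l", xlab = "y f(x)", ylab = "loss")
lines(u, logistic, lty = 2)
legend("topright", legend = c("hinge", "logistic"), lty = 1:2)
```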
SVM tends to perform better when the classes are well separated.
Logistic regression is often preferred when the classes overlap more.
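In practice the tuning parameters (e.g., cost, and gamma for a radial kernel) are typically chosen by cross-validation. A minimal sketch with e1071::tune() on simulated data; the grid values are arbitrary:

```r
library(e1071)

# Simulated data with a non-linear boundary; purely illustrative
set.seed(6250)
x1 <- rnorm(200); x2 <- rnorm(200)
y  <- factor(ifelse(x1^2 + x2^2 > 1.5, 1, -1))
dat <- data.frame(x1, x2, y)

# 10-fold cross-validation (the tune() default) over a grid of cost and gamma
tune_out <- tune(svm, y ~ ., data = dat, kernel = "radial",
                 ranges = list(cost = c(0.1, 1, 10, 100), gamma = c(0.5, 1, 2)))
tune_out$best.parameters      # the (cost, gamma) pair with the smallest CV error
tune_out$best.model           # the SVM refit using the best parameters
```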