
Support Vector Machine

Training points for which the constraint $y_i(x_i^Tw + b) \ge M$ holds with equality are called support vectors. The Support Vector Machine (SVM) is a classifier that finds the optimal hyperplane separating the classes; SVM-like algorithms are therefore often called max-margin or large-margin methods. The primal formulation is convex (in fact a quadratic program), so we can solve it with GD/SGD, but it is more commonly solved via the dual formulation.

  • Primal problem is a minimization problem
  • Dual problem is a maximization problem

When the data is linearly separable and we do not allow any misclassifications, we use an SVM with a hard margin. When a linear boundary is not feasible, or we are willing to tolerate some misclassifications in the hope of better generalization, we opt for a soft margin.
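As a quick illustration, here is a minimal sketch (assuming scikit-learn and synthetic two-class Gaussian data, both illustrative choices): a very large penalty on margin violations behaves like a hard margin, while a smaller penalty tolerates some slack; SVC's C parameter plays exactly this role.

```python
# Minimal sketch: hard vs. soft margin via scikit-learn's SVC (linear kernel).
# A very large C barely tolerates slack, approximating the hard margin;
# a smaller C allows margin violations in exchange for a wider margin.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

hard_like = SVC(kernel="linear", C=1e6).fit(X, y)   # ~ hard margin
soft = SVC(kernel="linear", C=0.1).fit(X, y)        # soft margin

print("support vectors per class (hard-like):", hard_like.n_support_)
print("support vectors per class (soft):     ", soft.n_support_)
```

The soft version typically keeps more support vectors, since more points are allowed to sit on or inside the margin.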

Hard Margin

We can write the Lagrangian function $L(w, b, \alpha) = \|w\|_2^2 + \sum_{i=1}^n \alpha_i[1-y_i(x_i^Tw + b)]$ where $\alpha_i \ge 0$ and the primal constraints are $y_i(x_i^Tw + b) \ge 1$. Taking derivatives with respect to $w$ and $b$ and setting them to 0 gives the optimality conditions $w = \frac{1}{2}\sum_{i=1}^n \alpha_i y_i x_i$ and $\sum_{i=1}^n \alpha_i y_i = 0$. Substituting these back, we obtain the dual problem $\max\limits_{\alpha}\sum_{i=1}^n \alpha_i - \frac{1}{4}\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j x_i^Tx_j$.
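Spelling out the substitution step (under this note's $\|w\|_2^2$ scaling, which is what produces the factor $\frac{1}{2}$ in $w$ and the $\frac{1}{4}$ in the dual):

```latex
% Substituting w = (1/2) \sum_i \alpha_i y_i x_i  and  \sum_i \alpha_i y_i = 0
% back into the Lagrangian:
\begin{aligned}
L(w,b,\alpha)
  &= \|w\|_2^2 + \sum_{i=1}^n \alpha_i
     - \Big(\sum_{i=1}^n \alpha_i y_i x_i\Big)^{T} w - b \sum_{i=1}^n \alpha_i y_i \\
  &= \|w\|_2^2 + \sum_{i=1}^n \alpha_i - 2\, w^T w
   = \sum_{i=1}^n \alpha_i - \|w\|_2^2
   = \sum_{i=1}^n \alpha_i
     - \frac{1}{4} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j x_i^T x_j .
\end{aligned}
```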

The KKT conditions ensure the following relationship between the primal and dual formulations:

  • both optimal objective values are equal
  • both optimal solutions $\hat w$ and $\hat \alpha$ satisfy $\hat w = \frac{1}{2} \sum_{i=1}^n \hat \alpha_i y_i x_i$ with $\begin{cases} \hat \alpha_i \ge 0 & \text{if } y_i(\hat w^Tx_i + \hat b) = 1 \\ \hat \alpha_i = 0 & \text{if } y_i(\hat w^Tx_i + \hat b) > 1 \end{cases}$
  • the predicted label for any $x$ is $\text{sign}(\hat w^Tx + \hat b)$
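This structure can be checked empirically. The sketch below (scikit-learn on synthetic data; note that scikit-learn uses the standard $\frac{1}{2}\|w\|_2^2$ scaling, so there $\hat w = \sum_i \hat\alpha_i y_i x_i$) reconstructs $\hat w$ from the support vectors, whose dual coefficients are the only nonzero ones, and applies the sign rule:

```python
# Sketch: inspecting the dual solution of a fitted linear SVC.
# scikit-learn's dual_coef_ stores alpha_i * y_i for the support vectors only
# (alpha_i = 0 for every other training point).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (30, 2)), rng.normal(2, 1, (30, 2))])
y = np.array([-1] * 30 + [1] * 30)

clf = SVC(kernel="linear", C=10.0).fit(X, y)

# Reconstruct w_hat from the support vectors and their dual coefficients.
w_hat = (clf.dual_coef_ @ clf.support_vectors_).ravel()
b_hat = clf.intercept_[0]
assert np.allclose(w_hat, clf.coef_.ravel())

# Support vectors lie on (or violate) the margin: y_i (w^T x_i + b) <= 1.
print(y[clf.support_] * (X[clf.support_] @ w_hat + b_hat))

# The predicted label for any x is sign(w_hat^T x + b_hat).
print(np.sign(X @ w_hat + b_hat)[:5], clf.predict(X[:5]))
```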

Soft Margin

We first define slack variables $\zeta=(\zeta_1,\ldots,\zeta_n)$ and consider the primal problem $\min\limits_{w,b,\zeta_1,\ldots,\zeta_n} \|w\|_2^2$ s.t. $y_i (w^Tx_i + b) \ge 1 - \zeta_i$, $\zeta_i \ge 0$, $\sum_{i=1}^n\zeta_i \le K$.

  • Misclassification occurs if $\zeta_i > 1$.
  • $\sum_{i=1}^n\zeta_i\le K$ restricts the total number of misclassified points to fewer than $K$
  • $K=0 \implies$ hard margin problem
  • $K\ge 0$ is a tuning parameter

We can obtain an equivalent problem $\min\limits_{w,b,\zeta} \|w\|_2^2 + C\sum_{i=1}^n\zeta_i$ s.t. $y_i (w^Tx_i + b) \ge 1 - \zeta_i$, $\zeta_i \ge 0$ for some $C = C(K)$, and in turn the problem $\min\limits_{w,b} \frac{1}{n}\sum_{i=1}^n\max\{0, 1 - y_i (w^Tx_i + b)\} + \lambda \|w\|_2^2$ with $\lambda = \frac{1}{nC}$.

  • $\max\{0, 1 - y_i (w^Tx_i + b)\}$ is the hinge loss function.
  • this soft-margin SVM can thus be seen as a linear classifier with a hinge loss function and an $\ell_2$ regularization term.
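A minimal numpy sketch of this hinge-loss-plus-$\ell_2$ view, trained with plain (sub)gradient descent; the synthetic data, step size, and $\lambda$ are illustrative assumptions rather than recommended settings:

```python
# Sketch: soft-margin SVM as regularized hinge-loss minimization,
#   min_{w,b}  (1/n) sum_i max(0, 1 - y_i (w^T x_i + b)) + lam * ||w||_2^2,
# fitted with plain (sub)gradient descent on synthetic data.
import numpy as np

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1.5, 1, (100, 2)), rng.normal(1.5, 1, (100, 2))])
y = np.array([-1.0] * 100 + [1.0] * 100)

n, d = X.shape
w, b = np.zeros(d), 0.0
lam, lr = 0.01, 0.1

for _ in range(500):
    margins = y * (X @ w + b)
    active = margins < 1                  # points with nonzero hinge loss
    # Subgradient of the averaged hinge loss plus the l2 penalty.
    grad_w = -(y[active][:, None] * X[active]).sum(axis=0) / n + 2 * lam * w
    grad_b = -y[active].sum() / n
    w -= lr * grad_w
    b -= lr * grad_b

print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```

Up to the exact scaling of the penalty and the choice of optimizer, this is the same objective that, e.g., scikit-learn's SGDClassifier(loss="hinge", penalty="l2") minimizes.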

It can be shown that the dual formulation of the soft-margin SVM is $\max\limits_{\alpha}\sum_{i=1}^n \alpha_i - \frac{1}{4}\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j x_i^Tx_j$ s.t. $\sum_{i=1}^n \alpha_i y_i = 0$ and $0 \le \alpha_i \le C$ for all $i$, where $C$ is a tuning parameter.
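As a sketch of how this dual could be solved with a generic constrained optimizer (scipy's SLSQP here; real SVM libraries use dedicated solvers such as SMO-type methods), keeping this note's $\|w\|_2^2$ scaling so that $\hat w = \frac{1}{2}\sum_i \hat\alpha_i y_i x_i$; the data and $C$ are illustrative:

```python
# Sketch: the soft-margin dual solved with a generic solver (SLSQP),
#   max_alpha  sum_i alpha_i - (1/4) sum_{i,j} alpha_i alpha_j y_i y_j x_i^T x_j
#   s.t.       sum_i alpha_i y_i = 0,   0 <= alpha_i <= C,
# using this note's ||w||_2^2 scaling (so w_hat = (1/2) sum_i alpha_i y_i x_i).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1.0] * 20 + [1.0] * 20)
n, C = len(y), 1.0

Q = (y[:, None] * X) @ (y[:, None] * X).T        # Q_ij = y_i y_j x_i^T x_j

def neg_dual(alpha):                              # minimize the negated dual
    return -(alpha.sum() - 0.25 * alpha @ Q @ alpha)

res = minimize(neg_dual, x0=np.zeros(n), method="SLSQP",
               bounds=[(0.0, C)] * n,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x

w_hat = 0.5 * (alpha * y) @ X
# Any support vector with 0 < alpha_i < C sits exactly on the margin,
# so b_hat can be recovered from y_i (w^T x_i + b) = 1.
on_margin = np.flatnonzero((alpha > 1e-6) & (alpha < C - 1e-6))
b_hat = np.mean(y[on_margin] - X[on_margin] @ w_hat)
print("training accuracy:", np.mean(np.sign(X @ w_hat + b_hat) == y))
```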

  • We can map $x_i$ into a different basis to obtain a decision boundary that is non-linear in $x_i$. Denoting such a map by $h$, the dual becomes $\max\limits_{\alpha}\sum_{i=1}^n \alpha_i - \frac{1}{4}\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j h(x_i)^Th(x_j)$. We can then define $K(x_i, x_j) = h(x_i)^Th(x_j)$ for all $i, j \in \{1,\ldots,n\}$; $K$ is called a kernel and quantifies the similarity of two feature vectors.
    • The kernel trick computes all the pairwise kernel values $K(x_i, x_j)$ without ever forming $h(x_i)$ explicitly; see the sketch below.
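A tiny sketch of the kernel trick, using the degree-2 polynomial kernel in two dimensions, where the explicit feature map $h$ is still small enough to write down (the specific map below is one standard choice, stated here as an illustration):

```python
# Sketch: the kernel trick for the degree-2 polynomial kernel in 2D.
# K(x, z) = (x^T z + 1)^2 equals h(x)^T h(z) for the explicit 6-dimensional map
# h(x) = (1, sqrt(2) x1, sqrt(2) x2, x1^2, x2^2, sqrt(2) x1 x2),
# but evaluating K never requires forming h.
import numpy as np

def h(x):
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

x = np.array([0.5, -1.0])
z = np.array([2.0, 0.3])
print((x @ z + 1) ** 2, h(x) @ h(z))   # same value, computed two ways
```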

We have some kernel functions:

  • Linear kernel: $K(x_i, x_j) = x_i^Tx_j$ with the corresponding $h(x_i) = x_i$
  • $d$th-degree polynomial: $K(x_i, x_j) = (x_i^Tx_j + 1)^d$
  • Radial basis: $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|_2^2)$, where $h(x_i)$ has infinitely many dimensions.
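A short sketch comparing these kernels through scikit-learn's SVC on data with a circular decision boundary (the dataset, the degree $d = 3$, and $\gamma = 1$ are illustrative choices); the linear kernel typically struggles here while the polynomial and radial basis kernels fit the non-linear boundary:

```python
# Sketch: the three kernels above via scikit-learn's SVC.
# kernel="poly" implements (gamma * x_i^T x_j + coef0)^degree and
# kernel="rbf" implements exp(-gamma * ||x_i - x_j||^2); the settings
# below match the formulas in this note but are otherwise illustrative.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
y = np.where(np.linalg.norm(X, axis=1) < 1.0, 1, -1)   # circular boundary

for kernel, kwargs in [("linear", {}),
                       ("poly", {"degree": 3, "gamma": 1.0, "coef0": 1.0}),
                       ("rbf", {"gamma": 1.0})]:
    clf = SVC(kernel=kernel, C=1.0, **kwargs).fit(X, y)
    print(kernel, "training accuracy:", clf.score(X, y))
```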

The classifier based on the SVM is $\text{sign}(\hat w^Tx + \hat b)$, which does not estimate the posterior probability.

  • it is non-trivial to generalize the notion of a margin to multi-class classification

SVM with more than two classes

Let $C = \{1,2,\ldots,K\}$ be the set of classes. We can use the one-versus-one approach or the one-versus-all approach.

One-Versus-One (OVO) approach: construct ${K \choose 2}$ SVMs, one for each pair of classes.

  • for a pair of classes $\{i,j\}$, set $z_k = -1\{y_k = i\} + 1\{y_k = j\}$, so $z_k \in \{-1,1\}$ (using the training points $(x_k, y_k)$ with $y_k \in \{i,j\}$)
  • fit an SVM on $(x_k, z_k)$ over these points
  • combine all the pairwise predictions: for each test point $x_0$, assign it to the majority class predicted by the ${K \choose 2}$ SVMs (see the sketch below)
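A compact sketch of OVO on a synthetic three-class problem (the data and the use of scikit-learn's SVC are assumptions; sklearn.multiclass.OneVsOneClassifier automates the same bookkeeping):

```python
# Sketch: one-versus-one multi-class SVM by hand -- fit one binary SVM per
# pair of classes on that pair's points, then majority-vote at test time.
from itertools import combinations
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
centers = np.array([[-3, 0], [3, 0], [0, 3]])
X = np.vstack([rng.normal(c, 1.0, (40, 2)) for c in centers])
y = np.repeat([1, 2, 3], 40)                       # classes C = {1, 2, 3}

pairwise = {}
for i, j in combinations(np.unique(y), 2):
    mask = (y == i) | (y == j)
    z = np.where(y[mask] == i, -1, 1)              # z_k = -1 for class i, +1 for class j
    pairwise[(i, j)] = SVC(kernel="linear", C=1.0).fit(X[mask], z)

def predict_ovo(x0):
    votes = [i if clf.predict(x0[None, :])[0] == -1 else j
             for (i, j), clf in pairwise.items()]
    return np.bincount(votes).argmax()             # majority over the C(K,2) SVMs

print(predict_ovo(np.array([2.5, 0.5])))           # a point near the class-2 center
```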

One-Versus-All (OVA) approach: construct $K$ SVMs, one for each class:

  • for class $i$, set $z_k = 2 \cdot 1\{y_k = i\} - 1$ over all training points, so $z_k \in \{-1,1\}$
  • fit an SVM with $(x_k, z_k)$ and denote its parameters $(\hat b^{(i)}, \hat w^{(i)})$, where $\hat w^{(i)}$ is the weight vector and $\hat b^{(i)}$ is the bias for class $i$
  • for each test point $x_0$, assign it to $\arg\max\limits_{i} \hat b^{(i)} + x_0^T\hat w^{(i)}$ (see the sketch below)
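And the matching sketch for OVA, reusing the same synthetic three-class data assumed in the OVO sketch above:

```python
# Sketch: one-versus-all multi-class SVM -- fit K SVMs, one per class, with
# z_k = +1 for that class and -1 for the rest, then assign a test point to
# argmax_i  b_hat^(i) + x0^T w_hat^(i).
# (sklearn.multiclass.OneVsRestClassifier automates the same procedure.)
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(6)
centers = np.array([[-3, 0], [3, 0], [0, 3]])
X = np.vstack([rng.normal(c, 1.0, (40, 2)) for c in centers])
y = np.repeat([1, 2, 3], 40)

models = {i: SVC(kernel="linear", C=1.0).fit(X, np.where(y == i, 1, -1))
          for i in np.unique(y)}

def predict_ova(x0):
    # For a linear SVC, decision_function(x0) = b_hat^(i) + x0^T w_hat^(i).
    scores = {i: clf.decision_function(x0[None, :])[0] for i, clf in models.items()}
    return max(scores, key=scores.get)

print(predict_ova(np.array([0.2, 2.8])))           # a point near the class-3 center
```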