The training points whose constraints hold with equality, $y_i(x_i^T w + b) = M$, are called support vectors. The Support Vector Machine (SVM) is a classifier that finds the optimal hyperplane separating the classes; SVM-like algorithms are often called max-margin or large-margin methods. The primal formulation is convex, in fact a quadratic program, so it can be solved with GD/SGD, but it is more commonly solved via the dual formulation.
- Primal problem is a minimization problem
- Dual problem is a maximization problem
When the data is linearly separable and we do not want any misclassifications, we use an SVM with a hard margin. When a linear boundary is not feasible, or we are willing to accept some misclassifications in the hope of better generalization, we can opt for a soft margin.
Hard Margin
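For reference, the hard-margin primal in the unscaled convention used throughout these notes (the objective is $\|w\|_2^2$ rather than $\frac{1}{2}\|w\|_2^2$) can be written as:

```latex
% Hard-margin primal, unscaled convention (objective \|w\|_2^2).
\min_{w,b}\ \|w\|_2^2
\quad\text{s.t.}\quad y_i(x_i^T w + b)\ \ge\ 1,\qquad i=1,\dots,n.
```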
We can write the Lagrangian $L(w,b,\alpha)=\|w\|_2^2+\sum_{i=1}^n \alpha_i\left[1-y_i(x_i^T w+b)\right]$, where $\alpha_i\ge 0$ are the multipliers for the constraints $y_i(x_i^T w+b)\ge 1$. Setting the derivatives with respect to $w$ and $b$ to zero gives the optimality conditions $w=\frac{1}{2}\sum_{i=1}^n \alpha_i y_i x_i$ and $\sum_{i=1}^n \alpha_i y_i=0$. Substituting these back yields the dual problem $\max_{\alpha}\ \sum_{i=1}^n \alpha_i-\frac{1}{4}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j y_i y_j x_i^T x_j$.
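The $\frac{1}{2}$ and $\frac{1}{4}$ factors come from the unscaled objective; a brief derivation sketch:

```latex
% Derivation sketch for the dual, under the unscaled objective \|w\|_2^2.
\begin{aligned}
\frac{\partial L}{\partial w} &= 2w-\sum_{i=1}^n\alpha_i y_i x_i=0
  \;\Longrightarrow\; w=\tfrac{1}{2}\sum_{i=1}^n\alpha_i y_i x_i,
\qquad
\frac{\partial L}{\partial b} = -\sum_{i=1}^n\alpha_i y_i=0,\\
L &= \Big\|\tfrac{1}{2}\sum_{i}\alpha_i y_i x_i\Big\|_2^2
   +\sum_{i}\alpha_i
   -\sum_{i}\alpha_i y_i x_i^T\Big(\tfrac{1}{2}\sum_{j}\alpha_j y_j x_j\Big)
   -b\underbrace{\sum_{i}\alpha_i y_i}_{=0}
 = \sum_{i=1}^n\alpha_i-\frac{1}{4}\sum_{i=1}^n\sum_{j=1}^n\alpha_i\alpha_j y_i y_j x_i^T x_j.
\end{aligned}
```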
The KKT conditions tie the primal and dual formulations together:
- both optimal objective values are equal
- both optimal solutions $\hat w$ and $\hat\alpha$ satisfy $\hat w=\frac{1}{2}\sum_{i=1}^n \hat\alpha_i y_i x_i$, with $\hat\alpha_i\ge 0$ if $y_i(\hat w^T x_i+\hat b)=1$ and $\hat\alpha_i=0$ if $y_i(\hat w^T x_i+\hat b)>1$
- The predicted label for any $x$ is $\operatorname{sign}(\hat w^T x+\hat b)$ (see the sketch below)
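A minimal NumPy sketch of recovering $(\hat w,\hat b)$ from a dual solution $\hat\alpha$ under the convention $\hat w=\frac{1}{2}\sum_i \hat\alpha_i y_i x_i$ and predicting with $\operatorname{sign}(\hat w^T x+\hat b)$; the function names are illustrative, not from the source:

```python
import numpy as np

def primal_from_dual(X, y, alpha, tol=1e-8):
    """Recover (w_hat, b_hat) from a hard-margin dual solution alpha.

    Uses the notes' unscaled convention w_hat = 0.5 * sum_i alpha_i y_i x_i.
    X: (n, p) features, y: (n,) labels in {-1, +1}, alpha: (n,) dual solution.
    """
    w_hat = 0.5 * (alpha * y) @ X            # w_hat = 0.5 * sum_i alpha_i y_i x_i
    sv = np.flatnonzero(alpha > tol)         # support vectors have alpha_i > 0
    # For a support vector i: y_i (w_hat^T x_i + b_hat) = 1  =>  b_hat = y_i - w_hat^T x_i
    b_hat = np.mean(y[sv] - X[sv] @ w_hat)   # average over support vectors for stability
    return w_hat, b_hat

def predict(X_new, w_hat, b_hat):
    """Predicted labels sign(w_hat^T x + b_hat) for each row of X_new."""
    return np.sign(X_new @ w_hat + b_hat)
```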
Soft Margin
We first define slack variables $\zeta=(\zeta_1,\dots,\zeta_n)$ and consider the primal problem $\min_{w,b,\zeta_1,\dots,\zeta_n}\ \|w\|_2^2$ s.t. $y_i(w^T x_i+b)\ge 1-\zeta_i,\ \zeta_i\ge 0,\ \sum_{i=1}^n \zeta_i\le K$
- Misclassification occurs if $\zeta_i>1$.
- $\sum_{i=1}^n \zeta_i\le K$ bounds the total slack, and hence the number of misclassified points, by $K$
- $K=0 \implies$ the hard-margin problem
- $K\ge 0$ is a tuning parameter
We can write an equivalent problem $\min_{w,b,\zeta}\ \|w\|_2^2+C\sum_{i=1}^n \zeta_i$ s.t. $y_i(w^T x_i+b)\ge 1-\zeta_i,\ \zeta_i\ge 0$ for some $C=C(K)$, as well as the penalized form $\min_{w,b}\left\{\frac{1}{n}\sum_{i=1}^n \max\{0,1-y_i(w^T x_i+b)\}\right\}+\lambda\|w\|_2^2$ with $\lambda=\frac{1}{nC}$
- $\max\{0,1-y_i(w^T x_i+b)\}$ is the hinge loss function.
- this soft-margin SVM can be seen as a linear classifier with a hinge loss function and an $\ell_2$ regularization term (a minimal optimization sketch follows below).
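A minimal NumPy sketch of minimizing the penalized form by full-batch subgradient descent; the fixed step size and iteration count are illustrative assumptions, not from the source:

```python
import numpy as np

def svm_hinge_subgradient(X, y, lam=0.1, lr=0.01, n_iter=1000):
    """Minimize (1/n) * sum_i max(0, 1 - y_i (w^T x_i + b)) + lam * ||w||_2^2
    by full-batch subgradient descent.

    X: (n, p) features, y: (n,) labels in {-1, +1}.
    """
    n, p = X.shape
    w = np.zeros(p)
    b = 0.0
    for _ in range(n_iter):
        margins = y * (X @ w + b)
        active = margins < 1                            # points with nonzero hinge loss
        # Subgradient of the averaged hinge loss plus gradient of lam * ||w||_2^2
        grad_w = -(y[active] @ X[active]) / n + 2 * lam * w
        grad_b = -np.sum(y[active]) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```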
It can be shown that the dual formulation of the soft-margin SVM is $\max_\alpha\ \sum_{i=1}^n \alpha_i-\frac{1}{4}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j y_i y_j x_i^T x_j$ s.t. $\sum_{i=1}^n \alpha_i y_i=0$ and $0\le\alpha_i\le C$ for all $i$, where $C$ is a tuning parameter.
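A minimal sketch of solving this box-constrained dual numerically, assuming cvxpy is available and keeping the notes' $\frac{1}{4}$ scaling (the small ridge on the Gram matrix is a numerical convenience, not part of the formulation):

```python
import numpy as np
import cvxpy as cp

def soft_margin_dual(X, y, C=1.0):
    """Solve max_alpha sum(alpha) - (1/4) * alpha^T Q alpha
    s.t. sum_i alpha_i y_i = 0 and 0 <= alpha_i <= C,
    where Q_ij = y_i y_j x_i^T x_j (linear kernel)."""
    n = X.shape[0]
    Yx = y[:, None] * X                          # rows y_i x_i
    Q = Yx @ Yx.T + 1e-8 * np.eye(n)             # Q_ij = y_i y_j x_i^T x_j, plus a tiny ridge
    alpha = cp.Variable(n)
    objective = cp.Maximize(cp.sum(alpha) - 0.25 * cp.quad_form(alpha, Q))
    constraints = [alpha >= 0, alpha <= C, y @ alpha == 0]
    cp.Problem(objective, constraints).solve()
    return alpha.value
```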
- We can use a function that maps $x_i$ into a different basis to obtain a boundary that is non-linear in $x_i$; denote this map by $h$, so the dual becomes $\max_\alpha\ \sum_{i=1}^n \alpha_i-\frac{1}{4}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j y_i y_j h(x_i)^T h(x_j)$. The inner product $K(x_i,x_j)=h(x_i)^T h(x_j)$ for all $i,j\in\{1,\dots,n\}$ defines the kernel $K$, which quantifies the similarity of two feature vectors.
- The kernel trick computes all pairwise kernel values $K(x_i,x_j)$ directly, without ever forming $h(x_i)$ explicitly
Some common kernel functions (a small computational sketch follows the list):
- Linear kernel: $K(x_i,x_j)=x_i^T x_j$, with the corresponding $h(x_i)=x_i$
- $d$th-degree polynomial: $K(x_i,x_j)=(x_i^T x_j+1)^d$
- Radial basis: $K(x_i,x_j)=\exp(-\gamma\|x_i-x_j\|_2^2)$, where $h(x_i)$ has infinitely many dimensions
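A minimal NumPy sketch of these three kernels as Gram-matrix computations (function names are illustrative, not from the source):

```python
import numpy as np

def linear_kernel(X1, X2):
    """K(x_i, x_j) = x_i^T x_j for every row of X1 against every row of X2."""
    return X1 @ X2.T

def polynomial_kernel(X1, X2, d=3):
    """K(x_i, x_j) = (x_i^T x_j + 1)^d."""
    return (X1 @ X2.T + 1.0) ** d

def rbf_kernel(X1, X2, gamma=1.0):
    """K(x_i, x_j) = exp(-gamma * ||x_i - x_j||_2^2)."""
    sq_dists = (
        np.sum(X1**2, axis=1)[:, None]
        + np.sum(X2**2, axis=1)[None, :]
        - 2.0 * X1 @ X2.T
    )
    return np.exp(-gamma * sq_dists)
```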
The SVM classifier is $\operatorname{sign}(\hat w^T x+\hat b)$, which does not estimate posterior probabilities.
- it is non-trivial to generalize the notion of a margin to multi-class classification
SVM with more than two classes
Let $\mathcal{C}=\{1,2,\dots,K\}$ be the set of classes. We can use the one-versus-all approach or the one-versus-one approach.
One-Versus-One (OVO) approach: construct $\binom{K}{2}$ SVMs, one for each pair of classes.
- for the class pair $\{i,j\}$, set $z_k=-\mathbb{1}\{y_k=i\}+\mathbb{1}\{y_k=j\}$, so $z_k\in\{-1,1\}$ for the training points $(x_k,y_k)$ with $y_k\in\{i,j\}$
- fit an SVM with $(x_k,z_k)$ over these $k$
- combine the results: for each test point $x_0$, assign it to the class predicted most often by the $\binom{K}{2}$ classifiers (see the sketch below)
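A minimal sketch of OVO voting, assuming a hypothetical binary trainer `fit_binary_svm(X, z)` that returns a predictor mapping new points to labels in {-1, +1} (this helper is not from the source):

```python
import numpy as np
from itertools import combinations
from collections import Counter

def ovo_predict(X, y, X_new, fit_binary_svm):
    """One-versus-one: fit C(K, 2) binary SVMs and predict by majority vote.

    fit_binary_svm(X_pair, z) is assumed to return a callable that maps
    new points to labels in {-1, +1}.
    """
    classes = np.unique(y)
    votes = [Counter() for _ in range(len(X_new))]
    for i, j in combinations(classes, 2):
        mask = (y == i) | (y == j)               # keep only points from classes i and j
        z = np.where(y[mask] == j, 1, -1)        # z_k = +1 for class j, -1 for class i
        predict_pair = fit_binary_svm(X[mask], z)
        for t, p in enumerate(predict_pair(X_new)):
            votes[t][j if p == 1 else i] += 1    # translate {-1, +1} back to class labels
    return np.array([v.most_common(1)[0][0] for v in votes])
```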
One-Versus-All (OVA) approach: construct $K$ SVMs, one for each class:
- for class $i$, set $z_k=2\cdot\mathbb{1}\{y_k=i\}-1$, so $z_k\in\{-1,1\}$, using all of the training data
- fit an SVM with the $z_k$ and denote its parameters by $(\hat b^{(i)},\hat w^{(i)})$, where $\hat w^{(i)}$ is the weight vector and $\hat b^{(i)}$ is the bias for class $i$
- for each test point $x_0$, assign it to $\arg\max_i\ \hat b^{(i)}+x_0^T\hat w^{(i)}$ (see the sketch below)
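A matching sketch of OVA, again assuming a hypothetical `fit_binary_svm(X, z)`, which here is assumed to return the fitted pair `(w_hat, b_hat)` (not from the source):

```python
import numpy as np

def ova_predict(X, y, X_new, fit_binary_svm):
    """One-versus-all: fit K binary SVMs and assign each test point to
    argmax_i b_hat^(i) + x_0^T w_hat^(i).

    fit_binary_svm(X, z) is assumed to return (w_hat, b_hat) for labels z in {-1, +1}.
    """
    classes = np.unique(y)
    scores = np.empty((len(X_new), len(classes)))
    for idx, c in enumerate(classes):
        z = np.where(y == c, 1, -1)              # z_k = 2 * 1{y_k = c} - 1
        w_hat, b_hat = fit_binary_svm(X, z)
        scores[:, idx] = X_new @ w_hat + b_hat   # decision value for class c
    return classes[np.argmax(scores, axis=1)]    # class with the largest decision value
```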