Classification
We may have a classification problem when the response $Y$ is qualitative, taking values in an unordered set $\mathcal{C} = \{c_1, \dots, c_K\}$. Our main goal is to:
- build a classifier $C(x)$ that assigns a future observation with features $x$ to a class label in $\mathcal{C}$. To obtain this function, in analogy with minimizing the MSE in regression, we minimize the expected error rate $\mathbb{E}\left[\mathbb{1}\{Y \neq C(X)\}\right]$.
The best classifier is known as the Bayes classifier, denoted $C^*(x)$, and it is the classifier we aim for. It is defined by $C^*(x) = j$ if $\Pr(Y = j \mid X = x) = \max_k \Pr(Y = k \mid X = x)$, i.e., it assigns $x$ to the most likely class given $X = x$. It always exists in classification problems.
- The error rate of $C^*$, $\mathbb{E}\left[\mathbb{1}\{Y \neq C^*(X)\}\right]$, is the smallest expected error rate among all classifiers and equals the irreducible error (also known as the Bayes error rate).
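As a quick illustration (a toy setup of my own, not from these notes): when the class-conditional densities are known, the Bayes classifier can be written down explicitly and its error rate estimated by simulation.

```python
# Toy Bayes classifier: two equal-prior classes with known 1-D Gaussian
# class-conditional densities X | Y=0 ~ N(0,1) and X | Y=1 ~ N(2,1).
import numpy as np
from scipy.stats import norm

def posterior_y1(x):
    p0, p1 = norm.pdf(x, 0, 1), norm.pdf(x, 2, 1)
    return p1 / (p0 + p1)                 # Pr(Y = 1 | X = x) by Bayes' rule

def bayes_classifier(x):
    return (posterior_y1(x) > 0.5).astype(int)   # pick the most likely class

# Monte Carlo estimate of the Bayes error rate (the irreducible error).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=100_000)
x = rng.normal(2 * y, 1)                  # draw X from its class-conditional density
print(np.mean(bayes_classifier(x) != y))  # ≈ Phi(-1) ≈ 0.159
```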
The reason why we can't use the common regression approach (i.e., OLS) to do classification is that for a given feature $x$ we get a fitted value $\hat{y} = \hat{\beta}_0 + \hat{\beta}^\top x$, where the coefficients come from the regression estimation; such fitted values can land outside $[0, 1]$, so they cannot be interpreted as class probabilities, which is not what we want.
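A minimal sketch of this failure mode (toy data assumed here): OLS fitted to 0/1 labels happily produces fitted values outside $[0, 1]$.

```python
# OLS on 0/1 labels: the fitted values are not probabilities.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 10.0])   # one extreme observation
y = np.array([0, 0, 1, 1, 1])              # binary labels coded as 0/1

X = np.column_stack([np.ones_like(x), x])  # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(X @ beta)                            # the fit at x = 10 exceeds 1
```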
There are many classification methods, such as logistic regression, discriminant analysis, KNN, SVM, decision trees, random forests, boosting, and neural networks.
For a binary classification problem with labels $y_i \in \{-1, +1\}$, we have the linear decision boundary $\{x : \beta_0 + \beta^\top x = 0\}$ for some weights $\beta$ and intercept $\beta_0$.
- A good decision boundary satisfies $\beta_0 + \beta^\top x_i > 0$ when $y_i = +1$ and $\beta_0 + \beta^\top x_i < 0$ when $y_i = -1$, i.e., $y_i(\beta_0 + \beta^\top x_i) > 0$ for all $i$.
- We estimate $\beta_0$ and $\beta$ by minimizing $-\sum_{i \in \mathcal{M}} y_i(\beta_0 + \beta^\top x_i)$, where $\mathcal{M}$ is the set of misclassified points (the perceptron criterion).
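A minimal perceptron-style sketch of this minimization (the data, learning rate, and iteration count are illustrative assumptions): whenever a point is misclassified, take a gradient step on $-y_i(\beta_0 + \beta^\top x_i)$.

```python
# Perceptron updates: stochastic gradient steps on -sum_{i in M} y_i (b0 + b^T x_i).
import numpy as np

rng = np.random.default_rng(1)
X = np.r_[rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))]  # separable toy data
y = np.r_[-np.ones(20), np.ones(20)]       # labels in {-1, +1}

beta, beta0, lr = np.zeros(2), 0.0, 0.1
for _ in range(100):                       # passes over the data
    for xi, yi in zip(X, y):
        if yi * (beta0 + xi @ beta) <= 0:  # xi is currently misclassified
            beta = beta + lr * yi * xi     # step along the negative gradient
            beta0 = beta0 + lr * yi
print("training error:", np.mean(np.sign(X @ beta + beta0) != y))
```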
When the data are separable (possibly in a higher-dimensional feature space), we may have multiple solutions for $\beta_0$ and $\beta$. This motivates the optimal separating hyperplane: the hyperplane that separates the two classes and maximizes the distance to the closest point from either class.
- The decision hyperplane is orthogonal to $\beta$: for any two points $x_1, x_2$ on the hyperplane, $\beta^\top(x_1 - x_2) = 0$.
- We define $\beta^* = \beta / \lVert\beta\rVert$, a unit vector pointing in the same direction as $\beta$; any scaling of $(\beta_0, \beta)$ that describes the same hyperplane produces the same $\beta^*$.
- For any point $x$, there exists an $x_0$ on the hyperplane which is the closest point to $x$. If we project $x - x_0$ onto $\beta^*$, we get the signed distance of $x$ to the hyperplane: $\beta^{*\top}(x - x_0) = \frac{1}{\lVert\beta\rVert}(\beta^\top x + \beta_0)$, since $\beta^\top x_0 = -\beta_0$.
- Then we have the margin constraints $\frac{y_i(\beta_0 + \beta^\top x_i)}{\lVert\beta\rVert} \geq M$ for all $i$, where $y_i = +1$ if $x_i$ belongs to the first class and $y_i = -1$ if it belongs to the second.
- We want this margin to be as large as possible so that we have a good classifier (no training point falls inside the margin). The margin $M$ is the distance from the hyperplane to the closest point, so the separating slab has total width $2M$.
- Maximizing $M$ is the same as minimizing $\lVert\beta\rVert$: W.L.O.G. we can finally set $M = 1/\lVert\beta\rVert$, so the problem becomes $\min_{\beta_0, \beta} \frac{1}{2}\lVert\beta\rVert^2$ subject to $y_i(\beta_0 + \beta^\top x_i) \geq 1$ for all $i$.
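A short sketch of this optimization via scikit-learn (my assumption: a linear `SVC` with a very large `C` approximates the hard-margin problem on separable toy data); it checks that the closest training point sits at distance $M = 1/\lVert\beta\rVert$ from the fitted hyperplane.

```python
# Hard-margin SVM: min (1/2)||beta||^2  s.t.  y_i (b0 + beta^T x_i) >= 1.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.r_[rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))]
y = np.r_[-np.ones(20), np.ones(20)]

svc = SVC(kernel="linear", C=1e6).fit(X, y)   # huge C ≈ hard margin
beta, beta0 = svc.coef_.ravel(), svc.intercept_[0]

M = 1 / np.linalg.norm(beta)                  # margin implied by the fit
dists = np.abs(X @ beta + beta0) / np.linalg.norm(beta)
print(f"M = {M:.3f}, closest point at distance {dists.min():.3f}")  # ≈ equal
```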
Comparison of Classification Methods
- SVM is more similar to LR than to LDA
- SVM does not estimate the class probabilities $\Pr(Y = k \mid X = x)$, but LDA and LR do
- When classes are (nearly) separable, SVM and LDA perform better than LR
- When classes are non-separable, LR (with a ridge penalty) and SVM are very similar (see the sketch after this list)
- the LR log-odds $\log\frac{\Pr(Y=1 \mid x)}{\Pr(Y=0 \mid x)}$ is linear in $x$, so a ridge penalty can be added just as in ridge regression
- When Gaussianity can be justified (i.e., the normality assumption is true), LDA has the best performance
- SVM and LR are less commonly used for multi-class classification
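To illustrate the non-separable point above (an assumed toy setup with overlapping Gaussian classes): ridge-penalized LR and a linear SVM recover nearly the same boundary direction; in scikit-learn, `C` is the inverse of the ridge penalty $\lambda$.

```python
# Ridge-penalized LR vs. linear SVM on overlapping (non-separable) classes.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

rng = np.random.default_rng(3)
X = np.r_[rng.normal(-1, 1.5, (200, 2)), rng.normal(1, 1.5, (200, 2))]
y = np.r_[np.zeros(200), np.ones(200)]

lr = LogisticRegression(penalty="l2", C=1.0).fit(X, y)   # ridge-penalized log-odds
svm = LinearSVC(C=1.0, max_iter=10_000).fit(X, y)

# Compare the directions of the two decision-boundary normals.
w_lr = lr.coef_.ravel() / np.linalg.norm(lr.coef_)
w_svm = svm.coef_.ravel() / np.linalg.norm(svm.coef_)
print("cosine similarity of boundary normals:", w_lr @ w_svm)
```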