Discriminant analysis parametrizes the distribution of $X \mid Y = 1$ and $X \mid Y = 0$.
The logistic regression model can be unstable when the classes are well separated; discriminant analysis is more stable in that setting.
Suppose we have $K$ classes, $C = \{0, 1, 2, \ldots, K-1\}$. Let $\pi_k = P[Y = k]$ be the prior probability that a randomly chosen observation comes from class $k$, and let $f_k(x) = P(X = x \mid Y = k)$ be the density function of $X$ for class $k$.
By Bayes' theorem, we have the posterior probability $$p_k(x) := P(Y = k \mid X = x) = \frac{f_k(x)\,\pi_k}{\sum_{\ell \in C} f_\ell(x)\,\pi_\ell},$$ which is the probability that an observation belongs to the $k$th class given its features.
For a new point $x$ we classify it with the Bayes classifier: $$\arg\max_{k \in C} p_k(x) = \arg\max_{k \in C} \frac{f_k(x)\,\pi_k}{\sum_{\ell \in C} f_\ell(x)\,\pi_\ell} = \arg\max_{k \in C} f_k(x)\,\pi_k,$$ since the denominator does not depend on $k$.
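A minimal sketch of this rule in Python (the priors, class densities, and test point below are hypothetical, chosen only to make the example runnable):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical setup: K = 2 classes with known priors pi_k and densities f_k.
priors = np.array([0.3, 0.7])
densities = [norm(loc=-1.0, scale=1.0), norm(loc=2.0, scale=1.0)]

def bayes_classify(x):
    # argmax_k f_k(x) * pi_k; the shared denominator can be dropped.
    scores = np.array([pi * f.pdf(x) for pi, f in zip(priors, densities)])
    return int(np.argmax(scores))

print(bayes_classify(0.2))  # class with the largest f_k(x) * pi_k
```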
Linear Discriminant Analysis (LDA) assumes that the distribution of $X \mid Y = k$ is normal for each class; taking $X$ one-dimensional for now, every class has the same standard deviation, $\sigma_0 = \sigma_1 = \cdots = \sigma_{K-1} = \sigma$, but the means $\mu_k$ may differ.
Under these assumptions the posterior probability becomes $$p_k(x) = \frac{\frac{1}{\sqrt{2\pi}\,\sigma}\,\pi_k \exp\!\left(-\frac{1}{2\sigma^2}(x-\mu_k)^2\right)}{\sum_{\ell \in C} \frac{1}{\sqrt{2\pi}\,\sigma}\,\pi_\ell \exp\!\left(-\frac{1}{2\sigma^2}(x-\mu_\ell)^2\right)} = \frac{\pi_k \exp\!\left(-\frac{1}{2\sigma^2}(x-\mu_k)^2\right)}{\sum_{\ell \in C} \pi_\ell \exp\!\left(-\frac{1}{2\sigma^2}(x-\mu_\ell)^2\right)}.$$
The Bayes rule classifies $X = x$ to $\arg\max_{k \in C} p_k(x) = \arg\max_{k \in C} \log(\pi_k f_k(x))$; expanding the Gaussian log-density and dropping the terms that do not depend on $k$, this equals $\arg\max_{k \in C} \left[\frac{\mu_k}{\sigma^2}x - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)\right]$. We define $\delta_k(x) = \frac{\mu_k}{\sigma^2}x - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)$, which is a linear function of $x$, and call $\delta_k$ the discriminant function. We call the Bayes rule $\arg\max_{k \in C} \delta_k(x)$ the linear discriminant rule.
Estimation under LDA from the training data:
- $n_k = \sum_{i=1}^{n} I(y_i = k)$ is the number of observations in class $k$;
- $\hat\pi_k = \frac{n_k}{n}$;
- $\hat\mu_k = \frac{1}{n_k}\sum_{i=1}^{n} I(y_i = k)\, x_i$;
- $\hat\sigma^2 = \frac{1}{n - K}\sum_{k \in C}\sum_{i=1}^{n} I(y_i = k)(x_i - \hat\mu_k)^2$;
- $\hat\delta_k(x) = \frac{\hat\mu_k}{\hat\sigma^2}x - \frac{\hat\mu_k^2}{2\hat\sigma^2} + \log(\hat\pi_k)$.
The LDA classifier assigns $x$ to the class with the largest $\hat\delta_k(x)$.
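A minimal NumPy sketch of these plug-in estimates; the one-dimensional training data below is synthetic and hypothetical, used only for illustration:

```python
import numpy as np

# Hypothetical 1-D training data with labels in C = {0, 1}.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1, 1, 50), rng.normal(2, 1, 50)])
y = np.concatenate([np.zeros(50, int), np.ones(50, int)])
n, K = len(x), 2

n_k = np.array([np.sum(y == k) for k in range(K)])       # class counts
pi_hat = n_k / n                                         # prior estimates
mu_hat = np.array([x[y == k].mean() for k in range(K)])  # class means
# Pooled variance estimate with n - K degrees of freedom.
sigma2_hat = sum(((x[y == k] - mu_hat[k]) ** 2).sum() for k in range(K)) / (n - K)

def delta_hat(x0):
    # Linear discriminant function evaluated for every class at once.
    return mu_hat / sigma2_hat * x0 - mu_hat**2 / (2 * sigma2_hat) + np.log(pi_hat)

print(np.argmax(delta_hat(0.3)))  # LDA class assignment for a new point
```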
For the binary case, if the two classes have equal prior probabilities, the Bayes decision boundary is $x = \frac{\mu_0 + \mu_1}{2}$.
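To see why (a short check using the $\delta_k$ defined above): with $\pi_0 = \pi_1$ the $\log \pi_k$ terms cancel, and, assuming $\mu_0 \neq \mu_1$,

$$\delta_0(x) = \delta_1(x) \;\Longleftrightarrow\; \frac{\mu_0}{\sigma^2}x - \frac{\mu_0^2}{2\sigma^2} = \frac{\mu_1}{\sigma^2}x - \frac{\mu_1^2}{2\sigma^2} \;\Longleftrightarrow\; x(\mu_1 - \mu_0) = \frac{\mu_1^2 - \mu_0^2}{2} \;\Longleftrightarrow\; x = \frac{\mu_0 + \mu_1}{2}.$$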
Recall the multivariate normal density: $$f(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right).$$ With a common covariance matrix $\Sigma$ across classes, the same derivation gives the discriminant function $$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \log(\pi_k).$$
The Bayes decision boundaries are the sets of $x$ for which $\delta_k(x) = \delta_\ell(x)$ for $k \neq \ell$, which are linear hyperplanes in $x$.
A linear hyperplane is a set of the form $\{x \in \mathbb{R}^p : w^T x + b = 0\}$.
Estimation under multivariate LDA from the training data mirrors the one-dimensional case: $n_k$, $\hat\pi_k$, and $\hat\mu_k$ are as before (with $\hat\mu_k$ now a mean vector), and the pooled covariance is estimated by $\hat\Sigma = \frac{1}{n - K}\sum_{k \in C}\sum_{i=1}^{n} I(y_i = k)(x_i - \hat\mu_k)(x_i - \hat\mu_k)^T$.
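In practice this is a one-liner with scikit-learn; a usage sketch on hypothetical synthetic data (`LinearDiscriminantAnalysis` classifies by the largest estimated discriminant):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical 2-D data: two Gaussian classes sharing the identity covariance.
rng = np.random.default_rng(0)
X = np.vstack([rng.multivariate_normal([0, 0], np.eye(2), size=100),
               rng.multivariate_normal([2, 2], np.eye(2), size=100)])
y = np.array([0] * 100 + [1] * 100)

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.predict([[1.0, 1.0]]))        # class with the largest discriminant
print(lda.predict_proba([[1.0, 1.0]]))  # estimated posteriors p_k(x)
```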
For the binary case, the log-odds of the posterior has a linear form: $$\log\!\left(\frac{p_1(x)}{1 - p_1(x)}\right) = \log\!\left(\frac{p_1(x)}{p_0(x)}\right) = c_0 + c_1 x_1 + \cdots + c_p x_p,$$ which is the same form as logistic regression.
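The coefficients can be read off directly: because both classes share the covariance $\Sigma$, the quadratic terms $-\frac{1}{2}x^T \Sigma^{-1} x$ cancel in the log-odds, leaving

$$\log\frac{p_1(x)}{p_0(x)} = \log\frac{\pi_1}{\pi_0} - \frac{1}{2}(\mu_1 + \mu_0)^T \Sigma^{-1} (\mu_1 - \mu_0) + x^T \Sigma^{-1}(\mu_1 - \mu_0),$$

so $c_0 = \log(\pi_1/\pi_0) - \frac{1}{2}(\mu_1 + \mu_0)^T \Sigma^{-1}(\mu_1 - \mu_0)$ and $(c_1, \ldots, c_p)^T = \Sigma^{-1}(\mu_1 - \mu_0)$.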
LDA uses the full likelihood based on $P(X, Y)$ (known as generative learning), while logistic regression uses the conditional likelihood based on $P(Y \mid X)$ (known as discriminative learning).
If the classes are well separated, logistic regression is not advocated, since its parameter estimates become unstable.
Unlike LDA, QDA (Quadratic Discriminant Analysis) assumes that $X \mid Y = k$ is multivariate normal with a different mean vector $\mu_k$ and covariance matrix $\Sigma_k$ for each class.
The discriminant function for QDA is: $$\delta_k(x) = -\frac{1}{2}x^T \Sigma_k^{-1} x + x^T \Sigma_k^{-1} \mu_k - \frac{1}{2}\mu_k^T \Sigma_k^{-1} \mu_k + \log(\pi_k) - \frac{1}{2}\log(|\Sigma_k|).$$
$-\frac{1}{2}x^T \Sigma_k^{-1} x$ and $-\frac{1}{2}\log(|\Sigma_k|)$ are the two terms that do not appear in LDA; the first makes the decision boundaries quadratic in $x$.
Estimation under QDA from the training data: $n_k$, $\hat\pi_k$, and $\hat\mu_k$ are as before, plus a separate covariance matrix for each class, $\hat\Sigma_k = \frac{1}{n_k - 1}\sum_{i=1}^{n} I(y_i = k)(x_i - \hat\mu_k)(x_i - \hat\mu_k)^T$.
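A minimal NumPy sketch of these per-class estimates and the quadratic discriminant (the synthetic data and all names here are hypothetical):

```python
import numpy as np

# Hypothetical 2-D data: two Gaussian classes with different covariances.
rng = np.random.default_rng(1)
X = np.vstack([rng.multivariate_normal([0, 0], [[1, 0], [0, 1]], size=100),
               rng.multivariate_normal([3, 3], [[2, 0.5], [0.5, 1]], size=100)])
y = np.array([0] * 100 + [1] * 100)
K = 2

pi_hat = np.array([np.mean(y == k) for k in range(K)])
mu_hat = [X[y == k].mean(axis=0) for k in range(K)]
# Per-class sample covariance -- the key difference from LDA's pooled estimate.
Sigma_hat = [np.cov(X[y == k], rowvar=False) for k in range(K)]

def delta_k(x0, k):
    # Quadratic discriminant: keeps the x^T Sigma_k^{-1} x and log|Sigma_k| terms.
    inv = np.linalg.inv(Sigma_hat[k])
    return (-0.5 * x0 @ inv @ x0 + x0 @ inv @ mu_hat[k]
            - 0.5 * mu_hat[k] @ inv @ mu_hat[k]
            + np.log(pi_hat[k]) - 0.5 * np.log(np.linalg.det(Sigma_hat[k])))

x0 = np.array([1.5, 1.5])
print(np.argmax([delta_k(x0, k) for k in range(K)]))  # QDA class assignment
```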