
Discriminant Analysis

Discriminant analysis parametrizes the distributions of $X \mid Y = 1$ and $X \mid Y = 0$.

  • The logistic regression model may be unstable when the classes are well separated; discriminant analysis is more stable.

Suppose we have $K$ classes, $\mathcal{C} = \{0, 1, 2, \dots, K - 1\}$. Let $\pi_k = P[Y = k]$ be the prior probability that a randomly chosen observation comes from class $k$.

  • $f_k(x) = P(X = x \mid Y = k)$ is the density function of $X$ for class $k$.

By Bayes' theorem, we have the posterior probability $p_k(x) := P(Y = k \mid X = x) = \frac{f_k(x)\pi_k}{\sum_{\ell \in \mathcal{C}} f_{\ell}(x)\pi_{\ell}}$, which is the probability that an observation belongs to the $k$th class given its features.

For a new point $x$ we classify it by the Bayes classifier: $\arg\max_{k \in \mathcal{C}} p_k(x) = \arg\max_{k \in \mathcal{C}} \frac{f_k(x)\pi_k}{\sum_{\ell \in \mathcal{C}} f_{\ell}(x)\pi_{\ell}} = \arg\max_{k \in \mathcal{C}} f_k(x)\pi_k$, since the denominator does not depend on $k$.

LDA

Linear Discriminant Analysis (LDA) assumes that the distribution of $X \mid Y = k$ is normal for each class (here with a single predictor), with a common standard deviation $\sigma_0 = \sigma_1 = \dots = \sigma_{K - 1} = \sigma$ but possibly different means $\mu_k$.

Under LDA the posterior probability is $p_k(x) = \frac{\frac{\pi_k}{\sqrt{2\pi}\sigma} \exp(-\frac{1}{2\sigma^2}(x - \mu_k)^2)}{\sum_{\ell \in \mathcal{C}}\frac{\pi_{\ell}}{\sqrt{2\pi}\sigma} \exp(-\frac{1}{2\sigma^2}(x - \mu_{\ell})^2)} = \frac{\pi_k \exp(-\frac{1}{2\sigma^2}(x - \mu_k)^2)}{\sum_{\ell \in \mathcal{C}}\pi_{\ell} \exp(-\frac{1}{2\sigma^2}(x - \mu_{\ell})^2)}$

The Bayes rule classifies $X = x$ to $\arg\max\limits_{k \in \mathcal{C}} p_k(x) = \arg\max\limits_{k \in \mathcal{C}} \log(\pi_k f_k(x)) = \arg\max\limits_{k \in \mathcal{C}} \frac{\mu_k}{\sigma^2}x - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)$; the $-\frac{x^2}{2\sigma^2}$ term is common to all classes, so it can be dropped. We define $\delta_k(x) = \frac{\mu_k}{\sigma^2}x - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)$, which is a linear function of $x$.

  • We call $\delta_k$ the discriminant function.
  • We call the Bayes rule $\arg\max\limits_{k \in \mathcal{C}}\delta_k(x) = \arg\max\limits_{k \in \mathcal{C}} \frac{\mu_k}{\sigma^2}x - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)$ the linear discriminant rule.

Estimates under LDA from the training data:

  • $n_k = \sum_{i = 1}^n I(y_i = k)$ is the number of observations in class $k$;
  • $\hat{\pi}_k = \frac{n_k}{n}$;
  • $\hat{\mu}_k = \frac{1}{n_k}\sum_{i = 1}^n I(y_i = k)x_i$;
  • $\hat{\sigma}^2 = \frac{1}{n - K}\sum_{k = 1}^K\sum_{i = 1}^n I(y_i = k)(x_i - \hat{\mu}_k)^2$;
  • $\hat{\delta}_k(x) = \frac{\hat{\mu}_k}{\hat{\sigma}^2}x - \frac{\hat{\mu}_k^2}{2\hat{\sigma}^2} + \log(\hat{\pi}_k)$

The LDA classifier assigns $x$ to the class with the largest $\hat{\delta}_k(x)$.

  • For the binary case, if the prior probabilities of the two classes are equal, the Bayes decision boundary is $x = \frac{\mu_0 + \mu_1}{2}$
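
This follows directly from setting $\delta_0(x) = \delta_1(x)$ with $\pi_0 = \pi_1$: $\frac{\mu_0}{\sigma^2}x - \frac{\mu_0^2}{2\sigma^2} = \frac{\mu_1}{\sigma^2}x - \frac{\mu_1^2}{2\sigma^2} \;\Rightarrow\; (\mu_0 - \mu_1)x = \frac{\mu_0^2 - \mu_1^2}{2} \;\Rightarrow\; x = \frac{\mu_0 + \mu_1}{2}$ (assuming $\mu_0 \neq \mu_1$).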

Steps:

  1. Estimate $\pi_k$, $\mu_k$, and $\sigma$.
  2. Plug the estimates into $\delta_k$ to obtain $\hat{\delta}_k$.
  3. Classify $x$ to the class with the largest $\hat{\delta}_k(x)$ (see the sketch below).
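
A minimal numpy sketch of these three steps on simulated data; the data, class count, and variable names are illustrative assumptions, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 2
# Simulated training data: class 0 ~ N(-1, 1), class 1 ~ N(2, 1)
X = np.concatenate([rng.normal(-1, 1, 50), rng.normal(2, 1, 50)])
y = np.concatenate([np.zeros(50, dtype=int), np.ones(50, dtype=int)])
n = len(y)

# Step 1: estimate pi_k, mu_k, and the pooled sigma^2
pi_hat = np.array([np.mean(y == k) for k in range(K)])
mu_hat = np.array([X[y == k].mean() for k in range(K)])
sigma2_hat = sum(((X[y == k] - mu_hat[k]) ** 2).sum() for k in range(K)) / (n - K)

# Step 2: plug into delta_k(x) = mu_k x / sigma^2 - mu_k^2 / (2 sigma^2) + log(pi_k)
def delta(x):
    return mu_hat / sigma2_hat * x - mu_hat ** 2 / (2 * sigma2_hat) + np.log(pi_hat)

# Step 3: classify a new point to the class with the largest delta_k(x)
x_new = 0.3
print(np.argmax(delta(x_new)))
```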

LDA on Multivariate Normal (more predictors)

Recall the multivariate normal density: $f(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp(-\frac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu))$. With a common covariance matrix $\Sigma$ across classes, the discriminant function becomes $\delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log(\pi_k)$

The Bayes decision boundaries are the set of $x$ for which $\delta_k(x) = \delta_{\ell}(x)$ for $k \neq \ell$, which are linear hyperplanes in $x$.

  • A linear hyperplane in $\Re^p$ is a set of the form $\{x : w^Tx + b = 0\}$.

Estimates under LDA from the training data:

  • $n_k = \sum_{i = 1}^n I(y_i = k)$ is the number of observations in class $k$;
  • $\hat{\pi}_k = \frac{n_k}{n}$;
  • $\hat{\mu}_k = \frac{1}{n_k}\sum_{i = 1}^n I(y_i = k)x_i$;
  • $\hat{\Sigma} = \frac{1}{n - K}\sum_{k = 1}^K\sum_{i = 1}^n I(y_i = k)(x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T$;
  • $\hat{\delta}_k(x) = x^T\hat{\Sigma}^{-1}\hat{\mu}_k - \frac{1}{2}\hat{\mu}_k^T\hat{\Sigma}^{-1}\hat{\mu}_k + \log(\hat{\pi}_k)$
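
A minimal numpy sketch of these multivariate estimators, mirroring the formulas above; the simulated data and shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
K, p = 3, 2
# Simulated training data: three Gaussian classes sharing the identity covariance
means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
X = np.vstack([rng.multivariate_normal(means[k], np.eye(p), 40) for k in range(K)])
y = np.repeat(np.arange(K), 40)
n = len(y)

pi_hat = np.array([np.mean(y == k) for k in range(K)])
mu_hat = np.array([X[y == k].mean(axis=0) for k in range(K)])
# Pooled within-class covariance with the 1/(n - K) scaling
Sigma_hat = sum((X[y == k] - mu_hat[k]).T @ (X[y == k] - mu_hat[k]) for k in range(K)) / (n - K)
Sigma_inv = np.linalg.inv(Sigma_hat)

def delta(x):
    # delta_k(x) = x^T Sigma^{-1} mu_k - 1/2 mu_k^T Sigma^{-1} mu_k + log(pi_k)
    return np.array([x @ Sigma_inv @ mu_hat[k]
                     - 0.5 * mu_hat[k] @ Sigma_inv @ mu_hat[k]
                     + np.log(pi_hat[k]) for k in range(K)])

print(np.argmax(delta(np.array([2.5, 0.5]))))  # typically picks class 1
```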

For the binary case, the classification rule can be written in a linear form: $\log (\frac{p_1(x)}{1-p_1(x)}) = \log (\frac{p_1(x)}{p_0(x)}) = c_0 + c_1x_1 + \dots + c_px_p$, which has the same form as logistic regression.
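
Expanding $\delta_1(x) - \delta_0(x)$ gives the coefficients explicitly: $\log\frac{p_1(x)}{p_0(x)} = \delta_1(x) - \delta_0(x) = \log\frac{\pi_1}{\pi_0} - \frac{1}{2}(\mu_1 + \mu_0)^T\Sigma^{-1}(\mu_1 - \mu_0) + x^T\Sigma^{-1}(\mu_1 - \mu_0)$, so $c_0 = \log\frac{\pi_1}{\pi_0} - \frac{1}{2}(\mu_1 + \mu_0)^T\Sigma^{-1}(\mu_1 - \mu_0)$ and $(c_1, \dots, c_p)^T = \Sigma^{-1}(\mu_1 - \mu_0)$.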

LDA vs. Logistic Regression

  • LDA makes stronger assumptions by specifying the distribution of $X \mid Y$.
  • LDA uses the full likelihood based on $P(X, Y)$ (known as generative learning), while logistic regression uses the conditional likelihood based on $P(Y \mid X)$ (known as discriminative learning).
  • If the classes are well separated, logistic regression is unstable and is not advocated.

QDA

Unlike LDA, QDA (Quadratic Discriminant Analysis) assumes that $X \mid Y = k$ is multivariate normal with a different mean and covariance matrix for each class.

The discriminant function for QDA is: $\delta_k(x) = -\frac{1}{2}x^T\Sigma_k^{-1}x + x^T\Sigma_k^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma_k^{-1}\mu_k + \log(\pi_k) - \frac{1}{2}\log(|\Sigma_k|)$

  • $-\frac{1}{2}x^T\Sigma_k^{-1}x$ and $-\frac{1}{2}\log(|\Sigma_k|)$ are the two terms that do not appear in LDA.

Estimates under QDA from the training data:

  • $n_k = \sum_{i = 1}^n I(y_i = k)$ is the number of observations in class $k$;
  • $\hat{\pi}_k = \frac{n_k}{n}$;
  • $\hat{\mu}_k = \frac{1}{n_k}\sum_{i = 1}^n I(y_i = k)x_i$;
  • $\hat{\Sigma}_k = \frac{1}{n_k - 1}\sum_{i = 1}^n I(y_i = k)(x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T$;
  • $\hat{\delta}_k(x) = -\frac{1}{2}x^T\hat{\Sigma}_k^{-1}x + x^T\hat{\Sigma}_k^{-1}\hat{\mu}_k - \frac{1}{2}\hat{\mu}_k^T\hat{\Sigma}_k^{-1}\hat{\mu}_k + \log(\hat{\pi}_k) - \frac{1}{2}\log(|\hat{\Sigma}_k|)$
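
A minimal sketch of QDA using scikit-learn's QuadraticDiscriminantAnalysis (assuming scikit-learn is available); the simulated data are illustrative only.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(2)
# Two classes with different means and different covariance matrices
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 1.0]], 100)
X1 = rng.multivariate_normal([2, 2], [[2.0, -0.5], [-0.5, 0.5]], 100)
X = np.vstack([X0, X1])
y = np.repeat([0, 1], 100)

qda = QuadraticDiscriminantAnalysis(store_covariance=True)
qda.fit(X, y)

print(qda.priors_)              # estimated pi_k
print(qda.means_)               # estimated mu_k
print(qda.covariance_)          # per-class covariance estimates Sigma_k
print(qda.predict([[1.0, 1.0]]))
```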

LDA has $(K-1) + pK + p(p+1)/2$ parameters to estimate, and QDA has $(K-1) + pK + p(p+1)K/2$ parameters to estimate.

  • The estimation error is therefore large when $p$ is large compared to $n$.
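
For example, with $K = 3$ classes and $p = 10$ predictors, LDA estimates $2 + 30 + 55 = 87$ parameters, while QDA estimates $2 + 30 + 165 = 197$.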

Naive Bayes

Naive Bayes assumes that $X_1, \dots, X_p$ are independent given $Y = k$, so their covariance matrix is diagonal.

The discriminant function for (Gaussian) Naive Bayes is: $\delta_k(x) = -\frac{1}{2}\sum_{j = 1}^p\left[\frac{(x_j - \mu_{kj})^2}{\sigma_{kj}^2} + \log(\sigma_{kj}^2)\right] + \log(\pi_k)$, where the $\log(\sigma_{kj}^2)$ term comes from $-\frac{1}{2}\log(|\Sigma_k|)$ with a diagonal $\Sigma_k$.

Naive Bayes easily extends to mixed feature types and is very useful when $p$ is large compared to $n$.
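
A minimal sketch of Gaussian naive Bayes via scikit-learn's GaussianNB (assuming scikit-learn is available); the simulated data are illustrative only.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(3)
# Two classes; the three features are generated independently within each class
X0 = rng.normal(loc=[0.0, 0.0, 0.0], scale=[1.0, 2.0, 0.5], size=(100, 3))
X1 = rng.normal(loc=[1.5, -1.0, 0.5], scale=[1.0, 1.0, 1.0], size=(100, 3))
X = np.vstack([X0, X1])
y = np.repeat([0, 1], 100)

nb = GaussianNB()
nb.fit(X, y)

print(nb.class_prior_)              # estimated pi_k
print(nb.theta_)                    # per-class feature means mu_kj
print(nb.predict([[0.5, -0.5, 0.2]]))
```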

More about discriminant analysis

The False Negative Rate (FNR) is the rate of predicting negative when the truth is positive, and the False Positive Rate (FPR) is the rate of predicting positive when the truth is negative (TNR and TPR are defined similarly).

Given a classifier, we can calculate its FNR and FPR. For LDA, we classify to class 1 when $P(Y = 1 \mid X) \ge 0.5$. If the FNR is too high, we can vary the threshold, using a threshold below 0.5 to obtain a lower FNR.

We also have the ROC curve, a popular graphic for simultaneously displaying the FPR and TPR over all possible thresholds. The area under the ROC curve is called the AUC, a measure of the overall performance of the classifier. A high AUC is good ($= 1$ is the best).
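
A minimal sketch of varying the threshold and computing the ROC curve and AUC with scikit-learn (assuming it is available), on simulated LDA scores; the data are illustrative only.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(1.5, 1, (200, 2))])
y = np.repeat([0, 1], 200)

lda = LinearDiscriminantAnalysis().fit(X, y)
scores = lda.predict_proba(X)[:, 1]            # P(Y = 1 | X)

# Default rule: classify to 1 when P(Y = 1 | X) >= 0.5; a lower threshold reduces FNR
pred_default = (scores >= 0.5).astype(int)
pred_low = (scores >= 0.2).astype(int)
print(np.mean(pred_default[y == 1] == 0))       # FNR at threshold 0.5
print(np.mean(pred_low[y == 1] == 0))           # FNR at threshold 0.2 (smaller)

fpr, tpr, thresholds = roc_curve(y, scores)     # FPR and TPR over all thresholds
print(roc_auc_score(y, scores))                 # AUC: closer to 1 is better
```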

FN is type II error and FP is type I error.