Probabilistic Principal Component Analysis

Sometimes data is very high dimensional, yet its important features can be captured accurately in a low-dimensional subspace. That is why we use PCA.

Given a data set $\{x^{(i)}\}_{i=1}^N$ with each vector $x^{(i)} \in \mathbb{R}^D$, PCA seeks an approximation $x^{(i)} \approx \overline{x} + Uz^{(i)}$, where $\overline{x} = \frac{1}{N}\sum_{i=1}^N x^{(i)}$ is the mean of the data set, $U \in \mathbb{R}^{D \times K}$ is the orthogonal basis matrix of the principal subspace, and $z^{(i)} \in \mathbb{R}^K$ is the code vector.

Since we have $z^{(i)} = U^{\top}(x^{(i)} - \overline{x})$, we can rewrite the reconstruction as $x^{(i)} \approx \overline{x} + UU^{\top}(x^{(i)} - \overline{x})$; that is, we choose $U$ by minimizing the reconstruction error, $U^* = \arg\min_{U} \sum_{i=1}^N \|x^{(i)} - \overline{x} - UU^{\top}(x^{(i)} - \overline{x})\|^2$.
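
As a concrete illustration, here is a minimal NumPy sketch of classical PCA (the data matrix `X`, the synthetic sizes, and the helper name `pca_fit` are assumptions for illustration): it builds $U$ from the top-$K$ eigenvectors of the empirical covariance and evaluates the reconstruction error above.

```python
import numpy as np

def pca_fit(X, K):
    """Return the mean and an orthonormal basis U of the K-dimensional principal subspace."""
    x_bar = X.mean(axis=0)                      # data mean, shape (D,)
    Xc = X - x_bar                              # centered data, shape (N, D)
    cov = Xc.T @ Xc / X.shape[0]                # empirical covariance, shape (D, D)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    U = eigvecs[:, ::-1][:, :K]                 # top-K eigenvectors as columns, shape (D, K)
    return x_bar, U

# Hypothetical usage on synthetic data
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10)) @ rng.standard_normal((10, 10))
x_bar, U = pca_fit(X, K=3)
Z = (X - x_bar) @ U                             # code vectors z^(i) = U^T (x^(i) - x_bar)
X_hat = x_bar + Z @ U.T                         # reconstructions
recon_error = np.sum((X - X_hat) ** 2)          # the objective PCA minimizes
```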

PPCA on Gaussian Data

We use PPCA when a Gaussian latent variable model $p(x) = \int p(x, z)\,dz$ is used for dimensionality reduction.

Consider the latent variable model $z \sim \mathcal{N}(0, I_K)$, $x \mid z \sim \mathcal{N}(Wz + \mu, \sigma^2 I_D)$.

As usual, to estimate the parameters of the model we use maximum likelihood, that is, $\max_{W, \mu, \sigma^2} \log p(x \mid W, \mu, \sigma^2) = \max_{W, \mu, \sigma^2} \log \int p(x \mid z, W, \mu, \sigma^2)\, p(z)\, dz$.

Since $x \mid z \sim \mathcal{N}(Wz + \mu, \sigma^2 I_D)$, we can write $x = Wz + \mu + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2 I_D)$. Then $p(x \mid W, \mu, \sigma^2)$ is also a Gaussian distribution (a small numerical check follows the list below), and

  • $E[x] = E[Wz + \mu + \epsilon] = E[Wz] + \mu = \mu$
  • $Cov[x] = E[(x - E[x])(x - E[x])^{\top}] = E[(Wz + \mu + \epsilon - \mu)(Wz + \mu + \epsilon - \mu)^{\top}] = E[(Wz + \epsilon)(Wz + \epsilon)^{\top}] = E[Wzz^{\top}W^{\top}] + E[\epsilon\epsilon^{\top}] = WE[zz^{\top}]W^{\top} + \sigma^2 I_D = WW^{\top} + \sigma^2 I_D$ (the cross terms vanish because $z$ and $\epsilon$ are independent and zero-mean)
    • Since $WW^{\top} = (WR)(WR)^{\top}$ for any orthogonal matrix $R$, the model is identifiable only up to a rotation of $W$
  • That is, $p(x \mid W, \mu, \sigma^2) = \mathcal{N}(x \mid \mu, WW^{\top} + \sigma^2 I_D) = \frac{1}{(2\pi)^{D/2}|Cov[x]|^{1/2}}\exp\!\left(-\frac{1}{2}(x - \mu)^{\top}Cov[x]^{-1}(x - \mu)\right)$
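
To make this derivation concrete, the following sketch (with arbitrary, made-up values of $W$, $\mu$, and $\sigma^2$) draws samples by ancestral sampling and checks numerically that the sample mean approaches $\mu$ and the sample covariance approaches $WW^{\top} + \sigma^2 I_D$.

```python
import numpy as np

rng = np.random.default_rng(1)
D, K, N = 5, 2, 200_000
W = rng.standard_normal((D, K))        # arbitrary loading matrix (assumption)
mu = rng.standard_normal(D)            # arbitrary mean (assumption)
sigma2 = 0.1                           # arbitrary noise variance (assumption)

z = rng.standard_normal((N, K))                       # z ~ N(0, I_K)
eps = np.sqrt(sigma2) * rng.standard_normal((N, D))   # eps ~ N(0, sigma^2 I_D)
x = z @ W.T + mu + eps                                # x = W z + mu + eps

# Both deviations should be close to zero for large N.
print(np.abs(x.mean(axis=0) - mu).max())                                   # E[x] = mu
print(np.abs(np.cov(x, rowvar=False) - (W @ W.T + sigma2 * np.eye(D))).max())  # Cov[x] = W W^T + sigma^2 I_D
```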

The posterior distribution $p(z \mid x)$ is given by (a short code sketch follows the list):

  • $E[z \mid x] = (W^{\top}W + \sigma^2 I_K)^{-1} W^{\top}(x - \mu)$
    • $\lim_{\sigma^2 \to 0} E[z \mid x] = (W^{\top}W)^{-1}W^{\top}(x - \mu)$, that is, the orthogonal projection of $x$ onto the principal subspace
  • $Cov[z \mid x] = \sigma^2 (W^{\top}W + \sigma^2 I_K)^{-1}$
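
A brief sketch of these posterior formulas in NumPy (the function name `ppca_posterior` is a hypothetical helper):

```python
import numpy as np

def ppca_posterior(x, W, mu, sigma2):
    """Posterior mean and covariance of z given a single observation x."""
    K = W.shape[1]
    M = W.T @ W + sigma2 * np.eye(K)              # (K, K)
    mean = np.linalg.solve(M, W.T @ (x - mu))     # E[z|x] = M^{-1} W^T (x - mu)
    cov = sigma2 * np.linalg.inv(M)               # Cov[z|x] = sigma^2 M^{-1}
    # As sigma2 -> 0, mean -> (W^T W)^{-1} W^T (x - mu): the projection onto the principal subspace.
    return mean, cov
```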

Then the log likelihood of the data set is: $\log p(\{x^{(i)}\}_{i=1}^N \mid W, \mu, \sigma^2) = \sum_{i=1}^N \log p(x^{(i)} \mid W, \mu, \sigma^2) = -\frac{ND}{2}\log(2\pi) - \frac{N}{2}\log|Cov[x]| - \frac{1}{2}\sum_{i=1}^N (x^{(i)} - \mu)^{\top} Cov[x]^{-1} (x^{(i)} - \mu)$
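
Assuming SciPy is available, this log likelihood can be evaluated directly from the marginal Gaussian; here `X` is an $(N, D)$ data matrix and the function name is illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def ppca_log_likelihood(X, W, mu, sigma2):
    """Total log likelihood of the data under x ~ N(mu, W W^T + sigma^2 I_D)."""
    D = X.shape[1]
    C = W @ W.T + sigma2 * np.eye(D)                      # marginal covariance Cov[x]
    return multivariate_normal.logpdf(X, mean=mu, cov=C).sum()
```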

The MLE is available in closed form (a code sketch follows the list):

  • $\hat{\mu} = \frac{1}{N}\sum_{i=1}^N x^{(i)} = \overline{x}$
  • $\hat{\sigma}^2 = \frac{1}{D-K}\sum_{i=K+1}^D \lambda_i$
  • $\widehat{W} = \widehat{U}(\widehat{\Lambda} - \hat{\sigma}^2 I_K)^{1/2} R$, built from the eigendecomposition of the empirical covariance, where:
    • the columns of $\widehat{U} \in \mathbb{R}^{D \times K}$ are the $K$ unit eigenvectors of the empirical covariance matrix $\widehat{\Sigma}$ with the largest eigenvalues,
    • $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_D$ are the eigenvalues of $\widehat{\Sigma}$,
    • $\widehat{\Lambda} = \mathrm{diag}(\lambda_1, \ldots, \lambda_K)$ is the diagonal matrix whose elements are the corresponding eigenvalues, and $R$ is any orthogonal matrix.
  • $\widehat{Cov}[x] = \widehat{W}\widehat{W}^{\top} + \hat{\sigma}^2 I_D = \widehat{U}(\widehat{\Lambda} - \hat{\sigma}^2 I_K)^{1/2} R R^{\top} (\widehat{\Lambda} - \hat{\sigma}^2 I_K)^{1/2} \widehat{U}^{\top} + \hat{\sigma}^2 I_D = \widehat{U}(\widehat{\Lambda} - \hat{\sigma}^2 I_K)\widehat{U}^{\top} + \hat{\sigma}^2 I_D$
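
Putting the closed-form MLE together in a NumPy sketch (taking $R = I_K$; the function name `ppca_mle` is illustrative):

```python
import numpy as np

def ppca_mle(X, K):
    """Closed-form maximum likelihood estimates for PPCA (with R = I)."""
    N, D = X.shape
    mu_hat = X.mean(axis=0)
    Sigma_hat = np.cov(X, rowvar=False, bias=True)        # empirical covariance (1/N normalization)
    eigvals, eigvecs = np.linalg.eigh(Sigma_hat)          # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]    # sort descending
    sigma2_hat = eigvals[K:].mean()                       # average of the discarded eigenvalues
    U_hat = eigvecs[:, :K]                                # top-K unit eigenvectors
    Lambda_hat = np.diag(eigvals[:K])
    W_hat = U_hat @ np.sqrt(Lambda_hat - sigma2_hat * np.eye(K))
    return mu_hat, W_hat, sigma2_hat
```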

Why does this PPCA model capture the data well? Consider the variance of the data along different directions (checked numerically after the list):

  • Let $u_i$ be one of the unit eigenvectors forming the columns of $\widehat{U}$; then $Var[u_i^{\top}(x - \overline{x})] = u_i^{\top}\widehat{Cov}[x]u_i = u_i^{\top}\widehat{U}(\widehat{\Lambda} - \hat{\sigma}^2 I_K)\widehat{U}^{\top}u_i + \hat{\sigma}^2 = \lambda_i - \hat{\sigma}^2 + \hat{\sigma}^2 = \lambda_i$
  • Let $u_i$ be a unit vector orthogonal to the subspace spanned by $\widehat{U}$; then $Var[u_i^{\top}(x - \overline{x})] = u_i^{\top}\widehat{Cov}[x]u_i = u_i^{\top}\widehat{U}(\widehat{\Lambda} - \hat{\sigma}^2 I_K)\widehat{U}^{\top}u_i + \hat{\sigma}^2 = \hat{\sigma}^2$
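
A quick, self-contained numerical sanity check of these two variance identities on synthetic data (all sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((50_000, 6)) @ rng.standard_normal((6, 6))   # synthetic data (assumption)
K = 2

Sigma_hat = np.cov(X, rowvar=False, bias=True)
eigvals, eigvecs = np.linalg.eigh(Sigma_hat)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]                   # descending order

sigma2_hat = eigvals[K:].mean()
U_hat = eigvecs[:, :K]
Cov_hat = U_hat @ np.diag(eigvals[:K] - sigma2_hat) @ U_hat.T + sigma2_hat * np.eye(6)

u_in = eigvecs[:, 0]                          # direction inside the principal subspace
u_out = eigvecs[:, -1]                        # direction orthogonal to the principal subspace
print(u_in @ Cov_hat @ u_in, eigvals[0])      # both equal lambda_1
print(u_out @ Cov_hat @ u_out, sigma2_hat)    # both equal sigma^2_hat
```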

Why PPCA on Gaussian Data?

Fitting a full-covariance Gaussian model of the data requires $D(D+1)/2 + D$ parameters. With PPCA we model only the $K$ most significant correlations, which requires only $O(KD)$ parameters as long as $K$ is small.
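
For example, with $D = 1000$ and $K = 10$ (an arbitrary choice of sizes), a full-covariance Gaussian needs $D(D+1)/2 + D = 501{,}500$ parameters, while PPCA needs only $DK + D + 1 = 11{,}001$ (for $W$, $\mu$, and $\sigma^2$).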

Bayesian PCA gives us a Bayesian method for determining the low-dimensional principal subspace.

The existence of a likelihood function allows direct comparison with other probabilistic models.

Instead of solving for the MLE directly, we can also use EM, which scales to very large, high-dimensional datasets.
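
Below is a minimal sketch of EM for PPCA under the model above (the initialization, iteration count, and function name are illustrative assumptions; the updates are the standard E- and M-steps for this model):

```python
import numpy as np

def ppca_em(X, K, n_iter=100, seed=0):
    """Fit PPCA by EM instead of the closed-form eigendecomposition."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    mu = X.mean(axis=0)
    Xc = X - mu
    W = rng.standard_normal((D, K))                       # random initialization (assumption)
    sigma2 = 1.0

    for _ in range(n_iter):
        # E-step: posterior moments of each latent code z^(i)
        M = W.T @ W + sigma2 * np.eye(K)                  # (K, K)
        Minv = np.linalg.inv(M)
        Ez = Xc @ W @ Minv                                # rows are E[z | x^(i)]
        Ezz = N * sigma2 * Minv + Ez.T @ Ez               # sum_i E[z z^T | x^(i)]

        # M-step: update W, then sigma^2
        W = (Xc.T @ Ez) @ np.linalg.inv(Ezz)
        sigma2 = (np.sum(Xc ** 2)
                  - 2 * np.sum(Ez * (Xc @ W))
                  + np.trace(Ezz @ W.T @ W)) / (N * D)
    return mu, W, sigma2

# Hypothetical usage: mu_hat, W_hat, sigma2_hat = ppca_em(X, K=3)
```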