
Multivariate Normal Distribution

The multivariate normal distribution is a generalization of the normal distribution to multiple dimensions. It is often a good approximation to the true distribution: by the Central Limit Theorem, the multivariate normal is the limiting sampling distribution of many multivariate statistics, such as the sample mean vector.

We say a $p$-dimensional random vector $\mathbf{X}$ has a multivariate normal distribution if every linear combination of its components is normally distributed.

Recall that for $X \sim N(\mu, \sigma^2)$, the PDF of the normal distribution is $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$. The exponent can be rewritten as $-\frac{(x-\mu)^2}{2\sigma^2} = -\frac{1}{2}(x-\mu)\sigma^{-2}(x-\mu)$; that is, in the multivariate case the exponent becomes: $-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})$

where $\boldsymbol{\mu}$ is the mean vector and $\boldsymbol{\Sigma}$ is the $p\times p$ variance-covariance matrix, a symmetric positive definite matrix. The PDF of the multivariate normal distribution is given by: $f(\mathbf{x}) = \frac{1}{(2\pi)^{p/2}\det(\Sigma)^{1/2}}e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})}$

  • If we take $CX$ where $C$ is the zero vector, then $CX = 0$ is still counted as (degenerately) normally distributed.
  • When the components are independent (so $\Sigma$ is diagonal), the joint density is the product of the marginal densities and takes exactly the multivariate normal form: $f(\mathbf{x}) = \prod_{i=1}^p f(x_i) = \frac{1}{(2\pi)^{p/2}\det(\Sigma)^{1/2}}e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})}$

That is, we write the random vector $X \sim \mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$; when $\det(\Sigma)=0$, we call it a degenerate distribution. Moreover, $\det(2\pi\Sigma) = (2\pi)^p\det(\Sigma)$, so the PDF of the multivariate normal distribution can also be written as: $f(\mathbf{x}) = \frac{1}{\sqrt{\det(2\pi\Sigma)}}e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})}$
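
As a quick check of the equivalent PDF forms above, here is a minimal sketch (assuming NumPy and SciPy are available; the particular $\mu$, $\Sigma$, and $x$ are arbitrary example values) that evaluates the density by hand and compares it with `scipy.stats.multivariate_normal`:

```python
# Minimal sketch: evaluate f(x) = det(2*pi*Sigma)^{-1/2} exp(-0.5 (x-mu)' Sigma^{-1} (x-mu))
# by hand and compare against SciPy. mu, Sigma, x are arbitrary examples.
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([0.3, -1.5])

d = x - mu
quad = d @ np.linalg.solve(Sigma, d)          # (x-mu)' Sigma^{-1} (x-mu)
manual = np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(2 * np.pi * Sigma))

print(manual)
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # should match
```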

  • E.g., the bivariate normal distribution: $f(x,y) = \frac{1}{2\pi\sqrt{\sigma_1^2\sigma_2^2(1-\rho^2)}}e^{-\frac{1}{2(1-\rho^2)}\left[\frac{(x-\mu_1)^2}{\sigma_1^2}+\frac{(y-\mu_2)^2}{\sigma_2^2}-\frac{2\rho(x-\mu_1)(y-\mu_2)}{\sigma_1\sigma_2}\right]}$. If $\rho = 0$ then $f(x,y) = f(x)f(y)$; here $\rho$ is the correlation, and for the normal distribution zero correlation implies that the two variables are independent.

Let $X \sim \mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with density function $f(\mathbf{x}) = \frac{1}{(2\pi)^{p/2}\det(\Sigma)^{1/2}}e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})}$. Then $X$ has probability density contours $\{\mathbf{x}: (\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}) = c^2\}$, each of which is the surface of an ellipsoid centered at $\boldsymbol{\mu}$.

  • The axes of each ellipsoid of constant density point in the directions of the eigenvectors of $\Sigma^{-1}$.
  • Their lengths are proportional to the reciprocals of the square roots of the eigenvalues of $\Sigma^{-1}$ (equivalently, to the square roots of the eigenvalues of $\Sigma$, since $x$ and $x^{-1}$ are reciprocals of each other).
  • Since the inverse of a positive definite matrix is also positive definite, the half-axes are $\pm c\sqrt{\lambda_i}e_i$ where $\Sigma e_i = \lambda_i e_i$.
  • For a multivariate normal, the quadratic form $(\mathbf{X}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{X}-\boldsymbol{\mu})$ follows the $\chi^2_p$ distribution, so probability $1 - \alpha$ is assigned to the solid ellipsoid $\{\mathbf{x}: (\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}) \leq \chi^2_p(\alpha)\}$, where $\chi^2_p(\alpha)$ is the upper $(100\alpha)$th percentile of the $\chi^2_p$ distribution (see the sketch after this list).
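
A minimal sketch of the ellipsoid geometry and the $\chi^2_p$ coverage claim, assuming NumPy and SciPy; the $\Sigma$ and $\alpha$ below are arbitrary example values:

```python
# Sketch: half-axis lengths c*sqrt(lambda_i) of the constant-density ellipsoid,
# and the chi-square coverage of {x : (x-mu)' Sigma^{-1} (x-mu) <= chi2_p(alpha)}.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
mu = np.array([0.0, 0.0])
Sigma = np.array([[4.0, 1.0],
                  [1.0, 3.0]])
p, alpha = 2, 0.05

lam, vecs = np.linalg.eigh(Sigma)            # Sigma e_i = lambda_i e_i
c = np.sqrt(chi2.ppf(1 - alpha, df=p))       # upper alpha point of chi2_p
print("half-axis lengths:", c * np.sqrt(lam))

# Empirical coverage should be close to 1 - alpha = 0.95.
X = rng.multivariate_normal(mu, Sigma, size=100_000)
d = X - mu
q = np.einsum("ij,ij->i", d @ np.linalg.inv(Sigma), d)
print("coverage:", np.mean(q <= c**2))
```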

Key Properties

Suppose $X$ follows the multivariate normal distribution with mean vector $\boldsymbol{\mu}$ and variance-covariance matrix $\boldsymbol{\Sigma}$; then:

  1. Linear combinations of the components of $X$ are normally distributed.
  2. All subsets of the components of $X$ are also multivariate normally distributed.
  3. Zero covariance implies that the corresponding components are independent.
  4. The conditional distributions of the components are still multivariate normally distributed.

Let $X \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$; then the following are true.

For the 1st, denote $a'X = a_1X_1 + \cdots + a_pX_p$; then $a'X \sim \mathcal{N}(a'\boldsymbol{\mu}, a'\boldsymbol{\Sigma}a)$. E.g., taking $a = [1, 0, \ldots, 0]'$ gives $a'X = X_1$, so this also proves $X_1 \sim \mathcal{N}(\mu_1, \sigma_1^2)$.
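
A small simulation sketch (NumPy assumed; $\mu$, $\Sigma$, and $a$ are arbitrary examples) comparing the empirical mean and variance of $a'X$ with $a'\mu$ and $a'\Sigma a$:

```python
# Sketch: empirical check that a'X ~ N(a' mu, a' Sigma a).
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[2.0, 0.3, 0.1],
                  [0.3, 1.0, 0.2],
                  [0.1, 0.2, 1.5]])
a = np.array([1.0, -2.0, 0.5])

X = rng.multivariate_normal(mu, Sigma, size=200_000)
y = X @ a                                    # samples of a'X

print("theory :", a @ mu, a @ Sigma @ a)     # mean and variance of a'X
print("sample :", y.mean(), y.var())
```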

For the 2nd, we partition $X$ into $X_{1}$ ($q\times 1$) and $X_{2}$ ($(p-q)\times 1$), so $X = [X_{1}; X_{2}]$. Then each subvector is itself multivariate normally distributed (no independence between $X_1$ and $X_2$ is needed): $X_{1} \sim \mathcal{N}(\mu_{1}, \Sigma_{11})$ and $X_{2} \sim \mathcal{N}(\mu_{2}, \Sigma_{22})$, where $\Sigma_{11}$ is the $q\times q$ upper-left submatrix of $\Sigma$ and $\Sigma_{22}$ is the $(p-q)\times (p-q)$ lower-right submatrix. That is, $\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}$, where $\Sigma_{12}$ is the $q\times (p-q)$ submatrix and $\Sigma_{21} = \Sigma_{12}^T$ is the $(p-q)\times q$ submatrix. Likewise $\mu_1 = (\mu_{1}, \ldots, \mu_{q})'$ and $\mu_2 = (\mu_{q+1}, \ldots, \mu_{p})'$ are the corresponding subvectors of $\boldsymbol{\mu}$.

  • We can use the selection matrix $A_{q\times p} = \begin{bmatrix} I_{q\times q} & 0 \end{bmatrix}$ to do the partition: $AX = X_{1}$ and $A \Sigma A^T = \Sigma_{11}$, as in the sketch below.
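
A sketch of that selection matrix in NumPy (the $\Sigma$ here is an arbitrary example):

```python
# Sketch: A = [I_q  0] selects X1, and A Sigma A^T equals Sigma_11.
import numpy as np

p, q = 4, 2
Sigma = np.array([[3.0, 1.0, 0.5, 0.2],
                  [1.0, 2.0, 0.4, 0.1],
                  [0.5, 0.4, 1.5, 0.3],
                  [0.2, 0.1, 0.3, 1.0]])

A = np.hstack([np.eye(q), np.zeros((q, p - q))])   # q x p selection matrix
print(A @ Sigma @ A.T)          # the upper-left q x q block ...
print(Sigma[:q, :q])            # ... i.e. Sigma_11
```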

For the 3rd, $\Sigma_{12} = 0 \iff X_{1}$ and $X_{2}$ are independent, in which case the joint density factors as $f(x_1, x_2) = f(x_1)f(x_2)$.

  • As with jointly normal variables, for binary (Bernoulli) random variables uncorrelated also implies independent; this is not true for general distributions.

For the 4th, partition $X$ into $X_1, X_2$ with $\Sigma_{22} > 0$; then $X_1 \mid X_2 = x_2 \sim \mathcal{N}(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21})$, where $X_1 \mid X_2 = x_2$ denotes the conditional distribution of $X_1$ given $X_2 = x_2$.

  • Let $Y_1 = X_1 - \Sigma_{12}\Sigma_{22}^{-1}X_2$ and $Y_2 = X_2$; then $\begin{bmatrix} Y_1 \\ Y_2\end{bmatrix} = \begin{bmatrix} I & -\Sigma_{12}\Sigma_{22}^{-1} \\ 0 & I \end{bmatrix} \begin{bmatrix} X_1 \\ X_2\end{bmatrix}$
  • Then $\Sigma_y = A \Sigma_x A^T = \begin{bmatrix} I & -\Sigma_{12}\Sigma_{22}^{-1} \\ 0 & I \end{bmatrix} \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \begin{bmatrix} I & 0 \\ -\Sigma_{22}^{-1}\Sigma_{21} & I \end{bmatrix} = \begin{bmatrix} \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & 0 \\ 0 & \Sigma_{22} \end{bmatrix}$
  • So $Y_1 \perp Y_2$, and conditioning on $X_2 = x_2$ just shifts the mean of $Y_1$, giving the conditional distribution above (verified numerically in the sketch after this list).
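
A numerical sketch of this decorrelation argument (NumPy assumed; $\Sigma$ and the block size $q$ are arbitrary): the off-diagonal blocks of $\Sigma_y$ vanish, and the upper-left block is the conditional covariance.

```python
# Sketch: Y = A X with A = [[I, -S12 S22^{-1}], [0, I]] block-diagonalizes Sigma.
import numpy as np

Sigma = np.array([[3.0, 1.0, 0.5],
                  [1.0, 2.0, 0.4],
                  [0.5, 0.4, 1.5]])
q = 1                                # X1 is q x 1, X2 is (p-q) x 1
S11, S12 = Sigma[:q, :q], Sigma[:q, q:]
S21, S22 = Sigma[q:, :q], Sigma[q:, q:]

A = np.block([[np.eye(q), -S12 @ np.linalg.inv(S22)],
              [np.zeros((Sigma.shape[0] - q, q)), np.eye(Sigma.shape[0] - q)]])
Sigma_y = A @ Sigma @ A.T
print(np.round(Sigma_y, 10))                       # off-diagonal blocks are 0
print(S11 - S12 @ np.linalg.inv(S22) @ S21)        # conditional covariance
```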

Jacobian Matrix

Let $U = A(V+C)$ with $A$ invertible; then by the change-of-variables formula, $f_U(u) = \frac{f_V(v)}{|\det A|}\Big|_{v = A^{-1}u - C}$.
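
A sketch verifying this formula on a normal example (NumPy/SciPy assumed; $A$, $C$, and the distribution of $V$ are arbitrary choices), using the fact that $U \sim \mathcal{N}(A(\mu_V + C), A\Sigma_V A^T)$:

```python
# Sketch: check f_U(u) = f_V(A^{-1}u - C) / |det A| for U = A(V + C), V normal.
import numpy as np
from scipy.stats import multivariate_normal

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])
C = np.array([1.0, -1.0])
mu_v = np.array([0.5, 0.5])
Sigma_v = np.array([[1.0, 0.2],
                    [0.2, 2.0]])

u = np.array([3.0, 2.0])
v = np.linalg.solve(A, u) - C                # v = A^{-1} u - C

lhs = multivariate_normal(A @ (mu_v + C), A @ Sigma_v @ A.T).pdf(u)
rhs = multivariate_normal(mu_v, Sigma_v).pdf(v) / abs(np.linalg.det(A))
print(lhs, rhs)                              # should agree
```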

MLE

For random samples $X_1, \ldots, X_n$, the joint density of all those observations is $\prod_{i=1}^n \frac{1}{(2\pi)^{p/2}\det(\Sigma)^{1/2}} \exp\left(-\frac{1}{2}(X_i - \mu)^T\Sigma^{-1}(X_i - \mu)\right) = \frac{1}{(2\pi)^{np/2}\det(\Sigma)^{n/2}} \exp\left(-\frac{1}{2}\sum_{i=1}^n (X_i - \mu)^T\Sigma^{-1}(X_i - \mu)\right)$.

  • The sample mean $\bar{x}$ and the sample covariance matrix $S$ are sufficient statistics for $\mu$ and $\Sigma$.

To obtain the maximum likelihood estimators of the population parameters from the samples, we take the log and set derivatives to zero to find the maximum; the log-likelihood is $\ell(\mu, \Sigma) = -\frac{np}{2}\log(2\pi) - \frac{n}{2}\log\det(\Sigma) - \frac{1}{2}\sum_{i=1}^n (X_i - \mu)^T\Sigma^{-1}(X_i - \mu)$.

  • We also have $\sum_{i=1}^n (X_i - \mu)^T\Sigma^{-1}(X_i - \mu) = \operatorname{tr}\left[\Sigma^{-1} \sum_{i=1}^n (X_i - \mu)(X_i - \mu)^T\right]$.
  • And $\sum_{i=1}^n (X_i - \mu)(X_i - \mu)^T = \sum_{i=1}^n(X_i - \bar{x})(X_i - \bar{x})^T + n(\bar{x} - \mu)(\bar{x} - \mu)^T$.
  • Then we can conclude $\ell(\mu, \Sigma) = -\frac{np}{2}\log(2\pi) - \frac{n}{2}\log\det(\Sigma) - \frac{1}{2}\operatorname{tr}\left[\Sigma^{-1} \sum_{i=1}^n (X_i - \bar{x})(X_i - \bar{x})^T\right] - \frac{n}{2}\operatorname{tr}\left[\Sigma^{-1}(\bar{x} - \mu)(\bar{x} - \mu)^T\right]$, where the last trace equals the scalar $(\bar{x} - \mu)^T \Sigma^{-1} (\bar{x} - \mu)$.
  • That is, the estimator of $\mu$ is $\hat{\mu} = \bar{x}$ and the estimator of $\Sigma$ is $\hat{\Sigma} = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{x})(X_i - \bar{x})^T$ (see the sketch after this list).
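
A sketch of the MLEs in NumPy (the $\mu$ and $\Sigma$ used to generate data are arbitrary); note the divisor $n$ rather than $n-1$:

```python
# Sketch: mu_hat = x_bar and Sigma_hat = (1/n) sum (X_i - x_bar)(X_i - x_bar)^T.
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
X = rng.multivariate_normal(mu, Sigma, size=50_000)

mu_hat = X.mean(axis=0)
D = X - mu_hat
Sigma_hat = (D.T @ D) / len(X)               # divides by n, not n-1
print(mu_hat)
print(Sigma_hat)                             # close to Sigma for large n
# Same result as np.cov(X, rowvar=False, bias=True)
```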

Conditional Distribution

Let $X = (X_1, \ldots, X_p)$ be a random vector following the multivariate normal distribution with mean $\mu$ and covariance matrix $\Sigma$. If we want the distribution of one block of components $(X_i, \ldots)$ given another block $(X_j, \ldots) = x_j$, we can use the formulas $\mu_{i|j} = \mu_i + \Sigma_{ij}\Sigma_{jj}^{-1}(x_j - \mu_j)$ and $\Sigma_{i|j} = \Sigma_{ii} - \Sigma_{ij}\Sigma_{jj}^{-1}\Sigma_{ji}$.
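
A small helper sketch (NumPy assumed; the function name `conditional_mvn` and the example values are mine, not from a library) implementing these two formulas for arbitrary index blocks:

```python
# Sketch: conditional mean and covariance of block i given block j = x_j.
import numpy as np

def conditional_mvn(mu, Sigma, i, j, x_j):
    """Return (mu_{i|j}, Sigma_{i|j}) for index lists i and j."""
    mu_i, mu_j = mu[i], mu[j]
    S_ii = Sigma[np.ix_(i, i)]
    S_ij = Sigma[np.ix_(i, j)]
    S_jj = Sigma[np.ix_(j, j)]
    W = S_ij @ np.linalg.inv(S_jj)           # Sigma_ij Sigma_jj^{-1}
    return mu_i + W @ (x_j - mu_j), S_ii - W @ S_ij.T

mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.5, 0.2],
                  [0.3, 0.2, 1.0]])
print(conditional_mvn(mu, Sigma, i=[0], j=[1, 2], x_j=np.array([1.5, 1.0])))
```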

Examples: Try to Prove the Following Facts

Let $X\sim N(0,1)$ and let $H$ be Bernoulli$(0.5)$ with values $\{1, -1\}$, independent of $X$; then $Y = HX$ is also normally distributed. Intuitively, flipping the sign of a standard normal variable does not change its distribution, since the standard normal is symmetric about $0$; the computation below makes this precise.

  • The density of $Y$ is still normal: $F_Y(y) = P[Y \leq y] = P[HX \leq y \mid H = 1]P[H = 1] + P[HX \leq y \mid H = -1]P[H = -1] = 0.5P[X \leq y] + 0.5P[X \geq -y] = 0.5F_X(y) + 0.5(1 - F_X(-y)) = F_X(y)$ by symmetry, where $F_X$ is the CDF of $X$; so $Y$ and $X$ have the same distribution.
  • $X$ and $Y$ are uncorrelated: $E[XY] = E[HX^2] = E[H]E[X^2] = 0$, where $E[H] = 0.5 - 0.5 = 0$.
  • $P[|X| \geq 1] = c$ and $P[|Y| \geq 1] = c$ for the same constant $c$, since $|Y| = |X|$.
  • $P[|X| \geq 1, |Y| \geq 1] = P[|Y| \geq 1 \mid |X| \geq 1]P[|X| \geq 1] = c$, so $X$ and $Y$ are not independent (independence would make this equal $c^2$).
  • $[X, Y]'$ is not bivariate normally distributed: the components are marginally normal, but e.g. $X + Y$ equals $0$ with probability $0.5$, which is impossible for a jointly normal pair. Uncorrelated normal marginals are independent only when the pair follows a bivariate normal distribution, which this pair does not (see the simulation below).
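
A simulation sketch (NumPy assumed) of this construction: $Y$ passes marginal checks and is uncorrelated with $X$, yet $X + Y$ has an atom at $0$, which rules out joint normality.

```python
# Sketch: Y = H*X is standard normal and uncorrelated with X,
# but (X, Y) is not bivariate normal.
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
X = rng.standard_normal(n)
H = rng.choice([-1.0, 1.0], size=n)            # independent random sign
Y = H * X

print("mean, var of Y:", Y.mean(), Y.var())    # approx 0 and 1
print("corr(X, Y):", np.corrcoef(X, Y)[0, 1])  # approx 0
print("P[X + Y = 0]:", np.mean(X + Y == 0))    # approx 0.5; impossible if jointly normal
```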

Let $X \sim \mathcal{N}_3(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with $\Sigma = \begin{bmatrix} 4 & 1 & 0 \\ 1 & 3 & 0 \\ 0 & 0 & 2\end{bmatrix}$.

  • $X_1$ and $X_2$ are dependent, since $\sigma_{12} = 1 \neq 0$.
  • $(X_1, X_2)$ and $X_3$ are independent, since $\sigma_{13} = \sigma_{23} = 0$.

Let $X = [X_1, X_2, X_3, X_4]'$ with $\Sigma = \begin{bmatrix} 3 & 1 & 1 & 1 \\ 1 & 3 & 1 & 1 \\ 1 & 1 & 3 & 1 \\ 1 & 1 & 1 & 3 \end{bmatrix}$. Let $Y = [Y_1, Y_2, Y_3]'$ where $Y_1 = X_1 - X_2$, $Y_2 = X_1 + X_2 - 2X_3$, $Y_3 = X_1 + X_2 + X_3 - 3X_4$.

  • $Y = \begin{bmatrix} Y_1 \\ Y_2 \\ Y_3\end{bmatrix} = \begin{bmatrix} 1 & -1 & 0 & 0 \\ 1 & 1 & -2 & 0 \\ 1 & 1 & 1 & -3 \end{bmatrix} \begin{bmatrix} X_1 \\ X_2 \\ X_3 \\ X_4 \end{bmatrix} = AX$ where $A = \begin{bmatrix} 1 & -1 & 0 & 0 \\ 1 & 1 & -2 & 0 \\ 1 & 1 & 1 & -3 \end{bmatrix}$.
  • Since $Y = AX$ is a linear transformation of a multivariate normal vector, $Y$ is multivariate normally distributed, with $\Sigma_Y = A\Sigma A' = \begin{bmatrix} 4 & 0 & 0 \\ 0 & 12 & 0 \\ 0 & 0 & 24 \end{bmatrix}$, so $Y_1, Y_2, Y_3$ are uncorrelated. Because they are jointly normal and uncorrelated, they are independent of each other (checked numerically below).
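
A one-line numerical check of $\Sigma_Y$ (NumPy assumed):

```python
# Sketch: verify Sigma_Y = A Sigma A' = diag(4, 12, 24).
import numpy as np

Sigma = 2 * np.eye(4) + np.ones((4, 4))      # 3 on the diagonal, 1 elsewhere
A = np.array([[1, -1,  0,  0],
              [1,  1, -2,  0],
              [1,  1,  1, -3]], dtype=float)
print(A @ Sigma @ A.T)                       # diag(4, 12, 24)
```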

To prove $E[X'AX] = \operatorname{tr}[A\Sigma] + \mu'A\mu$:

$E[X'AX] = E[\operatorname{tr}[X'AX]] = E[\operatorname{tr}[AXX']] = \operatorname{tr}[A\,E[XX']] = \operatorname{tr}[A(\Sigma + \mu\mu')] = \operatorname{tr}[A\Sigma] + \operatorname{tr}[A\mu\mu'] = \operatorname{tr}[A\Sigma] + \operatorname{tr}[\mu'A\mu] = \operatorname{tr}[A\Sigma] + \mu'A\mu$, using $E[XX'] = \Sigma + \mu\mu'$ and the cyclic property of the trace.

  • When $A = I$, $E[X'X] = \operatorname{tr}[\Sigma] + \mu'\mu$, i.e., $\operatorname{tr}[\Sigma] = E[X'X] - \mu'\mu$ (and correspondingly $\Sigma = E[XX'] - \mu\mu'$).
  • Since the factors are $1\times p$, $p\times p$, and $p\times 1$, the product $X'AX$ is $1\times 1$; a scalar equals its own trace. A quick Monte Carlo check follows.
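
A Monte Carlo sketch of the identity (NumPy assumed; $\mu$, $\Sigma$, and $A$ are arbitrary examples):

```python
# Sketch: Monte Carlo check of E[X'AX] = tr(A Sigma) + mu' A mu.
import numpy as np

rng = np.random.default_rng(4)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.1],
                  [0.3, 1.0, 0.2],
                  [0.1, 0.2, 1.5]])
A = np.array([[1.0, 0.5, 0.0],
              [0.5, 2.0, 0.3],
              [0.0, 0.3, 1.0]])

X = rng.multivariate_normal(mu, Sigma, size=500_000)
empirical = np.mean(np.einsum("ni,ij,nj->n", X, A, X))   # average of x' A x
theory = np.trace(A @ Sigma) + mu @ A @ mu
print(empirical, theory)                                 # should be close
```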