
Markov Chain Monte Carlo

Markov chain Monte Carlo (MCMC) is a methodology that uses Markov chains to efficiently simulate otherwise intractable distributions. That is, given a probability distribution $\pi$, the goal of MCMC is to simulate a random variable $X$ whose distribution is $\pi$. The distribution may be continuous or discrete; throughout we assume it is discrete.

SLLN

Why does MCMC work? The strong law of large numbers provides the theoretical support.

Let $X_1, X_2, \ldots$ be a sequence of independent, identically distributed random variables with finite mean $\mu$. The Strong Law of Large Numbers (SLLN) states that the sample average converges with probability 1 to the common mean $\mu$, denoted $\overline{X_n} \overset{a.s.}{\to} \mu$; that is, $P\left(\lim_{n \to \infty} \overline{X_n} = \mu\right) = 1$.
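A quick simulation illustrates the SLLN. The choice of Uniform(0, 1) draws here is just an illustrative example; any i.i.d. sequence with finite mean behaves the same way.

```python
import random

# Empirical check of the SLLN: the sample mean of i.i.d. Uniform(0, 1)
# draws should approach the common mean mu = 0.5 as n grows.
random.seed(0)

n = 100_000
total = 0.0
for _ in range(n):
    total += random.random()

mean = total / n
print(mean)  # close to 0.5
```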

Monte Carlo Integration

One of the most common applications of Monte Carlo methods (based on the SLLN) is Monte Carlo integration. Suppose we want to compute the integral of a function $h(x)$ over a bounded interval $[a, b]$. We can rewrite it as $\mathcal{J} = \int_a^b h(x)\,dx = \int_a^b w(x) f(x)\,dx$, where $w(x) = h(x)(b - a)$ and $f(x) = 1/(b-a)$ is the density of $X \sim U(a, b)$. Then $\mathcal{J} = E_f[w(X)]$, so we can generate $X_1, X_2, \ldots$ from $U(a, b)$ and form the estimator $\hat{\mathcal{J}} = \frac{1}{n} \sum_{i=1}^n w(X_i) \overset{a.s.}{\to} \mathcal{J}$.
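A minimal sketch of this estimator, using $\int_0^\pi \sin(x)\,dx = 2$ as an illustrative test integral (the function name `mc_integrate` is my own):

```python
import math
import random

def mc_integrate(h, a, b, n=100_000, seed=0):
    """Monte Carlo estimate of the integral of h over [a, b]:
    draw X_i ~ Uniform(a, b) and average w(X_i) = h(X_i) * (b - a)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.uniform(a, b)
        total += h(x) * (b - a)
    return total / n

# Example: the integral of sin over [0, pi] is exactly 2.
est = mc_integrate(math.sin, 0.0, math.pi)
print(est)  # close to 2
```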

Markov Chain

Another key ingredient is the following ergodic theorem for Markov chains:

Assume that $X_0, X_1, \ldots$ is an ergodic Markov chain with stationary distribution $\pi$. Let $f$ be a bounded, real-valued function, and let $X$ be a random variable with distribution $\pi$. Then, with probability 1, $\lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^n f(X_i) = E[f(X)] = \sum_j f(j) \pi(j)$.
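The theorem can be checked numerically on a small chain. The two-state transition matrix below is an illustrative choice; solving $\pi P = \pi$ gives the stationary distribution $\pi = (4/7, 3/7)$.

```python
import random

# Time averages along an ergodic chain converge to the stationary
# expectation. Illustrative two-state transition matrix:
#   P = [[0.7, 0.3],
#        [0.4, 0.6]]
# Its stationary distribution is pi = (4/7, 3/7).
P = [[0.7, 0.3], [0.4, 0.6]]
rng = random.Random(1)

state, n, hits = 0, 200_000, 0
for _ in range(n):
    state = 0 if rng.random() < P[state][0] else 1
    hits += state  # f(j) = j, so the sum counts visits to state 1

print(hits / n)  # close to E[f(X)] = 3/7, i.e. about 0.4286
```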

Metropolis-Hastings Algorithm

The most common MCMC method is the Metropolis-Hastings algorithm. It produces an irreducible, aperiodic Markov chain with stationary distribution $\pi$. Specifically:

Let $\pi$ be a discrete probability distribution. Let $P$ be the transition matrix of any irreducible Markov chain on the same state space as $\pi$. $P$ serves as a proposal chain: it generates candidate elements of the sequence, and the algorithm decides whether or not to accept each of them (we assume we know how to sample from $P$). The algorithm constructs a reversible Markov chain $X_0, X_1, \ldots$ whose stationary distribution is $\pi$.

To describe the transition mechanism for $X_0, X_1, \ldots$, assume that at time $n$ the chain is at state $i$. The next state $X_{n+1}$ is determined by a two-step procedure:

  1. Choose a proposed state $j$ according to $P$, that is, with probability $p_{ij}$.
  2. Decide whether to accept $j$. Let $\alpha(i,j) = \frac{\pi(j)p_{ji}}{\pi(i)p_{ij}}$. If $\alpha(i,j) \geq 1$, accept $j$. Otherwise, accept $j$ with probability $\alpha(i,j)$. If $j$ is not accepted, keep $i$ as the next state of the chain. Equivalently, with $U \sim U(0,1)$: $X_{n+1} = \begin{cases} j & \text{if } U \le \alpha(i,j) \\ i & \text{if } U > \alpha(i,j) \end{cases}$
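The two-step procedure can be sketched as follows. The target on $\{0, \ldots, 4\}$ with $\pi(i) \propto i + 1$ and the $\pm 1$ (mod 5) proposal are illustrative choices; since the proposal is symmetric, $\alpha(i,j) = \pi(j)/\pi(i)$, and note that $\pi$ only enters through ratios, so unnormalized weights suffice.

```python
import random

# Metropolis-Hastings for a discrete target on {0, ..., 4} with
# unnormalized weights pi(i) proportional to i + 1 (illustrative choice).
# Proposal: move +/-1 (mod 5) with equal probability -- symmetric, so
# the acceptance ratio simplifies to alpha = pi(j) / pi(i).
weights = [1.0, 2.0, 3.0, 4.0, 5.0]
rng = random.Random(2)

state = 0
counts = [0] * 5
n = 200_000
for _ in range(n):
    j = (state + rng.choice([-1, 1])) % 5  # step 1: propose j with prob p_ij
    alpha = weights[j] / weights[state]    # step 2: acceptance ratio
    if rng.random() <= alpha:              # accept with prob min(1, alpha)
        state = j
    counts[state] += 1

# Empirical frequencies approach pi = (1/15, 2/15, 3/15, 4/15, 5/15).
print([c / n for c in counts])
```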

Some remarks:

  • If $\pi$ is uniform, then $\alpha(i,j) = \frac{p_{ji}}{p_{ij}}$.
  • If $P$ is symmetric, then $\alpha(i,j) = \frac{\pi(j)}{\pi(i)}$.
  • If $P$ is ergodic, then the resulting chain is ergodic.
  • Early iterations are biased by the choice of initial state, which motivates a burn-in period (i.e. discarding the initial iterations).

Gibbs Sampling

The Gibbs sampler is a special case of the Metropolis–Hastings algorithm, with a particular choice of proposal distribution.

Suppose $f: S \times S \to (0, \infty)$ is such that $c = \sum_{(i,j) \in S\times S} f(i,j) < \infty$. Let $g = c^{-1} f$ be the joint pmf on $S\times S$, with marginals $g_1(i) = c^{-1} \sum_{j \in S} f(i,j)$ and $g_2(j) = c^{-1} \sum_{i \in S} f(i,j)$. Since it is hard to generate directly from the higher-dimensional joint distribution $g(i,j)$, we instead use the conditional distributions $g_1(i \mid j) = g(i,j)/g_2(j)$ and $g_2(j \mid i) = g(i,j)/g_1(i)$. Then we can use the following algorithm:

For the $n$th step, given $(X_{n-1}, Y_{n-1})$:

  1. Generate $V \sim \text{Bernoulli}(1/2)$.
  2. If $V = 0$, put $Y_n = Y_{n-1}$ and generate $X_n \sim g_1(\cdot \mid Y_n)$; if $V = 1$, put $X_n = X_{n-1}$ and generate $Y_n \sim g_2(\cdot \mid X_n)$.
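The two steps above can be sketched with a toy joint pmf on $\{0,1\} \times \{0,1\}$. The weight table $f$ below is an illustrative choice (so $c = 10$ and $g = f/10$); note the conditionals only require the row or column of $f$, never the normalizing constant $c$.

```python
import random

# Random-scan Gibbs sampler for a toy joint pmf on {0,1} x {0,1}.
# f holds illustrative unnormalized weights; the target is g = f / 10.
f = [[4.0, 1.0],
     [1.0, 4.0]]
rng = random.Random(3)

def sample_conditional(w):
    """Draw index 0 or 1 with probability proportional to the weights w."""
    return 0 if rng.random() < w[0] / (w[0] + w[1]) else 1

x, y = 0, 0
counts = [[0, 0], [0, 0]]
n = 200_000
for _ in range(n):
    if rng.random() < 0.5:                     # V = 0: keep y, resample X
        x = sample_conditional([f[0][y], f[1][y]])  # X ~ g1(. | y)
    else:                                      # V = 1: keep x, resample Y
        y = sample_conditional(f[x])                # Y ~ g2(. | x)
    counts[x][y] += 1

# Empirical frequencies approach g(i, j) = f(i, j) / 10.
print([[c / n for c in row] for row in counts])
```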