We have multiple ways to draw samples from a distribution.
Ancestral Sampling
Sampling from a joint distribution by visiting the variables in topological order is called ancestral sampling.
- At each step, sample from any conditional distribution you haven't visited yet whose parents have all been sampled.
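A minimal sketch of ancestral sampling on a hypothetical three-node chain A → B → C with binary variables (the conditional probability tables here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical chain A -> B -> C with binary variables.
p_a = np.array([0.6, 0.4])            # p(A)
p_b_given_a = np.array([[0.7, 0.3],   # p(B | A=0)
                        [0.2, 0.8]])  # p(B | A=1)
p_c_given_b = np.array([[0.9, 0.1],   # p(C | B=0)
                        [0.5, 0.5]])  # p(C | B=1)

def ancestral_sample():
    # Visit variables in topological order; each variable is sampled
    # only after all of its parents have been sampled.
    a = rng.choice(2, p=p_a)
    b = rng.choice(2, p=p_b_given_a[a])
    c = rng.choice(2, p=p_c_given_b[b])
    return a, b, c

samples = [ancestral_sample() for _ in range(5)]
print(samples)
```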
Simple Monte Carlo
We can use Monte Carlo methods to solve two problems:
- To generate samples $\{x^{(r)}\}_{r=1}^R$ from a given pdf $p(x)$
- To estimate the expectation of a function $\phi(x)$ (the test function) under the pdf $p(x)$, i.e. $\mathbb{E}_{x \sim p(x)}[\phi(x)] = \int \phi(x)\,p(x)\,dx$
We define Simple Monte Carlo as follows: given $\{x^{(r)}\}_{r=1}^R \sim p(x)$, we estimate the expectation by $\mathbb{E}_{x \sim p(x)}[\phi(x)] \approx \frac{1}{R}\sum_{r=1}^R \phi(x^{(r)}) = \hat\Phi$.
- $\hat\Phi$ is a consistent estimator of $\mathbb{E}_{x \sim p(x)}[\phi(x)]$ by the law of large numbers; it is also unbiased, with $\mathbb{E}[\hat\Phi] = \Phi$ and $\mathbb{V}[\hat\Phi] = \frac{1}{R}\mathbb{V}[\phi]$
- The accuracy of the Monte Carlo estimate depends on the variance of $\phi$
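As a minimal sketch, here is Simple Monte Carlo for the illustrative choice $p(x) = \mathcal{N}(0, 1)$ and test function $\phi(x) = x^2$ (so the true expectation is 1):

```python
import numpy as np

rng = np.random.default_rng(0)

R = 10_000
x = rng.standard_normal(R)               # {x^(r)} ~ p(x) = N(0, 1)
phi = x ** 2                             # test function phi(x) = x^2

phi_hat = phi.mean()                     # (1/R) * sum_r phi(x^(r))
std_err = phi.std(ddof=1) / np.sqrt(R)   # accuracy is governed by V[phi] / R
print(phi_hat, std_err)                  # phi_hat is close to E[x^2] = 1
```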
Importance Sampling
The above two problems assume we can draw samples from the pdf $p(x)$, but what if we cannot sample from $p(x)$ directly? We can use importance sampling to solve this problem. Importance sampling is a method to estimate the expectation of a function $\phi(x)$ under $p(x)$ using samples from a simpler distribution. We assume:
- The pdf we want to draw samples from can be evaluated up to a normalizing constant: $p(x) = \tilde p(x)/Z_p$
- We also have a simpler pdf $q(x)$, which we can sample from and evaluate up to a normalizing constant: $q(x) = \tilde q(x)/Z_q$
In importance sampling:
- We draw $R$ samples $\{x^{(r)}\}_{r=1}^R$ from $q(x)$.
- Then we attach a weight to each sample to correct for having drawn it from $q(x)$ rather than $p(x)$: $\tilde w_r = \tilde p(x^{(r)})/\tilde q(x^{(r)})$.
- By the normalization property, $\frac{1}{R}\sum_{r=1}^R \tilde w_r \approx \mathbb{E}_{x \sim q(x)}\!\left[\tilde p(x)/\tilde q(x)\right] = \int \frac{\tilde p(x)}{\tilde q(x)}\,q(x)\,dx = Z_p/Z_q$.
- Rewriting Simple Monte Carlo: $\Phi = \int \phi(x)\,p(x)\,dx = \int \phi(x)\,\frac{p(x)}{q(x)}\,q(x)\,dx \approx \frac{1}{R}\sum_{r=1}^R \phi(x^{(r)})\,\frac{p(x^{(r)})}{q(x^{(r)})} = \frac{Z_q}{Z_p}\cdot\frac{1}{R}\sum_{r=1}^R \phi(x^{(r)})\,\frac{\tilde p(x^{(r)})}{\tilde q(x^{(r)})} \approx \sum_{r=1}^R \phi(x^{(r)})\,w_r = \hat\Phi_{iw}$, where $w_r = \frac{\tilde w_r}{\sum_{r'=1}^R \tilde w_{r'}}$ and $\hat\Phi_{iw}$ is the importance-weighted estimator of $\Phi$ (the unknown ratio $Z_q/Z_p$ is replaced by its estimate $R/\sum_r \tilde w_r$ from the previous bullet).
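A minimal sketch of the self-normalized (importance-weighted) estimator above, using a made-up unnormalized target $\tilde p$ and a Gaussian proposal $q$:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(x):
    # Unnormalized target: Gaussian with mean 3, sd 1 (Z_p unknown to us).
    return np.exp(-0.5 * (x - 3.0) ** 2)

def q_tilde(x):
    # Unnormalized proposal: N(0, 2^2), wide enough to cover the target.
    return np.exp(-0.5 * (x / 2.0) ** 2)

R = 100_000
x = 2.0 * rng.standard_normal(R)    # {x^(r)} ~ q(x) = N(0, 2^2)

w_tilde = p_tilde(x) / q_tilde(x)   # unnormalized weights w~_r
w = w_tilde / w_tilde.sum()         # normalized weights w_r

phi = x                             # test function phi(x) = x
phi_hat_iw = np.sum(w * phi)        # importance-weighted estimate of E_p[x] = 3
print(phi_hat_iw)
```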
Rejection Sampling
Rejection sampling is another method that does not require the normalized pdf $p(x)$. As in importance sampling, we use a simpler proposal density $q(x)$ (which can be evaluated within a multiplicative factor $Z_q$). We assume there is a constant $c$ such that $c\,\tilde q(x) > \tilde p(x)$ for all $x$. The rejection sampling procedure is as follows:
- Generate two random variables:
- Denote the one generated from the proposal density $q(x)$ as $x$.
- Denote the one drawn uniformly from the interval $[0, c\,\tilde q(x)]$ as $u$.
- Evaluate $\tilde p(x)$ and accept or reject the sample $x$ by comparing the value of $u$ with the value of $\tilde p(x)$:
- If $u > \tilde p(x)$, reject the sample $x$.
- Otherwise, accept the sample $x$: add $x$ to the set of samples $\{x^{(r)}\}_{r=1}^R$ and discard $u$.
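A minimal sketch of this procedure, with an illustrative bimodal unnormalized target and a Gaussian proposal; the constant c is chosen by hand here:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(x):
    # Illustrative unnormalized bimodal target (mixture of two bumps).
    return np.exp(-0.5 * (x - 2) ** 2) + np.exp(-0.5 * (x + 2) ** 2)

def q_tilde(x):
    # Unnormalized proposal with the shape of N(0, 3^2).
    return np.exp(-0.5 * (x / 3.0) ** 2)

c = 12.0  # chosen by inspection so that c * q_tilde(x) > p_tilde(x) for all x

samples = []
while len(samples) < 1000:
    x = 3.0 * rng.standard_normal()        # x ~ q(x)
    u = rng.uniform(0.0, c * q_tilde(x))   # u ~ Uniform[0, c * q_tilde(x)]
    if u <= p_tilde(x):                    # accept iff u falls under p_tilde
        samples.append(x)
```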
Some facts about rejection sampling:
- In high-dimensional problems, the required $c$ is typically huge, so samples are rarely accepted
- Finding a valid $c$ is hard
- The acceptance rate is the ratio of the area under $\tilde p(x)$ to the area under $c\,\tilde q(x)$, i.e. $Z_p/(c\,Z_q)$
- The accepted points $\{x^{(r)}\}_{r=1}^R$ are independent
Markov Chain Monte Carlo (MCMC)
Before we talk about MCMC, we should know what a Markov chain (MC) is.
Markov Chain
See STA 447 or STA 347 for more details on Markov chains.
Recall that a MC with stationary distribution $\pi$ is reversible if $\pi_i p_{ij} = \pi_j p_{ji}$ for all $i, j$, where $p_{ij}$ is the transition probability from state $i$ to state $j$. In this course we call this condition the Detailed Balance Equation (DB).
Metropolis-Hastings Algorithm
As we described above, importance sampling and rejection sampling work well only when the proposal density $q(x)$ is similar to the target density $p(x)$, and such a $q(x)$ is hard to find in high-dimensional problems. The Metropolis-Hastings algorithm solves this problem by making the proposal depend on the current state $x^{(t)}$, i.e. $q(x' \mid x^{(t)})$. The algorithm is as follows:
- A tentative new state $x'$ is generated from the proposal density $q(x' \mid x^{(t)})$. We accept the new state with probability $A(x' \mid x^{(t)}) = \min\left\{1, \frac{\tilde p(x')\,q(x^{(t)} \mid x')}{\tilde p(x^{(t)})\,q(x' \mid x^{(t)})}\right\}$; if the proposal is rejected, the chain stays at the current state, $x^{(t+1)} = x^{(t)}$. (A code sketch follows the facts below.)
The facts about this algorithm:
- $q(x' \mid x^{(t)})$ does not necessarily look like $p(x)$
- This algorithm generates a dependent sequence
- Each sample $x^{(t)}$ has a pdf that depends on the previous sample $x^{(t-1)}$ (i.e., the samples form a Markov chain)
- The chain needs to be run for some time before it generates samples from the target distribution $p(x)$; once $T$ is large enough, we do Monte Carlo estimation of the mean of the test function $\phi$: $\mathbb{E}_{x \sim p(x)}[\phi(x)] \approx \frac{1}{T}\sum_{t=1}^T \phi(x^{(t)})$
- The procedure defines a Markov chain whose stationary distribution $\pi(x)$ equals the target distribution $p(x)$ (i.e. $\pi(x) = p(x)$)
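Here is a minimal sketch of Metropolis-Hastings with a symmetric Gaussian random-walk proposal (so the $q$ terms in the acceptance ratio cancel); the target $\tilde p$ and the step size are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p_tilde(x):
    # Illustrative unnormalized log-target: standard Gaussian.
    return -0.5 * x ** 2

def metropolis_hastings(T=10_000, step=1.0, x0=0.0):
    x = x0
    chain = np.empty(T)
    for t in range(T):
        x_new = x + step * rng.standard_normal()  # symmetric q(x'|x): q terms cancel
        log_A = log_p_tilde(x_new) - log_p_tilde(x)
        if np.log(rng.uniform()) < log_A:         # accept with prob min(1, p~(x')/p~(x))
            x = x_new
        chain[t] = x                              # on rejection, the chain repeats x^(t)
    return chain

chain = metropolis_hastings()
print(chain[1000:].mean())  # Monte Carlo mean after discarding burn-in
```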
Gibbs Sampling
Suppose $x$ is a vector of $d$ variables. We can use Gibbs sampling to sample from the joint distribution $p(x)$ by sampling each component $x_j$ in turn from its conditional distribution. The algorithm is as follows:
- Initialize $x^{(0)} = (x_1^{(0)}, x_2^{(0)}, \cdots, x_d^{(0)})$. For each iteration $t$:
- For $j = 1, 2, \cdots, d$:
- Sample $x_j^{(t)}$ from its conditional distribution given the other components, $x_j^{(t)} \sim p(x_j \mid x_{-j})$, where $x_{-j}$ holds the most recent values of all components other than $x_j$ (those already updated in this sweep at iteration $t$, the rest from iteration $t-1$).
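A minimal sketch of Gibbs sampling for an illustrative bivariate Gaussian target with correlation $\rho$, where each full conditional $p(x_j \mid x_{-j})$ is a one-dimensional Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)

rho = 0.8                       # correlation of the illustrative bivariate Gaussian
sd = np.sqrt(1.0 - rho ** 2)    # sd of each full conditional

def gibbs(T=10_000):
    x1, x2 = 0.0, 0.0           # x^(0)
    chain = np.empty((T, 2))
    for t in range(T):
        # Sweep over components, conditioning on the most recent values.
        x1 = rho * x2 + sd * rng.standard_normal()  # x1 ~ p(x1 | x2)
        x2 = rho * x1 + sd * rng.standard_normal()  # x2 ~ p(x2 | x1)
        chain[t] = (x1, x2)
    return chain

chain = gibbs()
print(np.corrcoef(chain[1000:].T))  # empirical correlation is close to rho
```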
Hamiltonian Monte Carlo
Hamiltonian Monte Carlo (HMC) is a variant of the Metropolis-Hastings algorithm with a specialized proposal mechanism. This algorithm uses a physical analogy to make proposals: the target defines a potential energy $E(x) = -\log \tilde p(x)$. Before we talk about HMC, we should know what the Hamiltonian is.
- The momentum $\nu$ carries the kinetic energy $K(\nu) = \frac{1}{2}\nu^T\nu$; it follows a Gaussian distribution and is independent of the position $x$.
- The total energy, or Hamiltonian, is $H(x, \nu) = E(x) + K(\nu)$.
- Energy is preserved:
- A frictionless ball rolling: $(x, \nu) \to (x', \nu')$
- $H(x, \nu) = H(x', \nu')$
- Ideal Hamiltonian dynamics are reversible: reverse $\nu$ and the ball returns to its starting point, $(x', -\nu') \to (x, -\nu)$!
- Use $\frac{dx}{dt} = \frac{\partial H}{\partial \nu} = \frac{\partial K}{\partial \nu}$ and $\frac{d\nu}{dt} = -\frac{\partial H}{\partial x} = -\frac{\partial E}{\partial x}$.
- Then $\frac{dH}{dt} = \sum_i \frac{\partial E}{\partial x_i}\frac{dx_i}{dt} + \sum_i \frac{\partial K}{\partial \nu_i}\frac{d\nu_i}{dt} = 0$.
- We can apply the leap-frog integrator, which is a numerical approximation of the Hamiltonian dynamics:
- $\nu(t + \tfrac{\epsilon}{2}) = \nu(t) + \tfrac{\epsilon}{2}\,\tfrac{d\nu}{dt}(t) = \nu(t) - \tfrac{\epsilon}{2}\,\tfrac{\partial E}{\partial x}(x(t))$
- $x(t + \epsilon) = x(t) + \epsilon\,\tfrac{dx}{dt}(t) = x(t) + \epsilon\,\tfrac{\partial K}{\partial \nu}(\nu(t + \tfrac{\epsilon}{2}))$
- $\nu(t + \epsilon) = \nu(t + \tfrac{\epsilon}{2}) - \tfrac{\epsilon}{2}\,\tfrac{\partial E}{\partial x}(x(t + \epsilon))$
- The joint distribution is $p(x, \nu) \propto \exp(-H(x, \nu))$.
That is, the MCMC procedure is (run until mixing):
- Start at $(x^{(0)}, \nu^{(0)})$.
- Sample the momentum from the standard Gaussian distribution, $\nu^{(t)} \sim \mathcal{N}(0, I)$.
- Starting from $(x, \nu) = (x^{(t-1)}, \nu^{(t)})$, run the leap-frog integrator for $L$ steps of size $\epsilon$ to reach $(x', \nu')$.
- Accept the new state $(x', -\nu')$ with probability $\min\{1, \exp(H(x^{(t-1)}, \nu^{(t)}) - H(x', \nu'))\}$ (a move to lower energy is more likely to be accepted).
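A minimal sketch of HMC with the leap-frog integrator for an illustrative Gaussian target ($E(x) = \frac{1}{2}x^Tx$); the step size $\epsilon$ and number of leap-frog steps $L$ are tuning parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def E(x):        # potential energy E(x) = -log p_tilde(x); Gaussian target here
    return 0.5 * np.dot(x, x)

def grad_E(x):   # gradient of the potential energy
    return x

def leapfrog(x, v, eps, L):
    # Numerically follow Hamiltonian dynamics for L steps of size eps.
    v = v - 0.5 * eps * grad_E(x)       # initial half step for momentum
    for _ in range(L - 1):
        x = x + eps * v                 # full step for position (K = 0.5 v^T v)
        v = v - eps * grad_E(x)         # full step for momentum
    x = x + eps * v
    v = v - 0.5 * eps * grad_E(x)       # final half step for momentum
    return x, v

def hmc(T=5_000, eps=0.1, L=20, d=2):
    x = np.zeros(d)
    chain = np.empty((T, d))
    for t in range(T):
        v = rng.standard_normal(d)                  # resample momentum ~ N(0, I)
        H0 = E(x) + 0.5 * np.dot(v, v)              # Hamiltonian at the start
        x_new, v_new = leapfrog(x, v, eps, L)
        H1 = E(x_new) + 0.5 * np.dot(v_new, v_new)  # K is symmetric, so negating
        if np.log(rng.uniform()) < H0 - H1:         # v_new does not change H1;
            x = x_new                               # accept w.p. min(1, exp(H0 - H1))
        chain[t] = x
    return chain

chain = hmc()
```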
Problems we might solve with HMC:
- Sampling from an unnormalized posterior
- Estimating statistics from the simulated values of $x$
- The posterior predictive distribution of unobserved outcomes can be obtained by further simulation conditional on the drawn values of $x$
We use the $\hat R$ statistic to check convergence and how well the HMC chains mix.