Sunday, April 28, 2019

Gelman & Rubin convergence criteria for MCMC

Markov Chain Monte Carlo (MCMC) is a very popular method to explore the likelihood of a large parameter space, and often it is the only computationally feasible way to do so. However, one disadvantage of MCMC is that it is not obvious when the chain actually represents the likelihood, i.e. when it has converged. Many convergence criteria have been proposed, but none of them can guarantee convergence. In this blog post, I will explain the Gelman & Rubin convergence criterion, which is one of the most popular indicators of convergence, and present a Python implementation of it which can be used with the MCMC ensemble sampler emcee.

The Gelman & Rubin criterion consists of the following 4 steps:
  1. Run multiple MCMC chains and remove the first half of each chain.
  2. Calculate the mean and variance for each parameter in each chain.
  3. Take the mean value of each parameter from the different chains and determine the variance between the mean values of the different chains (between chain variance).
  4. Compare the between chain variance with the variance of each parameter within one chain. If the between chain and within chain variances are sufficiently close, the chains can be assumed to have converged. 
The basic concept is that the variance of each parameter determined within a chain should be equal to the variance between the different chains. The mean and variance of each parameter $\theta_j$ within a single chain $m$ of length $N$ are given by
\[
\begin{align}
\bar{\theta}_j^{(m)} &= \frac{1}{N}\sum^N_{i=1} \theta_{ij}\\
\sigma^2_{(m)}(\theta_j) &= \frac{1}{N-1}\sum^N_{i=1}(\theta_{ij} - \bar{\theta}_j^{(m)})^2
\end{align}
\]where $\theta_{ij}$ is the $i$-th sample of parameter $j$ in chain $m$. To calculate the Gelman & Rubin criterion we have to run multiple chains and calculate $\bar{\bar{\theta}}_j$ (the mean of the per-chain means) for each parameter $j$, as well as the mean within-chain variance $\bar{\sigma}^2(\theta_j)$ and the variance of the per-chain means $\sigma^2(\bar{\theta}_j)$:
\[
\begin{align}
\bar{\bar{\theta}}_j &= \frac{1}{N_{\rm chains}}\sum^{N_{\rm chains}}_{m=1}\bar{\theta}_j^{(m)}\\
\bar{\sigma}^2(\theta_j) &= \frac{1}{N_{\rm chains}}\sum^{N_{\rm chains}}_{m=1}\sigma^2_{(m)}(\theta_j)\\
\sigma^2(\bar{\theta}_j) &= \frac{1}{N_{\rm chains}-1}\sum^{N_{\rm chains}}_{m=1}(\bar{\theta}_j^{(m)} - \bar{\bar{\theta}}_j)^2.
\end{align}
\]The variance of the means, $\sigma^2(\bar{\theta}_j)$, will converge to zero for $N\rightarrow\infty$, but for any finite chain it will have a value which can be estimated as $\frac{1}{N}\bar{\sigma}^2(\theta_j)$. If the chains started with widely dispersed initial conditions, the term
\[
B_j = \sigma^2(\bar{\theta}_j) - \frac{1}{N}\bar{\sigma}^2(\theta_j)
\]will converge to zero from above. The Gelman & Rubin convergence criterion therefore defines so-called scale reduction parameters as
\[
\hat{R}_j = \sqrt{1 + \frac{B_j}{\bar{\sigma}^2(\theta_j)}},
\]which will converge to $1$ from above. Convergence is reached if these scale reduction parameters are close to 1 for each free parameter of the fit.
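
To make these definitions concrete, here is a small self-contained numpy sketch which computes the scale reduction parameters exactly as defined above, using made-up Gaussian toy chains instead of real MCMC output:
import numpy as np

# Toy example: 4 "chains" of length 1000 for 2 parameters, all drawn from the
# same Gaussian, so they are converged by construction and R should be close
# to 1. With real MCMC output, `chains` would hold your samples instead.
np.random.seed(42)
Nchains, N, ndim = 4, 1000, 2
chains = np.random.normal(size=(Nchains, N, ndim))

# Mean and variance of each parameter within each chain
mean_per_chain = chains.mean(axis=1)          # shape (Nchains, ndim)
var_per_chain = chains.var(axis=1, ddof=1)    # shape (Nchains, ndim)

# Mean of the chain means, mean within-chain variance and variance of the means
mean_of_means = mean_per_chain.mean(axis=0)
mean_within = var_per_chain.mean(axis=0)
var_of_means = np.sum((mean_per_chain - mean_of_means)**2, axis=0)/(Nchains - 1.)

# B_j and the scale reduction parameters R_j
B = var_of_means - mean_within/N
R = np.sqrt(1. + B/mean_within)
print(R)  # values close to 1 indicate convergence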

How close do the scale reduction parameters have to be to 1? There is no clear answer here, but a good idea is to plot these parameters as a function of chain length, which should show a monotonic convergence towards 1. Another criterion can be motivated by your use case: if you want to use the resulting likelihoods to distinguish between different models, keep in mind that a scale reduction parameter different from 1 represents an error on the likelihood, and you should make sure this error is small enough to still distinguish between the two models.
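Such a plot is easy to produce if you store the scale reduction parameters at every convergence check. A minimal matplotlib sketch could look like the following (the stored values below are made up purely for illustration; you would use the ones collected while running your own chain):
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical record of the scale reduction parameters: one row per
# convergence check, one column per free parameter (placeholder values only)
steps = np.array([10000, 20000, 30000, 40000])
R_history = np.array([[1.30, 1.25],
                      [1.12, 1.10],
                      [1.05, 1.04],
                      [1.02, 1.01]])

for j in range(R_history.shape[1]):
    plt.plot(steps, R_history[:, j], label="parameter %d" % j)
plt.axhline(1., color="k", linestyle="--")
plt.xlabel("chain length")
plt.ylabel("scale reduction parameter")
plt.legend()
plt.show()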
Given the equations above we can now implement this convergence criterion. Here I will present a Python implementation which can be used directly with the MCMC ensemble sampler emcee. We can run emcee with a few simple lines of code
import emcee
import numpy as np
import para  # parameter file containing all settings of the fit (see below)
rstate = np.random.get_state()
pos = set_initial_position()  # initial walker positions (defined further below)
# func is the log-probability function of the fit
sampler = emcee.EnsembleSampler(para.nwalkers, para.ndim, func)
for result in sampler.sample(pos, iterations=para.chainstep, rstate0=rstate):
    pos = result[0]     # current walker positions
    rstate = result[2]  # current state of the random number generator
Here I used a parameter file para, which contains all relevant parameters of the fit. In my case it currently looks like this
# MCMC parameters
Nchains = 8
ithin = 100
minlength = 40000 
ichaincheck = 1000 
nwalkers = 100 
epsilon = 0.04 
where nwalkers defines the number of walkers (emcee parallelizes the MCMC using multiple 'walkers' which talk to each other), ndim gives the number of free parameters and chainstep defines the length of the MCMC chain. Nchains sets the number of independent chains used for the convergence test, minlength is the minimum number of steps to run before the first convergence check, and ichaincheck and epsilon are explained further below.
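
One ingredient not shown above is the log-probability function func passed to EnsembleSampler. What it looks like depends entirely on your model and data, but as a minimal sketch (the straight-line model, the toy data and the flat prior range below are placeholders, not part of my actual fit) it could be something like
import numpy as np

# Placeholder data and prior range for a toy straight-line fit with two
# free parameters (slope and offset); replace with your actual model setup
x = np.linspace(0., 10., 20)
data = 2.5*x + 1.0           # toy "data", for illustration only
sigma = 0.5                  # assumed measurement error
prior_min = np.array([-10., -10.])
prior_max = np.array([10., 10.])

def func(theta):
    # Flat prior: reject points outside the allowed range
    if np.any(theta < prior_min) or np.any(theta > prior_max):
        return -np.inf
    # Gaussian log-likelihood for the straight-line model
    model = theta[0]*x + theta[1]
    return -0.5*np.sum((data - model)**2/sigma**2)
emcee only needs func to return the logarithm of the (unnormalized) posterior probability for a given parameter vector theta.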

An important parameter in the above code is pos, which defines the initial values for each parameter (and each walker in emcee). A possible choice for the function to get the initial position is
def set_initial_position():
    pos = []
    for walker in range(0, para.nwalkers):
        # uniform random scatter in [-0.5, 0.5) for each parameter
        ran_scatter = np.random.random_sample((para.ndim,)) - 0.5
        # shift the walker by up to +-1 expected standard deviation
        start_sig = 2.*ran_scatter*para.expected_sig
        start_pos = para.max_like_para + start_sig
        pos.append(start_pos)
    return pos
where para.max_like_para represents an array of maximum likelihood values for each parameter and para.expected_sig represents an array of expected standard deviations for each parameter. Of course, these arrays might not be available and hence it could be quite difficult to define the initial positions for the chain. Often a conservative guess is the best one can do.
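The example parameter file above does not contain these two arrays; if you use this initialization they would have to be added to para, for instance like this (the numbers are placeholders, not values from an actual fit):
import numpy as np
# Rough best-fit values and expected 1-sigma errors for each free parameter
max_like_para = np.array([2.5, 1.0])
expected_sig = np.array([0.5, 0.5])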

The code above produces a chain of fixed length, while we want to run the chain until the Gelman & Rubin criterion indicates convergence. Therefore we extend the above code to
withinchainvar = np.zeros((para.Nchains, para.ndim), dtype=float)
meanchain = np.zeros((para.Nchains, para.ndim), dtype=float)
scalereduction = np.ones(para.ndim, dtype=float)

chainstep = para.minlength
itercounter = para.minlength
loopcriteria = True
# Run chain until the chain has converged
while loopcriteria:
    for jj in range(0, para.Nchains):
        print("process chain %d" % jj)
        for result in sampler[jj].sample(pos[jj], 
                                         iterations=chainstep, 
                                         rstate0=rstate[jj]):
            pos[jj] = result[0]
            rstate[jj] = result[2]
        # Use the second half of the chain for the convergence test
        chain_length_per_walker = int(sampler[jj].chain.shape[1])
        chainsamples = sampler[jj].chain[:, int(chain_length_per_walker/2):, :]\
                                  .reshape((-1, para.ndim))
        chain_length = len(chainsamples)
        # Variance for each parameter within one chain
        withinchainvar[jj] = np.var(chainsamples, axis=0)
        # Mean for each parameter within one chain
        meanchain[jj] = np.mean(chainsamples, axis=0)
    
    scalereduction = gelman_rubin_convergence(withinchainvar,
                                              meanchain,
                                              chain_length,
                                              para.Nchains)
    loopcriteria = not np.all(np.absolute(1 - scalereduction) < para.epsilon)
    chainstep = para.ichaincheck
    itercounter += chainstep
where the gelman_rubin_convergence() function is defined as
def gelman_rubin_convergence(withinchainvar, meanchain, chain_length, N):
    # Calculate the scale reduction parameter for each free parameter,
    # given the within-chain variances and means of N chains
    meanall = np.mean(meanchain, axis=0)        # mean of the chain means
    mean_wcv = np.mean(withinchainvar, axis=0)  # mean within-chain variance
    vom = np.zeros_like(meanall)                # variance of the chain means
    for jj in range(0, N):
        vom += (meanall - meanchain[jj])**2/(N-1.)
    B = vom - mean_wcv/chain_length
    return np.sqrt(1. + B/mean_wcv)
where the two parameters ichaincheck and epsilon define the number of steps between successive evaluations of the scale reduction parameters, and the threshold within which all scale reduction parameters have to agree with 1 before the chain is considered converged.
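
Note that sampler, pos and rstate are now lists with one entry per chain, unlike in the single-chain snippet at the beginning of the post. The corresponding setup is not shown here; a minimal sketch of what it could look like (reusing func and set_initial_position() from above and giving every chain its own random number generator state) is
sampler = [emcee.EnsembleSampler(para.nwalkers, para.ndim, func)
           for _ in range(para.Nchains)]
pos = [set_initial_position() for _ in range(para.Nchains)]
# A separate, differently seeded random state for each chain
rstate = [np.random.RandomState(seed).get_state() for seed in range(para.Nchains)]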

The biggest danger with this convergence criterion is that one starts all chains with the same or very similar initial conditions, in which case the procedure described above might indicate convergence even for completely unconverged chains. Hence one generally should use multiple convergence criteria. Some additional signs of convergence are
  1. Any sub-segment of the chain gives a similar result to the entire chain.
  2. The acceptance rate is close to the typical target acceptance rate for the algorithm used, e.g. roughly 20-50% for random-walk Metropolis-Hastings, depending on the dimensionality.
  3. The correlation length of the chain is much smaller than the length of the chain (see the sketch after this list).
  4. Multiple chains initialized from different initial conditions give similar results.
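Regarding point 3, emcee can estimate the integrated autocorrelation time for you. A minimal sketch of such a check (for a single sampler as in the first snippet; the exact call and its behaviour depend on the emcee version, and it may complain if the chain is still too short) is
# Integrated autocorrelation time per parameter (number of steps between
# roughly independent samples); should be much smaller than the chain length
tau = sampler.get_autocorr_time()
chain_length_per_walker = sampler.chain.shape[1]
print("autocorrelation times:", tau)
print("chain length / max autocorrelation time: %.1f" % (chain_length_per_walker/np.max(tau)))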
Hope this was useful. If you have any questions or comments, please leave them below.
cheers
Florian