The Gelman & Rubin criterion consists of the following four steps:
- Run multiple MCMC chains and remove the first half of each chain.
- Calculate the mean and variance for each parameter in each chain.
- Take the mean of each parameter from the different chains and determine the variance between these chain means (the between-chain variance).
- Compare the between-chain variance with the variance of each parameter within one chain (the within-chain variance). If the two are sufficiently close, the chains can be assumed to have converged.
\[
\begin{align}
\bar{\theta}_j^{(k)} &= \frac{1}{N}\sum^N_{i=1} \theta_{ij}^{(k)}\\
\sigma^2_{(k)}(\theta_j) &= \frac{1}{N-1}\sum^N_{i=1}\left(\theta_{ij}^{(k)} - \bar{\theta}_j^{(k)}\right)^2
\end{align}
\]Here $i$ labels the sample, $j$ the parameter and $k$ the chain. To calculate the Gelman & Rubin criterion we have to run multiple chains and calculate $\bar{\bar{\theta}}_j$ (the mean of the chain means) for each parameter $j$, as well as the mean within-chain variance $\bar{\sigma}^2(\theta_j)$ and the variance of the chain means $\sigma^2(\bar{\theta}_j)$:
\[
\begin{align}
\bar{\bar{\theta}}_j &= \frac{1}{N_{\rm chains}}\sum^{N_{\rm chains}}_{k=1}\bar{\theta}_j^{(k)}\\
\bar{\sigma}^2(\theta_j) &= \frac{1}{N_{\rm chains}}\sum^{N_{\rm chains}}_{k=1}\sigma^2_{(k)}(\theta_j)\\
\sigma^2(\bar{\theta}_j) &= \frac{1}{N_{\rm chains}-1}\sum^{N_{\rm chains}}_{k=1}\left(\bar{\theta}_j^{(k)} - \bar{\bar{\theta}}_j\right)^2.
\end{align}
\]The variance of the means, $\sigma^2(\bar{\theta}_j)$, will converge to zero for $N\rightarrow\infty$, but for any finite chain it will have a value which can be estimated as $\frac{1}{N}\bar{\sigma}^2(\theta_j)$. If the chains started with widely dispersed initial conditions, the term
\[
B_j = \sigma^2(\bar{\theta}_j) - \frac{1}{N}\bar{\sigma}^2(\theta_j)
\]will converge to zero from above. The Gelman & Rubin convergence criterion therefore defines so-called scale reduction parameters as
\[
\hat{R}_j = \sqrt{1 + \frac{B_j}{\bar{\sigma}^2(\theta_j)}},
\]which will converge to $1$ from above. Convergence is reached if these scale reduction parameters are close to 1 for each free parameter of the fit.
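To make these definitions concrete, here is a minimal self-contained sketch (plain NumPy; the array layout and the helper name `scale_reduction` are my own choices, not part of the original code) that computes the scale reduction parameters for a few synthetic chains:

```python
import numpy as np

def scale_reduction(chains):
    """Scale reduction parameters R-hat, as defined above.

    chains: array of shape (N_chains, N_samples, ndim)
    """
    chains = np.asarray(chains, dtype=float)
    n = chains.shape[1]
    chain_means = chains.mean(axis=1)        # per-chain mean, shape (N_chains, ndim)
    chain_vars = chains.var(axis=1, ddof=1)  # per-chain variance
    mean_wcv = chain_vars.mean(axis=0)       # mean within-chain variance
    vom = chain_means.var(axis=0, ddof=1)    # variance of the chain means
    B = vom - mean_wcv/n
    return np.sqrt(1. + B/mean_wcv)

rng = np.random.default_rng(42)
# Four well-mixed chains sampling the same Gaussian: R-hat close to 1
good = rng.normal(0., 1., size=(4, 5000, 2))
# Chains stuck around different means: R-hat well above 1
offsets = np.array([0., 2., 4., 6.]).reshape(4, 1, 1)
bad = rng.normal(0., 1., size=(4, 5000, 2)) + offsets
print(scale_reduction(good))  # close to 1
print(scale_reduction(bad))   # far above 1
```

The second case mimics the danger discussed below: chains that have not mixed sit around different means, which inflates the variance of the chain means and hence R-hat.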
How close do the scale reduction parameters have to be to 1? There is no clear answer, but a good idea is to plot these parameters as a function of chain length, which should show monotonic convergence towards 1. Another criterion can be motivated by your use case: if you want to use the resulting likelihoods to distinguish between different models, a scale reduction parameter different from 1 translates into an error on those likelihoods, so you should make sure that this error is small enough to still distinguish between the two models.
Given the equations above we can now implement this convergence criterion. Here I will present a Python implementation which can be used directly with the MCMC ensemble sampler emcee. We can run emcee with a few simple lines of code:

```python
import emcee
import numpy as np

import para

rstate = np.random.get_state()
pos = set_initial_position()
# func is the log-probability function of the fit
sampler = emcee.EnsembleSampler(para.nwalkers, para.ndim, func)
for result in sampler.sample(pos, iterations=para.chainstep, rstate0=rstate):
    pos = result[0]
    rstate = result[2]
```

Here I used a parameter file para, which contains all relevant parameters of the fit. In my case it currently looks like this:

```python
# MCMC parameters
Nchains = 8
ithin = 100
minlength = 40000
ichaincheck = 1000
nwalkers = 100
epsilon = 0.04
```

where nwalkers defines the number of walkers (emcee parallelizes MCMC using multiple 'walkers' which talk to each other), ndim gives the number of free parameters and chainstep defines the length of the MCMC chain.

An important parameter in the above code is pos, which defines the initial values for each parameter (and each walker in emcee). A possible choice for the function to get the initial position is:

```python
def set_initial_position():
    pos = []
    for walker in range(para.nwalkers):
        # Random scatter in [-0.5, 0.5) for each parameter
        ran_scatter = np.random.random_sample((para.ndim,)) - 0.5
        # Scatter the starting point by up to one expected standard deviation
        start_sig = 2.*ran_scatter*para.expected_sig
        pos.append(para.max_like_para + start_sig)
    return pos
```

where para.max_like_para represents an array of maximum likelihood values for each parameter and para.expected_sig represents an array of expected standard deviations for each parameter. Of course, these arrays might not be available, and hence it can be quite difficult to define the initial positions for the chain. Often a conservative guess is the best one can do.

The code above produces a chain of fixed length, while we want to run the chain until the Gelman & Rubin criterion indicates convergence. Therefore we extend the above code to run para.Nchains chains, where sampler, pos and rstate are now lists with one entry per chain:

```python
withinchainvar = np.zeros((para.Nchains, para.ndim), dtype=float)
meanchain = np.zeros((para.Nchains, para.ndim), dtype=float)
scalereduction = np.ones(para.ndim, dtype=float)

chainstep = para.minlength
itercounter = para.minlength
loopcriteria = True
# Run until the chains have converged
while loopcriteria:
    for jj in range(para.Nchains):
        print("process chain %d" % jj)
        for result in sampler[jj].sample(pos[jj], iterations=chainstep,
                                         rstate0=rstate[jj]):
            pos[jj] = result[0]
            rstate[jj] = result[2]
        # Use the second half of the chain for the convergence test
        chain_length_per_walker = int(sampler[jj].chain.shape[1])
        chainsamples = sampler[jj].chain[:, int(chain_length_per_walker/2):, :]\
            .reshape((-1, para.ndim))
        chain_length = len(chainsamples)
        # Variance for each parameter within one chain
        withinchainvar[jj] = np.var(chainsamples, axis=0)
        # Mean for each parameter within one chain
        meanchain[jj] = np.mean(chainsamples, axis=0)
    scalereduction = gelman_rubin_convergence(withinchainvar, meanchain,
                                              chain_length, para.Nchains)
    loopcriteria = not all(np.absolute(1. - scalereduction) < para.epsilon)
    chainstep = para.ichaincheck
    itercounter += chainstep
```

where the gelman_rubin_convergence() function is defined as:

```python
def gelman_rubin_convergence(withinchainvar, meanchain, chain_length, N):
    ndim = meanchain.shape[1]
    # Mean of the chain means for each parameter
    meanall = np.mean(meanchain, axis=0)
    # Mean within-chain variance for each parameter
    mean_wcv = np.mean(withinchainvar, axis=0)
    # Variance of the chain means for each parameter
    vom = np.zeros(ndim, dtype=float)
    for jj in range(N):
        vom += (meanall - meanchain[jj])**2/(N - 1.)
    B = vom - mean_wcv/chain_length
    return np.sqrt(1. + B/mean_wcv)
```

where the two parameters ichaincheck and epsilon define the number of steps between convergence checks and the threshold for the scale reduction parameters, respectively.
The biggest danger with this convergence criterion is that one starts all chains with the same or very similar initial conditions, in which case the procedure described above can indicate convergence even for completely unconverged chains. Hence one should generally use multiple convergence criteria. Some additional signs of convergence are:
- Any sub-segment of the chain gives a similar result to the entire chain.
- The acceptance rate is close to the optimal acceptance rate for the algorithm used. E.g. for random-walk Metropolis-Hastings the optimum is around 23% for high-dimensional targets (rising to about 44% in one dimension).
- The correlation length of the chain is far smaller than the length of the chain.
- Multiple chains initialized from different initial conditions give similar results.
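For the point about the correlation length, the integrated autocorrelation time can be estimated directly from the chain (emcee also provides a get_autocorr_time() method for this). As a rough illustration of the idea, here is a plain-NumPy sketch on a toy AR(1) chain; the fixed summation window and the helper name autocorr_time are my own choices:

```python
import numpy as np

def autocorr_time(x, window=100):
    """Estimate the integrated autocorrelation time of a 1-d chain
    by summing the normalized autocorrelation up to a fixed window."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    var = np.mean(x*x)
    n = len(x)
    rho = np.array([np.mean(x[:n-t]*x[t:]) for t in range(1, window)]) / var
    return 1. + 2.*np.sum(rho)

rng = np.random.default_rng(0)
# AR(1) toy chain with coefficient a; its true integrated
# autocorrelation time is (1+a)/(1-a), i.e. 19 for a = 0.9
a = 0.9
x = np.empty(200000)
x[0] = 0.
for i in range(1, len(x)):
    x[i] = a*x[i-1] + rng.normal()
tau = autocorr_time(x)
print(tau)  # roughly 19 for this chain
```

If the estimated autocorrelation time is not far smaller than the chain length, the chain contains few independent samples and the within-chain variances entering the Gelman & Rubin test are unreliable.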
cheers
Florian