My statistics blog: 2019

The second law of thermodynamics states that the entropy or disorder of a system always increases over time. A hot cup of tea, for example, represents an ordered state, since the atoms in the teacup have significantly more energy than its surrounding. The second law tells us that the teacup will cool down, meaning the energy will be dispersed over its surroundings making the entire system less ordered.

After this law was established in the late 19th century, people started to apply it to the Universe as a whole and ask, why is the Universe in such an ordered state? The Universe mainly consists of stars, planets, and galaxies, separated by huge expanses of empty space. So the Universe represents a very ordered state. Given that the Universe was considered to be eternal the second law of thermodynamics would suggest that we should live in a very disordered Universe, which would not allow life to form.

There are two solutions to this problem. One version is that the Universe is not eternal and it started with very low entropy. Why the Universe started in a low entropy state is unclear and remains a key question in cosmology. The second solution makes use of the fact that the laws of thermodynamics are statistical laws and hence while the entropy should always increase over time, there is a small (but non-zero) probability that it decreases. One could use this to argue that the Universe just happened to randomly fluctuate to a low entropy state. For example, atoms could randomly assemble in a tiny region of space, which would give rise to the BigBang and create the Universe we observe today. Even though such a fluctuation is extremely unlikely, one can invoke the anthropic principle, which states that we can only live in parts of the eternal Universe which is subject to such an unlikely fluctuation.

Given that the probabilities for such thermal fluctuations decrease exponentially with the number of atoms involved, it is far more likely to have small fluctuations in small regions of the Universe compared to creating our observed Universe.
The Boltzmann brain is a thought experiment in which such a random fluctuation would form a fully functional brain without any life-supporting environment. It is vastly more likely that a thermal fluctuation creates a Boltzmann brain rather than an entire Universe which can harbor life, meaning that if life really is formed by thermal fluctuations, most of the life in the Universe consists of Boltzmann brains.

Boltzmann brains are linked to the measure problem in cosmology, where one often has to deal with infinities in time and space, which means that even small probabilities can become feasible given enough time or space. If such scenarios would result in Boltzmann brains as the most common life form, this is usually considered to be an indication that this is not a viable scenario. However, the question why we live in our Universe rather than as isolated brains floating in the Universe is not really solved yet (here is a recent paper on the subject https://arxiv.org/pdf/1702.00850.pdf).

Let me know if you have any question in the comment section below
best
Florian

A Monte Carlo Markov Chain (MCMC) is a very popular method to obtain the likelihood for a large parameter space and often it is the only computationally feasible way to obtain the likelihood. However, one disadvantage of MCMC is that it is not clear when the MCMC chain actually represents the likelihood, i.e. when it has converged. Many convergence criteria have been proposed, but none of them can guarantee convergence. In this blog post, I will explain the Gelman & Rubin convergence criteria, which is one of the most popular indicators of convergence. I will also present a Python implementation of it, which can be used for the Markov Chain Monte Carlo (MCMC) Ensemble sampler emcee.

The Gelman & Rubin criteria consist of the following 4 steps:

Run multiple MCMC chains and remove the first half of each chain.
Calculate the mean and variance for each parameter in each chain.
Take the mean value of each parameter from the different chains and determine the variance between the mean values of the different chains (between chain variance).
Compare the between chain variance with the variance of each parameter within one chain. If the between chain and within chain variances are sufficiently close, the chains can be assumed to have converged.

The basic concept is that the variance for each parameter determined within a chain should be equal to the variance between the different chains. The mean and variance for each parameter $\theta_j$ and a chain with length $N$ is given by
\[
\begin{align}
\bar{\theta}_j &= \frac{1}{N}\sum^N_{i=1} \theta_i\\
\sigma^2(\theta_j) &= \frac{1}{N-1}\sum^N_{i=1}(\theta_{ij} - \overline{\theta}_j)^2
\end{align}
\]To calculate the Gelman & Rubin criteria we have to run multiple chains and calculate $\bar{\bar{\theta}}_j$ (mean of means) for each parameter $j$ as well as the mean variance $\bar{\sigma}(\theta_j)$ and variance of the mean $\sigma(\bar{\theta}_j)$:
\[
\begin{align}
\bar{\bar{\theta}}_j &= \frac{1}{N_{\rm chains}}\sum^{N_{\rm chains}}_{j=1}\bar{\theta}_j\\
\bar{\sigma}^2(\theta_j) &= \frac{1}{N_{\rm chains}}\sum^{N_{\rm chains}}_{j=1}\sigma^2(\theta_j)\\
\sigma^2(\bar{\theta}_j) &= \frac{1}{N_{\rm chains}-1}\sum^{N_{\rm chains}}_{j=1}(\bar{\theta}_j - \bar{\bar{\theta}}_j)^2.
\end{align}
\]The variance of the means, $\sigma^2(\bar{\theta}_j)$, will converge to zero for $N\rightarrow\infty$, but for any finite chain it will have a value which can be estimated as $\frac{1}{N}\bar{\sigma}^2(\theta_j)$. If the chains started with widely dispersed initial conditions, the term
\[
B_j = \sigma^2(\bar{\theta}_j) - \frac{1}{N}\bar{\sigma}^2(\theta_j)
\]will converge to zero from above. The Gelman & Rubin convergence criteria therefore defines so-called scale reduction parameters as
\[
\hat{R}_j = \sqrt{1 + \frac{B_j}{\bar{\sigma}^2(\theta_j)}},
\]which will converge to $1$ from above. Convergence is reached if these scale reduction parameters are close to 1 for each free parameter of the fit.

How close do the scale reduction parameters have to be to 1? There is no clear answer here, but a good idea is to plot these parameters as a function of time, which should show monotonic convergence to 1. Another criterion can be motivated by your use case. If you want to use these likelihoods to distinguish between different models, you should consider that a scale reduction parameter different to 1 represents an error on this likelihood. So to distinguish between two likelihoods you should make sure that the error on the likelihood is sufficiently small to distinguish between the two.

Given the equations above we can now implement this convergence criterium. Here I will present a Python implementation which can directly be used within the MCMC Ensemble sampler emcee. We can run emcee with a few simple lines of code

import emcee
import numpy as np
import para
rstate = np.random.get_state()
pos = set_initial_position()
sampler = emcee.EnsembleSampler(para.nwalkers, para.ndim, func)
for result in sampler.sample(pos, iterations=para.chainstep, rstate0=rstate):
    pos = result[0]
    rstate = result[2]

Here I used a parameter file para, which contains all relevant parameters of the fit. In my case it currently looks like this

# MCMC parameters
Nchains = 8
ithin = 100
minlength = 40000 
ichaincheck = 1000 
nwalkers = 100 
epsilon = 0.04

where nwalkers defines the number of walkers (emcee allows to parallelize MCMC using multiple 'walkers' which talk to each other), ndim gives the number of free parameters and chainstep defines the length of the MCMC chain.

An important parameter in the above code is pos, which defines the initial values for each parameter (and each walker in emcee). A possible choice for the function to get the initial position is

def set_initial_position():
    pos = []
    for walker in range(0,para.nwalkers): 
        ran_scatter = np.random.random_sample((para.ndim,))-1.)
        start_sig = 2.*ran_scatter*para.expected_sig
        start_pos = para.max_like_para + start_sig
        pos.append(start_pos)
    return pos

where para.max_like_para represents an array of maximum likelihood values for each parameter and para.expected_sig represents an array of expected standard deviations for each parameter. Of course, these arrays might not be available and hence it could be quite difficult to define the initial positions for the chain. Often a conservative guess is the best one can do.

The code above produces a chain of fixed length, while we want to run the chain until the Gelman & Rubin criterium indicates convergence. Therefore we extend the above code to

withinchainvar = np.zeros((para.Nchains, para.ndim), dtype=np.float)
meanchain = np.zeros((para.Nchains, para.ndim), dtype=np.float)
scalereduction = np.array(para.ndim, dtype=np.float)

chainstep = para.minlength
itercounter = para.minlength
loopcriteria = True
# Run chain until the chain has converged
while loopcriteria:
    for jj in range(0, para.Nchains):
        print("process chain %d" % jj)
        for result in sampler[jj].sample(pos[jj], 
                                         iterations=chainstep, 
                                         rstate0=rstate[jj]):
            pos[jj] = result[0]
            rstate[jj] = result[2]
            # Use the second half of the chain for convergence test
        chain_length_per_walker = int(sampler[jj].chain.shape[1])
        chainsamples = sampler[jj].chain[:, int(chain_length_per_walker/2):, :]\
                                  .reshape((-1, para.ndim))
        chain_length = len(chainsamples)
        # Variance for each parameter within one chain
        withinchainvar[jj] = np.var(chainsamples, axis=0)
        # Mean for each parameter within one chain
        meanchain[jj] = np.mean(chainsamples, axis=0)
    
    R = gelman_rubin_convergence(withinchainvar,
                                 meanchain, 
                                 chain_length, 
                                 para.Nchains)
    loopcriteria = not all(np.absolute(1 - R) < para.epsilon)
    chainstep = para.ichaincheck
    itercounter += chainstep

where the gelman_rubin_convergence() function is defined as

def gelman_rubin_convergence(withinchainvar, meanchain, chain_length, N):    
    meanall = np.mean(meanchain, axis=0)
    mean_wcv = np.mean(withinchainvar, axis=0)
    vom = np.zeros(ndim, dtype=np.float)
    for jj in range(0, N):
        vom += (meanall - meanchain[jj])**2/(N-1.)
    B = vom - mean_wcv/chain_length
    return np.sqrt(1. + B/mean_wcv)

where the two parameters ichaincheck and epsilon define the number of steps between which the scale reduction parameters are calculated, and the threshold for the scale reduction parameters, respectively.

The biggest danger with this convergence criteria is that one starts all chains with the same or very similar initial conditions, in which case the procedure described above might indicate convergence even for completely unconverged chains. Hence one generally should use multiple convergence criteria. Some additional signs of convergence are

Any sub-segment of the chain gives a similar result to the entire chain.
The acceptance rate is close to the equilibrium acceptance rate for the algorithm used. E.g. for Metropolis-Hastings it is around 50%.
The correlation length of the chain is far smaller than the length of the chain.
Multiple chains initialized from different initial conditions give similar results.

Hope this was useful. If you have any questions or comments, please leave them below.
cheers
Florian

My statistics blog

Monday, July 22, 2019

What is a Boltzmann brain?

Sunday, April 28, 2019

Gelman & Rubin convergence criteria for MCMC