My statistics blog: October 2017

Thursday, October 12, 2017

Study the Universe with Python tutorial, part 5 -- Monte Carlo Markov Chain

This is the fifth blog post in this series which discusses the Baryon Oscillation Spectroscopic dataset (BOSS). In the last 4 posts, we downloaded the data, calculated the power spectrum and covariance matrix and isolated the BAO feature. Now we want to apply this procedure to the data and extract cosmological information.

In the last post, we used the scipy minimize function to find the best-fitting model. However, the best-fitting model is not very useful if we don't know the uncertainties attached to it.

What we really need is a probability for each possible model which could describe our Universe. The probability for model $Y$ with parameters x should reflect the probability that we would observe the BOSS dataset if the Universe were described by $Y(x)$. Such a probability distribution is called a likelihood and a very powerful way to get such a likelihood is a Monte Carlo Markov Chain (MCMC).

What is an MCMC? An MCMC is a chain of parameter sets which has the nice property that the appearance of each parameter in the chain also represents its likelihood distribution. This means one can take the chain, produce a histogram of parameter x which directly provides the (marginalised) likelihood for parameter x (marginalised just means that you integrate over the likelihood of all other parameters).

There are many algorithms which can produce such a chains, but the most basic and well-known is the Metropolis-Hastings algorithm. Here we will just focus on the Python implementation of such a procedure, but if you have never encountered MCMC it is certainly worth learning about it.

We will use the MCMC emcee package which can be installed with
>> pip install emcee
and included with

import emcee

We can call this MCMC sampler like

nwalkers = 20
dim = len(start)
expected_error = [0.1, 1., 1., 1., 1., 1., 0.05, 0.1]
sampler = emcee.EnsembleSampler(nwalkers=nwalkers, dim=dim, lnpostfn=get_loglike,\
 args=(boss_data, { 'noBAO': Pk_without_BAO, 'os_model': os_model }, get_shifted_model))
start_positions = [(result['x'] + (2.*np.random.random_sample((dim,)) - 1)*expected_error) for i in range(nwalkers)]
sampler.run_mcmc(start_positions, 1000)

Here we have to specify the number of walkers (nwalkers) since this MCMC package will produce several MCMC chains in parallel (each walker produces one chain). The separate chains produced by each walker are not really independent, since they communicate with each other (Goodman & Weare, 2010). We also need to specify the starting values for the different walkers. To do that we use the results from the minimisation procedure discussed in the last post and add a random variation around it. Different to our minimisation function in the last blog post, this algorithm considers high values a good match. Therefore we have to write a small function to turn our $\chi^2$ into a log-likelihood, which is defined as

def get_loglike(parameters, data, templates, func):
    return -0.5*calc_chi2(parameters, data, templates, func)

The run_mcmc() method produces a chain with 1000 rows times the number of walkers.

The problem with MCMC is that it requires a certain time to converge. This means the number of appearances of a parameter only reflects the likelihood if the chain has converged. Determining the convergence is very tricky, and there is no procedure which is guaranteed to detect the correct convergence. Here we will use the Gelman & Rubin criterium.

The Gelman & Rubin criterium compares several independent chains and tries to determine whether the variances within these chains are similar. Here we will not discuss the details of the implementation, but you can read more about it in this interesting blog post.

To use the Gelman & Rubin criterium we need to use several MCMC chains and compare them. The chains are independent, thus usually ran in parallel to save time. This means we have to re-write the MCMC procedure

def runMCMC(start, data, templates):
    ''' Perform MCMC '''
    dim = len(start)
    Nchains = 4
    nwalkers = 20
    ichaincheck = 400
    minlength = 2000
    epsilon = 0.01

    labels = ['b', 'A1', 'A2', 'A3', 'A4', 'A5', 'alpha', 'sigmaNL']
    expected_error = [0.1, 1., 1., 1., 1., 1., 0.05, 0.1]
    # Set up the sampler.
    pos=[]
    list_of_samplers=[]
    for jj in range(0, Nchains):
        pos.append([start + (2.*np.random.random_sample((dim,)) - 1.)*expected_error for i in range(nwalkers)])
        list_of_samplers.append(emcee.EnsembleSampler(nwalkers=nwalkers, dim=dim, lnpostfn=get_loglike,\
         args=(data, templates, get_shifted_model)))

    # Start MCMC
    print("Running MCMC... ")
    within_chain_var = np.zeros((Nchains, dim))
    mean_chain = np.zeros((Nchains, dim))
    scalereduction = np.arange(dim, dtype=np.float)
    scalereduction.fill(2.)

    itercounter = 0
    chainstep = minlength
    while any(abs(1. - scalereduction) > epsilon):
        itercounter += chainstep
        for jj in range(0, Nchains):
            for result in list_of_samplers[jj].sample(pos[jj], iterations=chainstep, storechain=True):
                pos[jj] = result[0]
            # we do the convergence test on the second half of the current chain (itercounter/2)
            within_chain_var[jj], mean_chain[jj] = prep_gelman_rubin(list_of_samplers[jj])
        scalereduction = gelman_rubin_convergence(within_chain_var, mean_chain, int(itercounter/2))
        print("scalereduction = ", scalereduction)
        chainstep = ichaincheck
    # Investigate the chain and print out some metrics
    inspect_chain(list_of_samplers, labels)
    return

This function basically does the same thing as the function we had before, but it now runs within a loop, which is only broken when the Gelman & Rubin parameters in the scalereduction array are within a certain range of unity.

The inspect_chain() function provides several diagnostics, with which we can further test whether the chain has actually converged. For example, we can print out the timeline of the sampling range for each parameter

fig, axes = plt.subplots(dim, 1, sharex=True, figsize=(8, 9))
for i in range(0, dim):
    for jj in range(0, Nchains):
        axes[i].plot(list_of_samplers[jj].chain[:, :, i].T, alpha=0.4)
    #axes[i].yaxis.set_major_locator(MaxNLocator(5))
    axes[i].set_ylabel(labels[i])
fig.tight_layout(h_pad=0.0)
fig.savefig("%stime_series.png" % output_path)

which gives

Figure 1: Variance in the different parameters. The convergence

has been reached if the variance is constant with time. In the early

stages of the chain the variance changes, indicating that the

convergence has not been reached.

This figure shows that the sampling range varies in the beginning, which indicates that the chain has not converged yet. The second half of the chain has an almost constant variance which is a good indication of convergence.

If we convince ourselves that the chain has indeed converged we can plot the likelihood. Here we use corner.py which can be installed with
>> pip install corner
and included with

import corner

We can then feed our chain into this plotting package as

fig = corner.corner(mergedsamples, quantiles=[0.16, 0.5, 0.84], plot_density=False,\
    show_titles=True, title_fmt=".3f", labels=labels)
fig.savefig("%scorner.png" % output_path)

which gives

Figure 2: 2D likelihood distributions for the 7 parameters in our fit.

The 1D marginalised likelihood is given on the right together with the

$68\%$ confidence levels.

Finally, we can also print out the constraints with

def get_percentiles(chain, labels):
    ''' Calculate constraints and uncertainties from MCMC chain '''
    per = np.percentile(chain, [50., 15.86555, 84.13445, 2.2775, 97.7225], axis=0)
    per = [per[0], per[0]-per[1], per[2]-per[0], per[0]-per[3], per[4]-per[0]]
    for i in range(0, len(per[0])):
        print("%s = %f +%f -%f +%f -%f" % (labels[i], per[0][i], per[1][i], per[2][i], per[3][i], per[4][i]))
    return per

and with the BOSS data, we should get something like

>> b = 2.295306 +0.314054 -0.312351 +0.584276 -0.610636

A1 = -0.151418 +0.111546 -0.111835 +0.223292 -0.232080

A2 = 41.676085 +16.889086 -17.505383 +30.571086 -37.981033

A3 = -3306.605503 +1642.375097 -1467.782597 +3487.705208 -2538.709703

A4 = 14835.170765 +5912.070481 -6608.236429 +10122.305967 -14237.721237

A5 = -21141.535952 +9150.798840 -8221.692166 +20105.713945 -14085.031087

alpha = 0.990088 +0.032634 -0.027241 +0.077669 -0.059837

sigmaNL = 11.378695 +2.975023 -4.083891 +5.036540 -7.682274

Now what does that mean in terms of cosmological parameters? Well as mentioned before, the only parameter of interest here is the scaling parameter $\alpha$. This parameter is defined as
\[
\alpha = \frac{D_V}{D^{\rm fid}_V}
\]where the the distance $D_V$ is defined as
\[
D_V = \left[(1+z)^2D^2_A\frac{cz}{H(z)}\right]^{1/3}
\]The superscribed $^{\rm fid}$ just means that this is the same quantity in the fiducial cosmology, which is the one we defined in the cosmo class in nbodykit.

The distance $D_V(z)$ is an average of the angular diameter distance $D_A(z)$ and the Hubble parameter $H(z)$
\[
\begin{align}
D_A(z) &= \frac{c}{1+z}\int^z_0\frac{dz'}{H(z')}\\
H(z) &= H_0\left[(\Omega_cdm+\Omega_b)(1+z)^3 + \Omega_{\Lambda}\right]^{1/2}
\end{align}
\]With this we can now turn the constraint on $\alpha$ into a constraint on $D_V$

Hz = cosmo.efunc(zmean)*cosmo.H0*cosmo.h
DC = (1.+zmean)*cosmo.angular_diameter_distance(zmean)/cosmo.h
c_km_s = (speed_of_light/1000.)
DVfid = ( DC**2*(zmean*c_km_s/Hz) )**(1./3.)
DV = per[:,6]*DVfid
print("DV = %f +%f -%f +%f -%f\n" % (DV[0], DV[1], DV[2], DV[3], DV[4]))

Which should show something like
>> DV = 2152.179924 +69.092697 -57.551602 +164.343396 -121.202709
The next step would be to combine this constraint with measurements from the Cosmic Microwave Background (CMB), to get constraints on actual cosmological parameters. But for now, I will leave it here.

Thanks a lot for making it to the end of this series. If you are interested in going further, you should consider a few extensions to this analysis:

Have a look at density field reconstruction
Consider an anisotropic analysis to constrain $D_A(z)$ and $H(z)$ independently
Do you think the polynomial model we use is the best choice? Can you come up with a better model?
Use the simulated BOSS catalogues to test whether our procedure is robust

The final code for this analysis is available on GitHub.

cheers

Florian

Study the Universe with Python tutorial, part 4 -- minimizing $\chi^2$

This is the fourth blog post about how to use the Baryon Oscillation Spectroscopic (BOSS) dataset to constrain cosmology. In the first three posts, we downloaded the data, calculated the power spectrum and obtained a covariance matrix.

The power spectrum we measured carries a huge amount of information about cosmological parameters, but most of that information cannot be accessed easily. The reason is that we cannot calculate reliable models for the power spectrum, to compare to the data. In a previous post, we calculated models using the class code, but these models are only accurate on very large scales (small wavenumbers).

For that reason, we will focus here on one particular signal, which can be described with a very simple model. This signal is called Baryon Acoustic Oscillations (BAO). BAO in the galaxy distribution were detected for the first time in 2005 and since that first detection, BAO have become one of the most powerful cosmological probes. In fact, the BAO measurement is one of the main reasons why the BOSS dataset has been assembled.

The BAO signal shows up as an oscillation signal on top of the power spectrum. In a previous plot, we saw model power spectra for different amounts of dark matter

Figure 1: Model power spectrum for different amounts
of dark matter.

You can see the BAO signal in both power spectra as an oscillation. This is the signal we are after. The BAO signal is much more robust (easier to model) compared to the rest of the power spectrum and therefore we will use a model for the power spectrum where we get rid of all but the oscillation signal.

We can do that by fitting a power spectrum without a BAO signal to a power spectrum with BAO signal. Fitting just means that we test many different power spectra and pick the one which matches the best.

The nbodykit package provides power spectra without BAO signal, all that we have to do is call the LinearPower() function with the transfer parameter 'NoWiggleEisensteinHu'.

Pk = nbodykit.lab.cosmology.power.linear.LinearPower(cosmo, redshift=0, transfer='NoWiggleEisensteinHu')

And if we plot such a power spectrum we get

Figure 2: Same power spectra as in Figure 1, but calculated
with the transfer parameter 'NoWiggleEisensteinHu', which
calculates the power spectrum without BAO signal. You can see
that the BAO wiggles have been removed.

Comparing the power spectra in Figure 1 and Figure 2 shows that these power spectra are the same except for the BAO wiggles, which have been removed.

Our model uses the noBAO power spectrum together with 5 additional polynomial terms. These parameters have no physical motivation, but their whole purpose is to make this fit independent of the broadband and focus on the oscillation signature.
\[
P^{model}(k) = p^2_0P^{noBAO}(k) + \frac{p_1}{k^3} + \frac{p_2}{k^2}+ \frac{p_3}{k} + p_4 + p_5k.
\]
and in Python

def get_smooth_model(parameters, x, templates):
    ''' Combine a noBAO model with polynomials and linear bias '''
    polynomials = parameters[1]/x**3 + parameters[2]/x**2 + parameters[3]/x + parameters[4] + parameters[5]*x
    return parameters[0]*parameters[0]*templates['noBAO'](x) + polynomials

With this, we can now extract the oscillation signature from any power spectrum model

def get_oscillation(krange, Pk_class, Pk_without_BAO):
    ''' Get an oscillation only power spectrum '''
    cov_inv = np.identity(len(krange))
    start = [1., 0., 0., 0., 0., 0.]
    result = op.minimize(calc_chi2, start, args=( [{ 'k': krange, 'pk': Pk_class(krange), 'cov_inv': cov_inv }],\
     { 'noBAO': Pk_without_BAO }, get_smooth_model))
    yolin = Pk_class(krange)/get_smooth_model(result["x"], krange, { 'noBAO': Pk_without_BAO })
    return interp.interp1d(krange, yolin)

Here we use the minimize() function from the scipy.optimize package

import scipy.optimize as op

We also need to define a metric, meaning we need to specify how we distinguish good models from bad models. Here we use the $\chi^2$, which we can calculate as

def calc_chi2(parameters, data, templates, func):
    ''' Compares the model with the data '''
    if within_priors(parameters):
        chi2 = 0.
        # Loop over all datasets which are fit together
        for dataset in data:
            model = func(parameters, dataset['k'], templates)
            diff = (model - dataset['pk'])
            chi2 += np.dot(diff,np.dot(dataset['cov_inv'],diff))
        return chi2
    else:
        return 100000.

The within_priors() function tests whether the parameters are within a certain range (prior), otherwise it returns a very large $\chi^2$.

Here is a potential output of the get_oscillation() function

Figure 3: The galaxy power spectrum divided by the best fitting
no-BAO power spectrum. With this technique, we can isolate the
BAO feature.

Our procedure removed the entire broadband power spectrum and isolated the oscillation signature. Before we apply this to the data, we have to extend the model

def get_shifted_model(parameters, x, templates):
    ''' Calculate a model including a scaling parameter '''
    model = get_smooth_model(parameters, x, templates)
    return model*(1. + (templates['os_model'](x/parameters[6]) - 1.)*np.exp(-0.5*x**2*parameters[7]**2))

Here we get the smooth model we discussed above, but modify it by introducing a smoothing scale (parameters[7]) and a scaling parameter (parameters[6]). The scaling parameter is the parameter we are actually interested in, since it carries the wavelength of the oscillation signature and is the only parameter here which we will exploit for cosmological constraints.

Now let's apply this to the data.

Pk_without_BAO = nbodykit.lab.cosmology.power.linear.LinearPower(cosmo, redshift=0, transfer='NoWiggleEisensteinHu')
Pk_class = nbodykit.lab.cosmology.power.linear.LinearPower(cosmo, redshift=0, transfer='CLASS')
krange = np.arange(0.001, 0.5, 0.001)
os_model = get_oscillation(krange, Pk_class, Pk_without_BAO)
start = [2.37, -0.076, 38., -3547., 15760., -22622., 1., 9.41]
result = op.minimize(calc_chi2, start, args=(boss_data, { 'noBAO': Pk_without_BAO, 'os_model': os_model },\
get_shifted_model))
print("result['x'] = ", result['x'])

And with this, you should get something like this

Figure 4: BOSS measurements (data points) together with the

best fitting model.

And the best fitting parameters are
>> result['x'] = [ 2.35228409e+00 -1.40192579e-01 4.30619066e+01 -3.57546008e+03

1.58788229e+04 -2.25314741e+04 9.96151333e-01 9.47806970e+00]

The problem is that we have no uncertainties attached to these parameters; we only have the best fitting values. Without uncertainties, we cannot use this to obtain cosmological constraints. So the next blog post will extend this by not just finding the best fitting parameters, but also getting the uncertainties.

The code for this project is available on GitHub. Let me know if you have any questions/comments below.
cheers
Florian

Wednesday, October 11, 2017

Study the Universe with Python tutorial, part 3 -- covariance matrix

In the last two blog posts, we discussed how to download the galaxy catalogues of the Baryon Oscillation Spectroscopic Survey (BOSS) and calculate the power spectrum of this dataset. In this blog post, we will calculate uncertainties on the power spectrum measurement.

Uncertainties can be obtained in many different ways. Here we will use simulations of the BOSS dataset. Members of the BOSS collaboration have used supercomputers to simulate what a Universe could look like and they have generated thousands of such simulated BOSS datasets. All we have to do is to calculate the power spectrum of these thousands of simulations and measure the variance between them.

We don't just calculate the uncertainty; we also calculate the correlation between different data points. All the information we need is contained in the covariance matrix, which is defined as
\[C(k_i,k_j) = \begin{pmatrix} \sigma^2_{k_1} & cor(k_1,k_2)\sigma_{k_1}\sigma_{k_2} & \dots & cor(k_1,k_n)\sigma_{k_1}\sigma_{k_n} \\
cor(k_2,k_1)\sigma_{k_2}\sigma_{k_1} & \sigma^2_{k_2} & \dots & cor(k_2,k_n)\sigma_{k_2}\sigma_{k_n} \\
\vdots & \vdots & \ddots & \vdots \\ cor(k_n,k_1)\sigma_{k_n}\sigma_{k_1} & cor(k_n,k_2)\sigma_{k_n}\sigma_{k_2} & \dots & \sigma^2_{k_n} \end{pmatrix}\]Here $\sigma_{k_i}$ is the standard deviation of the power spectrum $P(k_i)$ at wavelength $k_i$. The off-diagonal elements of this matrix contain the uncertainties on a data point contributed by neighbouring data points, which are proportional to the correlation $cor(k_i,k_j)$ between the two data points.

First, we download the simulated BOSS catalogues together with the corresponding random catalogue. Since there are so many of them, this is quite a large file

wget -N https://data.sdss.org/sas/dr12/boss/lss/dr12_multidark_patchy_mocks/Patchy-Mocks-DR12SGC-COMPSAM_V6C.tar.gz -P path/to/folder/
wget -N https://data.sdss.org/sas/dr12/boss/lss/dr12_multidark_patchy_mocks/Patchy-Mocks-DR12NGC-COMPSAM_V6C.tar.gz -P path/to/folder/
wget -N https://data.sdss.org/sas/dr12/boss/lss/dr12_multidark_patchy_mocks/Patchy-Mocks-Randoms-DR12SGC-COMPSAM_V6C_x50.tar.gz -P path/to/folder/
wget -N https://data.sdss.org/sas/dr12/boss/lss/dr12_multidark_patchy_mocks/Patchy-Mocks-Randoms-DR12NGC-COMPSAM_V6C_x50.tar.gz -P path/to/folder/

All 2048 data catalogues can be paired with one random catalogue to measure the power spectrum of the simulated catalogues.

Given that the simulated catalogues are in CSV format (rather than the data catalogues in FITS format), we have to write a new read function. using the CSVCatalog() function from nbodykit

def read_sim_data(filename):
    ''' Read the random simulation catalogues from CSV '''
    sim_cat = CSVCatalog(os.path.join(base_dir, filename),\
        names=['RA', 'DEC', 'Z', 'dummy1', 'NZ', 'dummy2', 'veto', 'Weight'])
    print('sim_cat.columns = ', sim_cat.columns)
    sim_cat = sim_cat[(sim_cat['Z'] > zmin) & (sim_cat['Z'] < zmax)]
    sim_cat = sim_cat[(sim_cat['veto'] > 0)]
    sim_cat['WEIGHT_FKP'] = 1./(1. + 10000.*sim_cat['NZ']);
    return sim_cat

def read_sim_ran(filename):
    ''' Read the simulation catalogues from CSV '''
    ran_cat = CSVCatalog(os.path.join(base_dir, filename),\
        names=['RA', 'DEC', 'Z', 'NZ', 'dummy1', 'veto', 'Weight'])
    print('ran_cat.columns = ', ran_cat.columns)
    ran_cat = ran_cat[(ran_cat['Z'] > zmin) & (ran_cat['Z'] < zmax)]
    ran_cat = ran_cat[(ran_cat['veto'] > 0)]
    ran_cat['WEIGHT_FKP'] = 1./(1. + 10000.*ran_cat['NZ']);
    return ran_cat

For both catalogues, we have to ensure that the veto column is non zero and we have to recalculate the FKP weights.

With this, we can now read in all the catalogues, calculate the power spectra and from that, we can derive a covariance matrix. We define a process_sims() function like

def process_sims(cosmo, tag, N):
    ''' Calculate the covariance matrix using the simulated BOSS datasets '''
    # Read the random catalogue which can be paired with all of the simulated catalogues
    filename = 'Patchy-Mocks-Randoms-DR12%s-COMPSAM_V6C_x50.dat' % tag
    random = read_sim_ran(filename)

    for i in range(1, N):
        pk_file = output_path + "/sims/pk_%s_%d.pickle" % (tag, i)
        print('calculating %s... ' % pk_file)
        # Only calculate power spectra which we don't have on disk
        if not os.path.isfile(pk_file):
            filename = 'Patchy-Mocks-DR12%s-COMPSAM_V6C/Patchy-Mocks-DR12%s-COMPSAM_V6C_%0.4d.dat' % (tag, tag, i)
            sim = read_sim_data(filename)
            pk = calc_pk(cosmo, sim, random)
            # store the power spectra 
            pickle.dump( pk, open( pk_file, "wb" ) )
    return calc_cov(tag, N)

This function gets a cosmological model as input as well as a string tag and an integer $N$. The tag is either 'SGC' or 'NGC' and determines whether we process the simulated catalogues for the northern part or the southern part. The integer $N$ determines the number of catalogues we process.

Since the calculation of all 2048 power spectra can take many hours, I use pickle to write each power spectrum to disk. The function then loops from $1$ to $N$, tests whether the power spectrum has already been calculated, and if not it will calculate the power spectrum using the same function we defined in the last blog post.

Finally, we calculate the covariance matrix using the calc_cov() function, which simply reads the power spectra, subtracts the shot noise, limits the range to $< k_{max} = 0.3h/$Mpc and makes use of the numpy.cov() function.

def calc_cov(tag, N):
    ''' Read simulation power spectra and return covariance matrix '''
    list_of_pks = []
    for i in range(1, N):
        pk_file = output_path + "/sims/pk_%s_%d.pickle" % (tag, i)
        pk = pickle.load( open( pk_file, "rb" ) )
        P = pk['power_0'].real - pk.attrs['shotnoise']
        # Limit the k range
        P = P[(pk['k'] < kmax)]
        list_of_pks.append(P)
    return np.cov(np.vstack(list_of_pks).T)

We can turn the covariance matrix into a correlation matrix
\[
R = \frac{C_{ij}}{\sqrt{C_{ij}C_{ji}}}
\] and plot it as

from statsmodels.stats.moment_helpers import cov2corr

def plot_cov(matrix):
    ''' Plot the correlation matrix derived from the covariance matrix '''
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    plt.imshow(cov2corr(matrix))
    plt.colorbar()
    plt.show()
    return

which should look like this

Figure 1: Correlation matrix, derived from the covariance

matrix of the northern part of BOSS using 1000 simulated

catalogues.

Figure 1 shows that there is some correlation between the power spectrum bins close to each other, while some bins which are far apart are anti-correlated.

Note that calculating 2048 power spectra will probably take a while and you might want to make use of the MPI capabilities of nbodykit. MPI stands for Massage Passing Interface and allows to run jobs in parallel. If you have 4 cores in your machine, you can theoretically calculate 4 power spectra in parallel.

To do this we have to rewrite the process_sims() function to

def process_sims_with_MPI(cosmo, tag, N):
    ''' Calculate the covariance matrix using the simulated BOSS datasets '''
    # this splits the communicator into chunks of size of roughly N
    with nbodykit.lab.TaskManager(1, use_all_cpus=True) as tm:
        # Read the random catalogue which can be paired with
        # all of the simulated catalogues
        filename = 'Patchy-Mocks-Randoms-DR12%s-COMPSAM_V6C_x50.dat' % tag
        random = read_sim_ran(filename)
        # loop over each box in parallel
        for i in tm.iterate(range(1, N)):
            pk_file = output_path + "/sims/pk_%s_%d.pickle" % (tag, i)
            print('calculating %s... ' % pk_file)
            # Only calculate power spectra which we don't have on disk
            if not os.path.isfile(pk_file):
                filename = ("Patchy-Mocks-DR12%s-COMPSAM_V6C/Patchy-"
                            Mocks-DR12%s-COMPSAM_V6C_%0.4d.dat" % (tag, tag, i))
                sim = read_sim_data(filename)
                pk = calc_pk(cosmo, sim, random)
                # store the power spectra 
                pickle.dump( pk, open( pk_file, "wb" ) )
    return calc_cov(tag, N)

However, using MPI to do 4 calculations in parallel will also take 4 times as much memory and most likely you will be limited by memory. If you have access to a supercomputer, you can just use many nodes, but if not, you can download 1000 power spectra here.

The diagonal elements of the covariance matrix represent the variance in the power spectrum if the different data points would not be correlated
\[
\sigma_{P(k_i)} = \sqrt{C(k_i,k_i)}.
\]
Using these uncertainties we can plot the BOSS measurements with error bars

Figure 2: The BOSS power spectrum measurements including
the diagonal elements of the covariance matrix as error bars.

With this, we now have the power spectra and covariance matrix, which is all we need to start using this dataset to constrain cosmological parameters. We will start with that in the next blog post.
You can find the code for this project on GitHub. Let me know if you have any questions/comments below.
cheers
Florian

Monday, October 9, 2017

Study the Universe with Python tutorial, part 2 -- power spectrum

In the first part of this series we discussed how to download the galaxy catalogue of the Baryon Oscillation Spectroscopic Survey (BOSS). We also made some plots to show the distribution of galaxies. In this blog post we will calculate a summary statistic of the dataset which is the first step to use the data to constrain cosmological parameters, like dark matter and dark energy.

As we saw in the last blog post, the BOSS dataset contains about 1 million galaxies and their distribution in the Universe. The position of these galaxies is what carries the cosmological information. Imagine that there would be more dark matter in the Universe. The additional matter would have additional gravitational force which would clump together the material. On the other hand with less dark matter, the galaxies would be more spread out.

Figure 2: Distribution of galaxies in a
Universe with 30% of its total energy
in the form of dark matter.

Figure 1: Distribution of galaxies in a
Universe with 10% of its total energy
in the form of dark matter.

Even though the two Figures above look very similar you can see that the distributed points are more spread out on the left compared to the right Figure. The only difference is that on the left we have 10% of all the energy in the Universe in the form of dark matter ($\Omega_{cdm} = 0.1$), while in the Figure on the right we have 30% of all the energy in the form of dark matter ($\Omega_{cdm} = 0.3$).

We can now compare these distributions with our data from the BOSS survey and depending on which of the two distributions looks more like the data, we can determine how much dark matter there is in the Universe.

In practice we do not want to compare the actual distribution of galaxies, but instead we compare a summary statistic, meaning a measurement from these galaxies, which carries all the information we need. There are many choices for such a summary statistic, but here we will use the power spectrum.

To calculate the power spectrum we first need to download two more catalogues

wget -N https://data.sdss.org/sas/dr12/boss/lss/random0_DR12v5_CMASSLOWZTOT_North.fits.gz -P path/to/folder/
wget -N https://data.sdss.org/sas/dr12/boss/lss/random0_DR12v5_CMASSLOWZTOT_South.fits.gz -P path/to/folder/

which contain a random distribution of points. These distributions are needed to calibrate the data catalogues.

To use these catalogues we write a quick read in function which should look like this

def read_ran(filename):
    ''' Read the random catalogues '''
    randoms_cat = nbodykit.lab.FITSCatalog(os.path.join(base_dir, filename))
    print('randoms_cat.columns = ', randoms_cat.columns)
    randoms_cat = randoms_cat[(randoms_cat['Z'] > 0.01) & (randoms_cat['Z'] < 0.9)]
    return randoms_cat

Now we can combine the data catalogues and random catalogue, assign the point distributions to a 3D grid and calculate the power spectrum, all using nbodykit.

# Combine data and random catalogue
fkp = nbodykit.lab.FKPCatalog(data, random)
# Assign point distribution to 3D grid
mesh = nbodykit.lab.fkp.to_mesh(Nmesh=512, nbar='NZ', comp_weight='WEIGHT', fkp_weight='WEIGHT_FKP')
# Calculate power spectrum (monopole only)
r = nbodykit.lab.ConvolvedFFTPower(mesh, poles=[0], dk=0.01, kmin=0.01)

Here we use a grid with 512 grid points in each of the 3 dimensions. There are also two weightings included in this calculation, the 'Weight' and 'WEIGHT_FKP' columns. The first weight refers to a correction of incompleteness in the data catalogue, while the second weight tries to optimise the signal-to-noise.

If we use this code to measure the BOSS power spectrum for the northern and southern part we get

Figure 3: Power spectrum measurements for the northern (orange)

and southern (blue) parts of the BOSS dataset.

Here we see that the power spectra of the northern and southern parts are slightly different. Whether these differences are significant is not clear though, since we don't have uncertainties on these measurements (yet).

We could now compare these measurements with models for the power spectrum like

Figure 4: Power spectrum models with different amounts of dark

matter using the class Boltzmann code.

This would allow us to determine the amount of dark matter in the Universe and would be much more practical compared to the comparison of point distributions which we looked at above.

However, before we can go ahead and constrain cosmological parameters we have to get an estimate of the uncertainty on the measurements. That will be the subject of the next post.
You can find the code for this project on GitHub.
cheers
Florian

Saturday, October 7, 2017

Study the Universe with Python tutorial, part 1 -- the BOSS dataset

In the next few blog posts, I will introduce a dataset of more than 1 million galaxies and I will show how easy it is to use this dataset to test cosmological models using Python. This tutorial is intended for non-experts or anybody who wants to get their hands on a very exciting dataset. If you are looking for an interesting Python project, look no further.

First, we need to download the data. We are going to use the dataset produced by the Baryon Oscillation Spectroscopic Survey (BOSS) collaboration which represents the largest dataset of its kind to date. You can download all necessary data with the following 2 wget commands, just provide the destination path at the end. This will download a bit less than 3Gb of zipped data which turns into 8Gb after you unzip it.

wget -N https://data.sdss.org/sas/dr12/boss/lss/galaxy_DR12v5_CMASSLOWZTOT_North.fits.gz -P path/to/folder/
wget -N https://data.sdss.org/sas/dr12/boss/lss/galaxy_DR12v5_CMASSLOWZTOT_South.fits.gz -P path/to/folder/

For this project, we need the Python package nbodykit, which can be installed via conda

conda create --name nbodykit-env python=3
source activate nbodykit-env 
conda install -c bccp nbodykit

or simply use pip

pip install nbodykit

Now we are ready to go.

We start by reading in the data catalog using the fitsio package, which is included in nbodykit

import nbodykit.lab
def read_data(filename):
    ''' Here we read the data catalogues '''
    galaxies_cat = nbodykit.lab.FITSCatalog(os.path.join(base_dir, filename))
    print('galaxies_cat.columns = ', galaxies_cat.columns)
    galaxies_cat = galaxies_cat[(galaxies_cat['Z'] > 0.01) & (galaxies_cat['Z'] < 0.9)]
    galaxies_cat['Weight'] = galaxies_cat['WEIGHT_SYSTOT'] * (galaxies_cat['WEIGHT_NOZ'] + galaxies_cat['WEIGHT_CP'] - 1.0)
    return galaxies_cat

Here we included the nbodykit.lab functions, one of which is the FITSCatalog() function, which reads all columns in the fits file and stores them in a dask array. With galaxies_cat.columns you can print out the column header of the file, which should look like this

galaxies_cat.columns = ['AIRMASS', 'CAMCOL', 'COMP', 'DEC', 'DEVFLUX', 'EB_MINUS_V', 'EXPFLUX', 'EXTINCTION', 'FIBER2FLUX', 'FIBERID', 'FIELD', 'FINALN', 'FRACPSF', 'ICHUNK', 'ICOLLIDED', 'ID', 'IMAGE_DEPTH', 'IMATCH', 'INGROUP', 'IPOLY', 'ISECT', 'MJD', 'MODELFLUX', 'MULTGROUP', 'NZ', 'PLATE', 'PSFFLUX', 'PSF_FWHM', 'RA', 'RERUN', 'RUN', 'R_DEV', 'SKYFLUX', 'SPECTILE', 'Selection', 'TILE', 'Value', 'WEIGHT_CP', 'WEIGHT_FKP', 'WEIGHT_NOZ', 'WEIGHT_SEEING', 'WEIGHT_STAR', 'WEIGHT_SYSTOT', 'Weight', 'Z']

In this tutorial, we will use only a subset of the columns given here. Namely, we will only use the positions of the galaxies given by the angles on the sky in 'RA' (right ascension) and 'DEC' (declination) as well as the redshift 'Z'. The redshift describes how much the light spectrum of the galaxy has been shifted in wavelength due to the expansion of the Universe. Most of the galaxies in this dataset are several billion light-years away from us, meaning the light travelled a significant fraction of the existence of the Universe before reaching us. Galaxies which are further away have a larger redshift and we will use the redshift later together with the right ascension and declination to get a 3D position for all galaxies.

Let's explore the dataset a bit before we go any further. We can read in the data

filename = 'galaxy_DR12v5_CMASSLOWZTOT_South.fits'
data_south = read_data(filename)
filename = 'galaxy_DR12v5_CMASSLOWZTOT_North.fits'
data_north = read_data(filename)
plot_data(data_south, data_north)

and plot the right ascension and declination

def plot_data(cat1, cat2):
    ''' Plot the catalogues '''
    cat1 = down_sample(cat1, 1000)
    cat1['RA'] -= 180
    cat2 = down_sample(cat2, 1000)
    cat2['RA'] -= 180

    plt.clf()
    plt.subplot(projection="aitoff")
    plt.title("BOSS DR12 survey footprint", y=1.1)
    plt.plot(np.radians(cat1['RA'].compute()), np.radians(cat1['DEC'].compute()), '.')
    plt.plot(np.radians(cat2['RA'].compute()), np.radians(cat2['DEC'].compute()), '.')
    plt.grid(True)
    plt.show()

Here we read in the two data catalogues, down-sample these catalogues (scatter plots with 1 million points will be difficult to handle) and convert the angles from degrees to radians. The down_sample() function is defined as

def down_sample(cat, N):
    x = np.random.uniform(size=cat.size)
    return cat[x < float(N) / cat.size]

We make a plot by projecting the data onto a sphere using Aitoff projection.

Figure 1: The sky distribution of galaxies in the BOSS galaxy survey using Aitoff projection.

Only 1000 points for each region are plotted.

This plot shows the sky coverage of the dataset. The galaxy survey observed in two patches on the sky, which explains why we have two files. We will call them north and south from now on even though east and west might be more intuitive from this figure. However, this is just because of the aitoff projection. In fact, the plane of the Milky Way galaxy does go right in between the two patches, so that the southern part (blue) is indeed south of the Milky Way and the northern part (yellow) is north of it.

The northern part has about twice as many galaxies as the southern part and in total this dataset covers about 1 quarter of the sky.

We can also plot the redshift distribution

bins = np.arange(0., 0.9, 0.01)
plt.hist(cat1['Z'].compute(), weights=cat1['Weight'].compute(), bins=bins, alpha=0.4, label='BOSS South')
plt.hist(cat2['Z'].compute(), weights=cat2['Weight'].compute(), bins=bins, alpha=0.4, label='BOSS North')
plt.xlabel("redshift")
plt.legend(loc=0)
plt.show()

which looks like this

Figure 2: Redshift distribution of the BOSS galaxy sample. Redshift

0.7 corresponds to about 6 000 000 000 (6 billion) light-years. So when the light from

these galaxies was sent out the Universe was only about 7 billion years old,
while now it is 13 billion years old.

The next step is to turn the RA, DEC, z system into Cartesian coordinates. However, this transformation depends on the cosmological model. Using the current standard model of cosmology (also called $\Lambda$CDM) we can set up a cosmological framework by fixing a small set of parameters
(1) The amount of baryonic matter (Omega0_b)
(2) The amount of cold dark matter (Omega0_cdm)

(3) The Hubble parameter h (or more precisely the Hubble parameter divided by 100, h = H/100)

In principle, there are more free parameters, but these are the ones that matter to this analysis. Using nbodykit we can set a cosmology with

cosmo = nbodykit.lab.cosmology.Cosmology(Omega0_cdm=0.265, Omega0_b=0.045, h=0.7)

Here we also specified that we want to use a flat cosmology, which has no curvature which means that the dark energy parameter is automatically fixed through

\[
\Omega_{DarkEnergy} = 1 - \Omega_{cdm} + \Omega_b.
\]These parameters constrain the expansion history of the Universe, including its size and age. The dataset we just read with our Python script is powerful enough to constrain these parameters and therefore teach us something about dark matter and dark energy. We will get to that in a later post.

Here we want to convert the angles (RA, DEC) together with the redshift to a 3D position for the galaxies. To do this we can just call a function provided by nobodykit

galaxies_cat_1['Position'] = nbodykit.transform.SkyToCartesian(galaxies_cat_1['RA'],\
    galaxies_cat_1['DEC'], galaxies_cat_1['Z'], cosmo=cosmo)
galaxies_cat_2['Position'] = nbodykit.transform.SkyToCartesian(galaxies_cat_2['RA'],\
    galaxies_cat_2['DEC'], galaxies_cat_2['Z'], cosmo=cosmo)

with this, we can now make a 3D plot of all galaxies

Figure 3: 3D distribution of BOSS galaxies. Only 1000 randomly selected

galaxies out of more than 1 million in this dataset are included in this plot.

Milky Way is located in the middle and the galaxies in the northern (yellow)

and southern (blue) part are observed in two opposite directions.

And this concludes this first look at the BOSS dataset. This dataset has only been released last year (2016) and represents cutting edge cosmological research. How to use this dataset to constrain dark matter and dark energy will be the subject of the next blog posts.

The entire Python code for this project is available on GitHub.

cheers
Florian