My statistics blog: Study the Universe with Python tutorial, part 3 -- covariance matrix

In the last two blog posts, we discussed how to download the galaxy catalogues of the Baryon Oscillation Spectroscopic Survey (BOSS) and calculate the power spectrum of this dataset. In this blog post, we will calculate uncertainties on the power spectrum measurement.

Uncertainties can be obtained in many different ways. Here we will use simulations of the BOSS dataset. Members of the BOSS collaboration have used supercomputers to simulate what a Universe could look like and they have generated thousands of such simulated BOSS datasets. All we have to do is to calculate the power spectrum of these thousands of simulations and measure the variance between them.

We don't just calculate the uncertainty; we also calculate the correlation between different data points. All the information we need is contained in the covariance matrix, which is defined as
\[C(k_i,k_j) = \begin{pmatrix} \sigma^2_{k_1} & cor(k_1,k_2)\sigma_{k_1}\sigma_{k_2} & \dots & cor(k_1,k_n)\sigma_{k_1}\sigma_{k_n} \\
cor(k_2,k_1)\sigma_{k_2}\sigma_{k_1} & \sigma^2_{k_2} & \dots & cor(k_2,k_n)\sigma_{k_2}\sigma_{k_n} \\
\vdots & \vdots & \ddots & \vdots \\ cor(k_n,k_1)\sigma_{k_n}\sigma_{k_1} & cor(k_n,k_2)\sigma_{k_n}\sigma_{k_2} & \dots & \sigma^2_{k_n} \end{pmatrix}\]
Here $\sigma_{k_i}$ is the standard deviation of the power spectrum $P(k_i)$ at wavenumber $k_i$. The off-diagonal elements of this matrix describe how the uncertainty on one data point is coupled to the other data points; they are proportional to the correlation $cor(k_i,k_j)$ between the two data points.
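To make the definition concrete, here is a small sketch that builds a covariance matrix from standard deviations and correlations. The numbers are made up for illustration and have nothing to do with BOSS.

```python
import numpy as np

# Toy example: three k-bins with made-up standard deviations and correlations
sigma = np.array([2.0, 1.0, 0.5])
corr = np.array([[1.0, 0.3, 0.1],
                 [0.3, 1.0, 0.2],
                 [0.1, 0.2, 1.0]])

# C_ij = cor(k_i, k_j) * sigma_i * sigma_j
cov = corr * np.outer(sigma, sigma)

# The diagonal of the covariance matrix holds the variances sigma_i^2
print(np.diag(cov))
```

Since the correlation of a bin with itself is one, the diagonal reduces to the variances, while the off-diagonal elements scale with the correlation between the two bins.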

First, we download the simulated BOSS catalogues together with the corresponding random catalogues. Since there are so many of them, these are quite large files:

wget -N https://data.sdss.org/sas/dr12/boss/lss/dr12_multidark_patchy_mocks/Patchy-Mocks-DR12SGC-COMPSAM_V6C.tar.gz -P path/to/folder/
wget -N https://data.sdss.org/sas/dr12/boss/lss/dr12_multidark_patchy_mocks/Patchy-Mocks-DR12NGC-COMPSAM_V6C.tar.gz -P path/to/folder/
wget -N https://data.sdss.org/sas/dr12/boss/lss/dr12_multidark_patchy_mocks/Patchy-Mocks-Randoms-DR12SGC-COMPSAM_V6C_x50.tar.gz -P path/to/folder/
wget -N https://data.sdss.org/sas/dr12/boss/lss/dr12_multidark_patchy_mocks/Patchy-Mocks-Randoms-DR12NGC-COMPSAM_V6C_x50.tar.gz -P path/to/folder/

All 2048 data catalogues can be paired with one random catalogue to measure the power spectrum of the simulated catalogues.
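The downloaded files are gzip-compressed tar archives, so they have to be unpacked before the catalogues can be read. A minimal sketch using Python's tarfile module could look like this; the helper name extract_mocks is my own and not part of the code of this project, and I make no assumption here about the folder layout inside the archives.

```python
import os
import tarfile

def extract_mocks(archive_path, target_dir):
    ''' Unpack a .tar.gz archive into target_dir and return the member names '''
    os.makedirs(target_dir, exist_ok=True)
    with tarfile.open(archive_path, 'r:gz') as tar:
        names = tar.getnames()
        tar.extractall(target_dir)
    return names
```

Calling extract_mocks() once per downloaded archive, with the download folder as target, should leave the individual catalogue files on disk.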

Given that the simulated catalogues are in CSV format (rather than in FITS format like the data catalogues), we have to write new read functions using the CSVCatalog() function from nbodykit:

import os
from nbodykit.lab import CSVCatalog

# Note: base_dir, zmin and zmax have to be defined at module level

def read_sim_data(filename):
    ''' Read a simulated galaxy catalogue from CSV '''
    sim_cat = CSVCatalog(os.path.join(base_dir, filename),
        names=['RA', 'DEC', 'Z', 'dummy1', 'NZ', 'dummy2', 'veto', 'Weight'])
    print('sim_cat.columns = ', sim_cat.columns)
    # Keep only objects inside the redshift range with a positive veto flag
    sim_cat = sim_cat[(sim_cat['Z'] > zmin) & (sim_cat['Z'] < zmax)]
    sim_cat = sim_cat[(sim_cat['veto'] > 0)]
    # Recalculate the FKP weights
    sim_cat['WEIGHT_FKP'] = 1./(1. + 10000.*sim_cat['NZ'])
    return sim_cat

def read_sim_ran(filename):
    ''' Read the random catalogue of the simulations from CSV '''
    ran_cat = CSVCatalog(os.path.join(base_dir, filename),
        names=['RA', 'DEC', 'Z', 'NZ', 'dummy1', 'veto', 'Weight'])
    print('ran_cat.columns = ', ran_cat.columns)
    # Keep only objects inside the redshift range with a positive veto flag
    ran_cat = ran_cat[(ran_cat['Z'] > zmin) & (ran_cat['Z'] < zmax)]
    ran_cat = ran_cat[(ran_cat['veto'] > 0)]
    # Recalculate the FKP weights
    ran_cat['WEIGHT_FKP'] = 1./(1. + 10000.*ran_cat['NZ'])
    return ran_cat

For both catalogues, we have to ensure that the veto column is non zero and we have to recalculate the FKP weights.
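As a quick sanity check of the weight formula used in both read functions, here is a small sketch with made-up number densities. The constant 10000 is taken from the code above; the density values are purely illustrative.

```python
import numpy as np

def fkp_weight(nz, amplitude=10000.):
    ''' FKP weight 1/(1 + amplitude*n(z)), as in the read functions above '''
    return 1./(1. + amplitude*nz)

# Made-up number densities, not taken from the BOSS catalogues
nz = np.array([1e-4, 3e-4, 5e-4])
weights = fkp_weight(nz)
print(weights)
```

The weight drops as the number density increases, so galaxies in densely sampled regions contribute less than galaxies in sparsely sampled regions.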

With this, we can now read in all the catalogues, calculate the power spectra and from that, we can derive a covariance matrix. We define a process_sims() function like

import os
import pickle

def process_sims(cosmo, tag, N):
    ''' Calculate the covariance matrix using the simulated BOSS datasets '''
    # Read the random catalogue which can be paired with all of the simulated catalogues
    filename = 'Patchy-Mocks-Randoms-DR12%s-COMPSAM_V6C_x50.dat' % tag
    random = read_sim_ran(filename)

    for i in range(1, N):
        pk_file = output_path + "/sims/pk_%s_%d.pickle" % (tag, i)
        print('calculating %s... ' % pk_file)
        # Only calculate power spectra which we don't have on disk
        if not os.path.isfile(pk_file):
            filename = 'Patchy-Mocks-DR12%s-COMPSAM_V6C/Patchy-Mocks-DR12%s-COMPSAM_V6C_%0.4d.dat' % (tag, tag, i)
            sim = read_sim_data(filename)
            pk = calc_pk(cosmo, sim, random)
            # Store the power spectrum; the with-block closes the file again
            with open(pk_file, "wb") as f:
                pickle.dump(pk, f)
    return calc_cov(tag, N)

This function gets a cosmological model as input as well as a string tag and an integer $N$. The tag is either 'SGC' or 'NGC' and determines whether we process the simulated catalogues for the southern part or the northern part. The integer $N$ sets how many catalogues we process: the loop runs over range(1, N), i.e. over the catalogues numbered $1$ to $N-1$.

Since the calculation of all 2048 power spectra can take many hours, I use pickle to write each power spectrum to disk. The function loops over the catalogue indices, tests whether the power spectrum has already been calculated, and if not, calculates it using the same function we defined in the last blog post.
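The compute-only-if-missing pattern can be isolated into a small helper. This is just a sketch of the idea; the names cached() and expensive() are made up for this example and the dictionary stands in for a real power spectrum.

```python
import os
import pickle
import tempfile

def cached(path, compute):
    ''' Return the pickled result at path, computing and storing it if missing '''
    if os.path.isfile(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    result = compute()
    with open(path, 'wb') as f:
        pickle.dump(result, f)
    return result

calls = []
def expensive():
    ''' Stand-in for a slow power spectrum calculation '''
    calls.append(1)
    return {'k': [0.1, 0.2], 'power': [100.0, 50.0]}

path = os.path.join(tempfile.mkdtemp(), 'pk_demo.pickle')
first = cached(path, expensive)   # computes the result and writes it to disk
second = cached(path, expensive)  # loads the result from disk
```

If the script is interrupted, it can simply be restarted and will continue with the first power spectrum which is not yet on disk.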

Finally, we calculate the covariance matrix using the calc_cov() function, which simply reads the power spectra, subtracts the shot noise, limits the range to $k < k_{max} = 0.3h/$Mpc and makes use of the numpy.cov() function.

import pickle
import numpy as np

def calc_cov(tag, N):
    ''' Read simulation power spectra and return covariance matrix '''
    list_of_pks = []
    for i in range(1, N):
        pk_file = output_path + "/sims/pk_%s_%d.pickle" % (tag, i)
        with open(pk_file, "rb") as f:
            pk = pickle.load(f)
        # Monopole power spectrum with the shot noise subtracted
        P = pk['power_0'].real - pk.attrs['shotnoise']
        # Limit the k range
        P = P[(pk['k'] < kmax)]
        list_of_pks.append(P)
    # np.cov expects one row per variable (k-bin), hence the transpose
    return np.cov(np.vstack(list_of_pks).T)
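The transpose in the last line matters, because numpy.cov() treats each row as one variable. The following sketch uses random numbers instead of real power spectra and compares numpy.cov() with the textbook estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
n_mocks, n_bins = 500, 4
# One row per mock, one column per k-bin (random numbers, not real spectra)
mock_pks = rng.normal(size=(n_mocks, n_bins))

# After the transpose each row is one k-bin, which is what np.cov expects
cov = np.cov(mock_pks.T)

# The same estimate written out by hand
diff = mock_pks - mock_pks.mean(axis=0)
cov_manual = diff.T @ diff / (n_mocks - 1)

print(cov.shape)
```

Without the transpose we would get an n_mocks x n_mocks matrix instead. Passing rowvar=False to numpy.cov() is an equivalent alternative to the transpose.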

We can turn the covariance matrix into a correlation matrix
\[
R_{ij} = \frac{C_{ij}}{\sqrt{C_{ii}C_{jj}}}
\] and plot it as

import matplotlib.pyplot as plt
from statsmodels.stats.moment_helpers import cov2corr

def plot_cov(matrix):
    ''' Plot the correlation matrix derived from the covariance matrix '''
    plt.imshow(cov2corr(matrix))
    plt.colorbar()
    plt.show()

which should look like this

Figure 1: Correlation matrix, derived from the covariance matrix of the northern part of BOSS using 1000 simulated catalogues.

Figure 1 shows that there is some correlation between the power spectrum bins close to each other, while some bins which are far apart are anti-correlated.
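If you don't want to install statsmodels just for this step, the conversion from covariance to correlation matrix can also be sketched with plain NumPy. The function name cov_to_corr is my own and the 2x2 matrix is a toy example.

```python
import numpy as np

def cov_to_corr(cov):
    ''' Divide each element C_ij by sigma_i*sigma_j '''
    sigma = np.sqrt(np.diag(cov))
    return cov / np.outer(sigma, sigma)

# Toy covariance matrix with sigma_1 = 2 and sigma_2 = 1
cov = np.array([[4.0, 1.0],
                [1.0, 1.0]])
corr = cov_to_corr(cov)
print(corr)
```

The diagonal of a correlation matrix is always one, and the off-diagonal elements lie between -1 and 1.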

Note that calculating 2048 power spectra will probably take a while and you might want to make use of the MPI capabilities of nbodykit. MPI stands for Message Passing Interface and allows you to run jobs in parallel. If you have 4 cores in your machine, you can theoretically calculate 4 power spectra in parallel.

To do this we have to rewrite the process_sims() function to

import os
import pickle
from nbodykit.lab import TaskManager

def process_sims_with_MPI(cosmo, tag, N):
    ''' Calculate the covariance matrix using the simulated BOSS datasets '''
    # cpus_per_task=1: each task (one catalogue) is handled by a single MPI rank
    with TaskManager(1, use_all_cpus=True) as tm:
        # Read the random catalogue which can be paired with
        # all of the simulated catalogues
        filename = 'Patchy-Mocks-Randoms-DR12%s-COMPSAM_V6C_x50.dat' % tag
        random = read_sim_ran(filename)
        # loop over the catalogues in parallel
        for i in tm.iterate(range(1, N)):
            pk_file = output_path + "/sims/pk_%s_%d.pickle" % (tag, i)
            print('calculating %s... ' % pk_file)
            # Only calculate power spectra which we don't have on disk
            if not os.path.isfile(pk_file):
                filename = ("Patchy-Mocks-DR12%s-COMPSAM_V6C/Patchy-"
                            "Mocks-DR12%s-COMPSAM_V6C_%0.4d.dat" % (tag, tag, i))
                sim = read_sim_data(filename)
                pk = calc_pk(cosmo, sim, random)
                # Store the power spectrum; the with-block closes the file again
                with open(pk_file, "wb") as f:
                    pickle.dump(pk, f)
    return calc_cov(tag, N)

However, using MPI to do 4 calculations in parallel will also take 4 times as much memory and most likely you will be limited by memory. If you have access to a supercomputer, you can just use many nodes, but if not, you can download 1000 power spectra here.

The diagonal elements of the covariance matrix are the variances of the individual power spectrum bins. Their square roots would be the full uncertainties if the different data points were not correlated:
\[
\sigma_{P(k_i)} = \sqrt{C(k_i,k_i)}.
\]
Using these uncertainties we can plot the BOSS measurements with error bars

Figure 2: The BOSS power spectrum measurements including the diagonal elements of the covariance matrix as error bars.
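A plot like Figure 2 can be produced along the following lines. This is only a sketch with made-up numbers; the function names and the axis labels are my own choices and not necessarily what was used for the figure above.

```python
import numpy as np

def errors_from_cov(cov):
    ''' Return the square root of the diagonal of the covariance matrix '''
    return np.sqrt(np.diag(cov))

def plot_pk_with_errors(k, P, cov, filename):
    ''' Plot P(k) with error bars and save the figure to filename '''
    # Imported here so that errors_from_cov() can be used without matplotlib
    import matplotlib.pyplot as plt
    plt.errorbar(k, P, yerr=errors_from_cov(cov), fmt='.')
    plt.xlabel(r'$k$ [$h$/Mpc]')
    plt.ylabel(r'$P(k)$')
    plt.savefig(filename)
    plt.close()

# Made-up numbers for three k-bins, not the BOSS measurements
k = np.array([0.05, 0.10, 0.15])
P = np.array([40000.0, 20000.0, 10000.0])
cov = np.diag([4.0e6, 1.0e6, 2.5e5])
sigma = errors_from_cov(cov)
print(sigma)
```

Keep in mind that such error bars only show the diagonal of the covariance matrix; the correlations between the bins are not visible in this kind of plot.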

With this, we now have the power spectra and covariance matrix, which is all we need to start using this dataset to constrain cosmological parameters. We will start with that in the next blog post.
You can find the code for this project on GitHub. Let me know if you have any questions/comments below.
cheers
Florian


Wednesday, October 11, 2017
