Thursday, October 12, 2017

Study the Universe with Python tutorial, part 5 -- Markov Chain Monte Carlo

This is the fifth blog post in this series, which discusses the Baryon Oscillation Spectroscopic Survey (BOSS) dataset. In the last four posts, we downloaded the data, calculated the power spectrum and covariance matrix, and isolated the BAO feature. Now we want to apply this procedure to the data and extract cosmological information.

In the last post, we used the scipy minimize function to find the best-fitting model. However, the best-fitting model is not very useful if we don't know the uncertainties attached to it.

What we really need is a probability for each possible model which could describe our Universe. The probability for a model $Y$ with parameters $x$ should reflect the probability that we would observe the BOSS dataset if the Universe were described by $Y(x)$. Such a probability distribution is called a likelihood, and a very powerful way to obtain such a likelihood is a Markov Chain Monte Carlo (MCMC).

What is an MCMC? An MCMC is a chain of parameter sets with the nice property that the frequency with which a parameter value appears in the chain represents its likelihood distribution. This means one can take the chain and produce a histogram of parameter $x$, which directly provides the (marginalised) likelihood for parameter $x$ (marginalised just means that one integrates over the likelihood of all other parameters).
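For example, once the chain is available as a NumPy array with one column per parameter, the marginalised likelihood of a single parameter is just the normalised histogram of its column (a toy sketch with a made-up chain):
import numpy as np

# toy chain: shape (number of samples, number of parameters)
chain = np.random.normal(size=(10000, 2))
# the normalised histogram of one column is the marginalised likelihood of that parameter
counts, edges = np.histogram(chain[:, 0], bins=50, density=True)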

There are many algorithms which can produce such chains, but the most basic and well-known one is the Metropolis-Hastings algorithm. Here we will just focus on the Python implementation of such a procedure, but if you have never encountered MCMC it is certainly worth learning about in more detail.
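If you have never seen the Metropolis-Hastings algorithm, here is a minimal toy sketch (sampling a one-dimensional Gaussian, completely independent of the BOSS analysis) just to illustrate the accept/reject logic:
import numpy as np

def metropolis_hastings(log_like, start, nsteps=10000, step_size=0.5):
    ''' Minimal Metropolis-Hastings sampler, for illustration only '''
    chain = [start]
    current_loglike = log_like(start)
    for _ in range(nsteps):
        proposal = chain[-1] + step_size*np.random.randn()
        proposal_loglike = log_like(proposal)
        # accept with probability min(1, L_proposal/L_current)
        if np.log(np.random.rand()) < proposal_loglike - current_loglike:
            chain.append(proposal)
            current_loglike = proposal_loglike
        else:
            chain.append(chain[-1])
    return np.array(chain)

# toy example: sample a unit Gaussian centred at zero
toy_chain = metropolis_hastings(lambda x: -0.5*x**2, start=0.)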

We will use the emcee MCMC package, which can be installed with
>> pip install emcee
and included with
import emcee
We can call this MCMC sampler like
nwalkers = 20
# start from the best-fitting parameters found by the minimisation in the last post
start = result['x']
dim = len(start)
expected_error = [0.1, 1., 1., 1., 1., 1., 0.05, 0.1]
sampler = emcee.EnsembleSampler(nwalkers=nwalkers, dim=dim, lnpostfn=get_loglike,\
 args=(boss_data, { 'noBAO': Pk_without_BAO, 'os_model': os_model }, get_shifted_model))
start_positions = [start + (2.*np.random.random_sample((dim,)) - 1.)*expected_error for i in range(nwalkers)]
sampler.run_mcmc(start_positions, 1000)
Here we have to specify the number of walkers (nwalkers), since this MCMC package produces several MCMC chains in parallel (each walker produces one chain). The separate chains produced by each walker are not really independent, since they communicate with each other (Goodman & Weare, 2010). We also need to specify the starting values for the different walkers. To do that, we use the result from the minimisation procedure discussed in the last post and add a random variation around it. Unlike our minimisation function in the last blog post, this algorithm considers high values a good match. Therefore we have to write a small function which turns our $\chi^2$ into a log-likelihood, defined as
def get_loglike(parameters, data, templates, func):
    return -0.5*calc_chi2(parameters, data, templates, func)
The run_mcmc() method advances each walker by 1000 steps, so the resulting chain has 1000 rows times the number of walkers.
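With the emcee 2.x interface used here, the samples can afterwards be accessed through the sampler object created above (a quick sketch):
samples = sampler.chain           # shape (nwalkers, nsteps, dim), one sub-chain per walker
flat_samples = sampler.flatchain  # shape (nwalkers*nsteps, dim), all walkers merged
print(samples.shape, flat_samples.shape)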

The problem with MCMC is that it requires a certain time to converge. This means the number of appearances of a parameter only reflects its likelihood if the chain has converged. Determining convergence is very tricky, and there is no procedure which is guaranteed to detect convergence correctly. Here we will use the Gelman & Rubin criterion.

The Gelman & Rubin criterion compares several independent chains and checks whether the variance within each chain is consistent with the variance between the chains. Here we will not discuss the details of the implementation, but you can read more about it in this interesting blog post.

To use the Gelman & Rubin criterion we need to run several MCMC chains and compare them. The chains are independent and are therefore usually run in parallel to save time. This means we have to re-write the MCMC procedure
def runMCMC(start, data, templates):
    ''' Perform MCMC '''
    dim = len(start)
    Nchains = 4
    nwalkers = 20
    ichaincheck = 400
    minlength = 2000
    epsilon = 0.01

    labels = ['b', 'A1', 'A2', 'A3', 'A4', 'A5', 'alpha', 'sigmaNL']
    expected_error = [0.1, 1., 1., 1., 1., 1., 0.05, 0.1]
    # Set up the sampler.
    pos=[]
    list_of_samplers=[]
    for jj in range(0, Nchains):
        pos.append([start + (2.*np.random.random_sample((dim,)) - 1.)*expected_error for i in range(nwalkers)])
        list_of_samplers.append(emcee.EnsembleSampler(nwalkers=nwalkers, dim=dim, lnpostfn=get_loglike,\
         args=(data, templates, get_shifted_model)))

    # Start MCMC
    print("Running MCMC... ")
    within_chain_var = np.zeros((Nchains, dim))
    mean_chain = np.zeros((Nchains, dim))
    scalereduction = np.full(dim, 2.)

    itercounter = 0
    chainstep = minlength
    while any(abs(1. - scalereduction) > epsilon):
        itercounter += chainstep
        for jj in range(0, Nchains):
            for result in list_of_samplers[jj].sample(pos[jj], iterations=chainstep, storechain=True):
                pos[jj] = result[0]
            # we do the convergence test on the second half of the current chain (itercounter/2)
            within_chain_var[jj], mean_chain[jj] = prep_gelman_rubin(list_of_samplers[jj])
        scalereduction = gelman_rubin_convergence(within_chain_var, mean_chain, int(itercounter/2))
        print("scalereduction = ", scalereduction)
        chainstep = ichaincheck
    # Investigate the chain and print out some metrics
    inspect_chain(list_of_samplers, labels)
    return 
This function does essentially the same thing as the function we had before, but it now runs within a loop which is only broken once the Gelman & Rubin parameters in the scalereduction array are all within epsilon of unity.
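The helper functions prep_gelman_rubin() and gelman_rubin_convergence() are part of the code on GitHub; a minimal sketch consistent with how they are called above (within-chain variances and means from the second half of each chain, combined into the scale reduction parameter) could look like this:
def prep_gelman_rubin(sampler):
    ''' Mean and within-chain variance per parameter, from the second half of the chain '''
    # emcee 2.x chain shape: (nwalkers, nsteps, dim)
    nwalkers, nsteps, dim = sampler.chain.shape
    samples = sampler.chain[:, nsteps//2:, :].reshape((-1, dim))
    within_chain_var = np.var(samples, axis=0)
    mean_chain = np.mean(samples, axis=0)
    return within_chain_var, mean_chain

def gelman_rubin_convergence(within_chain_var, mean_chain, chain_length):
    ''' Gelman & Rubin scale reduction parameter for each dimension '''
    W = np.mean(within_chain_var, axis=0)                 # mean within-chain variance
    B = chain_length*np.var(mean_chain, axis=0, ddof=1)   # between-chain variance
    var_est = (1. - 1./chain_length)*W + B/chain_length   # pooled variance estimate
    return np.sqrt(var_est/W)                             # scale reduction parameter R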

The inspect_chain() function provides several diagnostics, with which we can further test whether the chain has actually converged. For example, we can print out the timeline of the sampling range for each parameter
fig, axes = plt.subplots(dim, 1, sharex=True, figsize=(8, 9))
for i in range(0, dim):
    for jj in range(0, Nchains):
        axes[i].plot(list_of_samplers[jj].chain[:, :, i].T, alpha=0.4)
    #axes[i].yaxis.set_major_locator(MaxNLocator(5))
    axes[i].set_ylabel(labels[i])
fig.tight_layout(h_pad=0.0)
fig.savefig("%stime_series.png" % output_path)
which gives
Figure 1: Variance in the different parameters. The convergence
has been reached if the variance is constant with time. In the early
stages of the chain the variance changes, indicating that the
convergence has not been reached.
This figure shows that the sampling range varies in the beginning, which indicates that the chain has not converged yet. The second half of the chain has an almost constant variance which is a good indication of convergence.

If we convince ourselves that the chain has indeed converged, we can plot the likelihood. Here we use corner.py, which can be installed with
>> pip install corner
and included with
import corner
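The mergedsamples array used below collects the samples from all chains and walkers into one flat array; assuming the list_of_samplers and dim variables from runMCMC() above, and discarding the first half of each chain as burn-in, it could be built like this:
# merge the second half of every chain (all walkers) into one flat (N, dim) array
mergedsamples = np.concatenate([sampler.chain[:, sampler.chain.shape[1]//2:, :].reshape((-1, dim))
                                for sampler in list_of_samplers])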
We can then feed our chain into this plotting package as
fig = corner.corner(mergedsamples, quantiles=[0.16, 0.5, 0.84], plot_density=False,\
    show_titles=True, title_fmt=".3f", labels=labels)
fig.savefig("%scorner.png" % output_path)
which gives
Figure 2: 2D likelihood distributions for the 8 parameters in our fit.
The 1D marginalised likelihood is given on the right together with the
$68\%$ confidence levels. 
Finally, we can also print out the constraints with
def get_percentiles(chain, labels):
    ''' Calculate constraints and uncertainties from MCMC chain '''
    # median, 68% (1 sigma) and 95% (2 sigma) percentiles for each parameter
    per = np.percentile(chain, [50., 15.86555, 84.13445, 2.2775, 97.7225], axis=0)
    # convert to [median, +1sigma, -1sigma, +2sigma, -2sigma]
    per = [per[0], per[0]-per[1], per[2]-per[0], per[0]-per[3], per[4]-per[0]]
    for i in range(0, len(per[0])):
        print("%s = %f +%f -%f +%f -%f" % (labels[i], per[0][i], per[1][i], per[2][i], per[3][i], per[4][i]))
    return per
and with the BOSS data, we should get something like
>> b = 2.295306 +0.314054 -0.312351 +0.584276 -0.610636
A1 = -0.151418 +0.111546 -0.111835 +0.223292 -0.232080
A2 = 41.676085 +16.889086 -17.505383 +30.571086 -37.981033
A3 = -3306.605503 +1642.375097 -1467.782597 +3487.705208 -2538.709703
A4 = 14835.170765 +5912.070481 -6608.236429 +10122.305967 -14237.721237
A5 = -21141.535952 +9150.798840 -8221.692166 +20105.713945 -14085.031087
alpha = 0.990088 +0.032634 -0.027241 +0.077669 -0.059837
sigmaNL = 11.378695 +2.975023 -4.083891 +5.036540 -7.682274
Now what does that mean in terms of cosmological parameters? Well, as mentioned before, the only parameter of interest here is the scaling parameter $\alpha$. This parameter is defined as
\[
\alpha = \frac{D_V}{D^{\rm fid}_V}
\]where the distance $D_V$ is defined as
\[
D_V = \left[(1+z)^2D^2_A\frac{cz}{H(z)}\right]^{1/3}
\]The superscript $^{\rm fid}$ just means that this is the same quantity in the fiducial cosmology, which is the one we defined in the cosmo class in nbodykit.

The distance $D_V(z)$ is a combination of the angular diameter distance $D_A(z)$ and the Hubble parameter $H(z)$
\[
\begin{align}
D_A(z) &= \frac{c}{1+z}\int^z_0\frac{dz'}{H(z')}\\
H(z) &= H_0\left[(\Omega_{cdm}+\Omega_b)(1+z)^3 + \Omega_{\Lambda}\right]^{1/2}
\end{align}
\]With this we can now turn the constraint on $\alpha$ into a constraint on $D_V$
# zmean is the effective redshift of the galaxy sample
Hz = cosmo.efunc(zmean)*cosmo.H0*cosmo.h
DC = (1.+zmean)*cosmo.angular_diameter_distance(zmean)/cosmo.h
c_km_s = (speed_of_light/1000.)
DVfid = ( DC**2*(zmean*c_km_s/Hz) )**(1./3.)
# per comes from get_percentiles(); column 6 is the scaling parameter alpha
DV = np.array(per)[:, 6]*DVfid
print("DV = %f +%f -%f +%f -%f\n" % (DV[0], DV[1], DV[2], DV[3], DV[4]))
which should show something like
>> DV = 2152.179924 +69.092697 -57.551602 +164.343396 -121.202709
The next step would be to combine this constraint with measurements from the Cosmic Microwave Background (CMB), to get constraints on actual cosmological parameters. But for now, I will leave it here.

Thanks a lot for making it to the end of this series. If you are interested in going further, you should consider a few extensions to this analysis:
  • Have a look at density field reconstruction
  • Consider an anisotropic analysis to constrain $D_A(z)$ and $H(z)$ independently
  • Do you think the polynomial model we use is the best choice? Can you come up with a better model?
  • Use the simulated BOSS catalogues to test whether our procedure is robust
The final code for this analysis is available on GitHub.
cheers
Florian

Study the Universe with Python tutorial, part 4 -- minimizing $\chi^2$

This is the fourth blog post about how to use the Baryon Oscillation Spectroscopic Survey (BOSS) dataset to constrain cosmology. In the first three posts, we downloaded the data, calculated the power spectrum and obtained a covariance matrix.

The power spectrum we measured carries a huge amount of information about cosmological parameters, but most of that information cannot be accessed easily. The reason is that we cannot calculate reliable models of the power spectrum to compare to the data. In a previous post, we calculated models using the class code, but these models are only accurate on very large scales (small wavenumbers).

For that reason, we will focus here on one particular signal, which can be described with a very simple model. This signal is called Baryon Acoustic Oscillations (BAO). BAO in the galaxy distribution were detected for the first time in 2005 and since that first detection, BAO have become one of the most powerful cosmological probes. In fact, the BAO measurement is one of the main reasons why the BOSS dataset has been assembled.

The BAO signal shows up as an oscillation signal on top of the power spectrum. In a previous plot, we saw model power spectra for different amounts of dark matter
Figure 1: Model power spectrum for different amounts
of dark matter.

You can see the BAO signal as an oscillation. This is the signal we are after. The BAO signal is much more robust (easier to model) compared to the rest of the power spectrum and, therefore, we will use a model for the power spectrum where we get rid of all but the oscillation signal.

We can do that by fitting a power spectrum without a BAO signal to a power spectrum with a BAO signal. Fitting just means that we test many different power spectra and pick the one which matches best.

The nbodykit package provides power spectra without a BAO signal; all we have to do is call the modelling function we used earlier but add 'numerical_nowiggle': 'yes' to the parameter file. We can then extract the power spectrum without wiggles as
Pno_wiggle = np.array([cosmo.pk_numerical_nw(ki, z) for ki in k])
And if we plot such a power spectrum together with the models we had earlier, we get
Figure 2: Same power spectra as in Figure 1, but calculated without
BAO signal (dashed lines)

Comparing the dashed and solid lines in Figure 2 shows that these power spectra are the same except for the BAO wiggles, which have been removed.

With this we can now get a model for the BAO signal only
from scipy.interpolate import interp1d
def get_oscillation(krange, k, pk, pk_nw):
    ''' Get an oscillation-only power spectrum '''
    pk_interp = interp1d(k, pk)
    pk_nw_interp = interp1d(k, pk_nw)
    return pk_interp(krange)/pk_nw_interp(krange)
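As a usage example (assuming k, pk and pk_nw are the model power spectra with and without BAO from above), the output of get_oscillation() can be plotted along these lines:
import matplotlib.pyplot as plt

# k, pk, pk_nw: assumed model power spectra with and without BAO wiggles
krange = np.arange(0.001, 0.5, 0.001)
os_model = get_oscillation(krange, k, pk, pk_nw)
plt.plot(krange, os_model)
plt.xlabel("k [h/Mpc]")
plt.ylabel("P(k)/P_noBAO(k)")
plt.show()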
Here is a potential output of the get_oscillation() function
Figure 3: The galaxy power spectrum divided by the best fitting
no-BAO power spectrum. With this technique, we can isolate the
BAO feature.

Our procedure removed the entire broadband power spectrum and isolated the oscillation signature. Before we apply this to the data, we have to extend the model
def get_shifted_model(parameters, x, templates):
    ''' Calculate a model including a scaling parameter '''
    model = get_smooth_model(parameters, x, templates)
    return model*(1. + (templates['os_model'](x/parameters[6]) - 1.)*np.exp(-0.5*x**2*parameters[7]**2))
with
def get_smooth_model(parameters, x, templates):
    ''' Combine a noBAO model with polynomials and linear bias '''
    polynomials = parameters[1]/x**3 + parameters[2]/x**2 + parameters[3]/x + parameters[4] + parameters[5]*x
    return parameters[0]*parameters[0]*templates['noBAO'](x) + polynomials
Here we get the smooth model which we use to marginalise over the broadband power spectrum. We also introduce a smoothing scale (parameters[7]) and a scaling parameter (parameters[6]). The scaling parameter is the parameter we are mainly interested in, since it carries the wavelength of the oscillation signature and is the only parameter here which we will exploit for cosmological constraints.

Now let's apply this to the data. For that we need a metric which determines the goodness of fit. Here we use the $\chi^2$
def calc_chi2(parameters, data, templates, func):
    ''' Compares the model with the data '''
    if within_priors(parameters):
        chi2 = 0.
        # Loop over all datasets which are fit together
        for dataset in data:
            model = func(parameters, dataset['k'], templates)
            diff = (model - dataset['pk'])
            chi2 += np.dot(diff,np.dot(dataset['cov_inv'],diff))
        return chi2
    else:
        return 100000.
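The within_priors() function used above rejects parameter combinations outside broad prior ranges; the exact ranges live in the GitHub code, but a minimal sketch with hypothetical bounds could be:
def within_priors(parameters):
    ''' Test whether the parameters lie within (hypothetical) prior ranges '''
    # parameters = [b, A1, A2, A3, A4, A5, alpha, sigmaNL]
    if abs(parameters[6] - 1.) > 0.2:
        return False   # restrict the scaling parameter to 0.8 < alpha < 1.2
    if parameters[7] < 0. or parameters[7] > 20.:
        return False   # keep the damping scale sigmaNL positive and finite
    return True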
After storing the BOSS data as
    boss_data = [{ 'k': pk_south['k'][krange_mask], 'pk': P_south[krange_mask],\
     'label': 'BOSS south', 'cov': C_south, 'cov_inv': C_south_inv},\
     { 'k': pk_north['k'][krange_mask], 'pk': P_north[krange_mask],\
     'label': 'BOSS north', 'cov': C_north, 'cov_inv': C_north_inv }]
we can call the following function
import scipy.optimize as op
from scipy.interpolate import interp1d
def fit(boss_data):
    '''
    Find the best fitting model
    '''
    # get_Pk() here also returns the no-wiggle power spectrum
    # (see the 'numerical_nowiggle' extension discussed above)
    k, pk, pk_nw = get_Pk(0.3)
    pk_nw_interp = interp1d(k, pk_nw)
    krange = np.arange(0.001, 0.5, 0.001)
    os_model = get_oscillation(krange, k, pk, pk_nw)
    start = [3.13, 0.005, 11., -1384, 7681., -12612., 1., 10.]
    result = op.minimize(calc_chi2, start, args=(boss_data, { 'noBAO': pk_nw_interp, 'os_model': os_model }, get_shifted_model))
    print("result['x'] = ", result['x'])
    return result
which returns the best fitting parameters. If you plot the best fitting model against the data you should get something like
Figure 4: BOSS measurements (data points) together with the
best fitting model.
And the best fitting parameters are
result['x'] =  [ 3.13331263e+00  5.02073064e-03  1.12619733e+01 -1.38476373e+03
  7.68158635e+03 -1.26123063e+04  9.95567388e-01  1.07156063e+01]

The problem is that we have no uncertainties attached to these parameters; we only have the best fitting values. Without uncertainties, we cannot use this to obtain cosmological constraints. So the next blog post will extend this by not just finding the best fitting parameters, but also getting the uncertainties.

cheers
Florian

Wednesday, October 11, 2017

Study the Universe with Python tutorial, part 3 -- covariance matrix



In the last two blog posts, we discussed how to download the galaxy catalogues of the Baryon Oscillation Spectroscopic Survey (BOSS) and calculate the power spectrum of this dataset. In this blog post, we will calculate uncertainties on the power spectrum measurement.

Uncertainties can be obtained in many different ways. Here we will use simulations of the BOSS dataset. Members of the BOSS collaboration have used supercomputers to simulate what a Universe could look like and they have generated thousands of such simulated BOSS datasets. All we have to do is to calculate the power spectrum of these thousands of simulations and measure the variance between them.

We don't just calculate the uncertainty; we also calculate the correlation between different data points. All the information we need is contained in the covariance matrix, which is defined as
\[C(k_i,k_j) = \begin{pmatrix} \sigma^2_{k_1} & cor(k_1,k_2)\sigma_{k_1}\sigma_{k_2} & \dots & cor(k_1,k_n)\sigma_{k_1}\sigma_{k_n} \\
cor(k_2,k_1)\sigma_{k_2}\sigma_{k_1} & \sigma^2_{k_2} & \dots & cor(k_2,k_n)\sigma_{k_2}\sigma_{k_n} \\
\vdots & \vdots & \ddots & \vdots \\ cor(k_n,k_1)\sigma_{k_n}\sigma_{k_1} & cor(k_n,k_2)\sigma_{k_n}\sigma_{k_2} & \dots & \sigma^2_{k_n} \end{pmatrix}\]Here $\sigma_{k_i}$ is the standard deviation of the power spectrum $P(k_i)$ at wavenumber $k_i$. The off-diagonal elements of this matrix contain the uncertainties on a data point contributed by neighbouring data points, which are proportional to the correlation $cor(k_i,k_j)$ between the two data points.

First, we download the simulated BOSS catalogues together with the corresponding random catalogue. Since there are so many of them, this is quite a large file
wget -N https://data.sdss.org/sas/dr12/boss/lss/dr12_multidark_patchy_mocks/Patchy-Mocks-DR12SGC-COMPSAM_V6C.tar.gz -P path/to/folder/
wget -N https://data.sdss.org/sas/dr12/boss/lss/dr12_multidark_patchy_mocks/Patchy-Mocks-DR12NGC-COMPSAM_V6C.tar.gz -P path/to/folder/
wget -N https://data.sdss.org/sas/dr12/boss/lss/dr12_multidark_patchy_mocks/Patchy-Mocks-Randoms-DR12SGC-COMPSAM_V6C_x50.tar.gz -P path/to/folder/
wget -N https://data.sdss.org/sas/dr12/boss/lss/dr12_multidark_patchy_mocks/Patchy-Mocks-Randoms-DR12NGC-COMPSAM_V6C_x50.tar.gz -P path/to/folder/
All 2048 data catalogues can be paired with one random catalogue to measure the power spectrum of the simulated catalogues. Note that you can download the power spectrum from my website if you don't want to calculate it yourself or if you run into disk capacity problems. To read these files you can use the read_power() function provided here.
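If you prefer to write your own reader, read_power() only needs to return the wavenumbers, the monopole and the shot noise in a dictionary; a hypothetical sketch (the actual column layout of the files may differ, so check them first) is:
def read_power(pk_file):
    ''' Read a power spectrum file into a dictionary (column order assumed, adjust to your files) '''
    raw = np.loadtxt(pk_file)
    return {
        'k': raw[:, 0],             # wavenumber bin centre (assumed first column)
        'pk0': raw[:, 1],           # power spectrum monopole (assumed second column)
        'SN(data+ran)': raw[:, 2],  # shot-noise estimate (assumed third column)
    }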

Using these power spectra, we calculate the covariance matrix with the calc_cov() function, which simply reads the power spectra, subtracts the shot noise, limits the wavenumber range to $k < k_{max} = 0.3h/$Mpc and makes use of the numpy.cov() function.
def calc_cov(N, tag='NGC'):
    ''' Read simulation power spectra and return covariance matrix '''
    list_of_pks = []
    for i in range(1, N + 1): # the simulated catalogues are numbered starting at 1
        pk_file = "./Patchy_V6C_BOSS_DR12_%s_z3_1/ps1D_patchy_%s_z3_COMPnbar_TSC_V6C_%d_0.5_0.75_700_700_700_400_renorm.dat" % (tag, tag, i)
        pk = read_power(pk_file)
        P = pk['pk0'] - pk['SN(data+ran)']
        # Limit the k range
        k = pk['k']
        krange_mask = (k > 0.01) & (k < 0.3)
        P = P[krange_mask]
        list_of_pks.append(P)
    return np.cov(np.vstack(list_of_pks).T)
We can turn the covariance matrix into a correlation matrix
\[
R_{ij} = \frac{C_{ij}}{\sqrt{C_{ii}C_{jj}}}
\] and plot it as
from statsmodels.stats.moment_helpers import cov2corr
def plot_cov(matrix):
    ''' Plot the correlation matrix derived from the covariance matrix '''
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    plt.imshow(cov2corr(matrix))
    plt.colorbar()
    plt.show()
    return 
which should look like this
Figure 1: Correlation matrix, derived from the covariance
matrix of the northern part of BOSS using 1000 simulated
catalogues.
Figure 1 shows that there is some correlation between the power spectrum bins close to each other, while some bins which are far apart are anti-correlated.

The diagonal elements of the covariance matrix represent the variance of the power spectrum as if the different data points were uncorrelated
\[
\sigma_{P(k_i)} = \sqrt{C(k_i,k_i)}.
\]
Using these uncertainties, we can plot the BOSS measurements with error bars
Figure 2: The BOSS power spectrum measurements including
the diagonal elements of the covariance matrix as error bars.
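A minimal sketch of such an error-bar plot, assuming k and P are the measured wavenumbers and power spectrum from the last post (restricted to the same k range) and cov is the matching covariance matrix from calc_cov():
import matplotlib.pyplot as plt

def plot_pk_with_errors(k, P, cov, label):
    ''' Plot a power spectrum with the diagonal of the covariance matrix as error bars '''
    sigma = np.sqrt(np.diag(cov))   # diagonal elements -> 1 sigma uncertainties
    plt.errorbar(k, P, yerr=sigma, fmt='o', label=label)
    plt.xlabel("k [h/Mpc]")
    plt.ylabel("P(k) [(Mpc/h)$^3$]")
    plt.legend(loc=0)
    plt.show()
    return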

With this, we now have the power spectra and covariance matrix, which is all we need to start using this dataset to constrain cosmological parameters. We will start with that in the next blog post.
You can find the code for this project on GitHub. Let me know if you have any questions/comments below.
cheers
Florian

Monday, October 9, 2017

Study the Universe with Python tutorial, part 2 -- power spectrum

In the first part of this series we discussed how to download the galaxy catalogue of the Baryon Oscillation Spectroscopic Survey (BOSS). We also made some plots to show the distribution of galaxies. In this blog post, we will calculate a summary statistic of the dataset, which is the first step to use the data to constrain cosmological parameters, like dark matter and dark energy.

As we saw in the last blog post, the BOSS dataset contains about 1 million galaxies and their distribution in the Universe. The positions of these galaxies are what carries the cosmological information. Imagine there were more dark matter in the Universe. The additional matter would exert a stronger gravitational pull, clumping the material together more strongly. On the other hand, with less dark matter, the galaxies would be more spread out.

We can now compare such hypothetical distributions with the data from the BOSS survey and, depending on which distribution looks more like the data, determine how much dark matter there is in the Universe.

In practice, we do not want to compare the actual distribution of galaxies, but instead we compare a summary statistic, meaning a compression of these galaxies, which carries all the information we need. There are many choices for such a summary statistic, but here we will use the power spectrum.

To calculate the power spectrum, we first need to download two more catalogues
wget -N https://data.sdss.org/sas/dr12/boss/lss/random0_DR12v5_CMASSLOWZTOT_North.fits.gz -P path/to/folder/
wget -N https://data.sdss.org/sas/dr12/boss/lss/random0_DR12v5_CMASSLOWZTOT_South.fits.gz -P path/to/folder/
which contain a random distribution of points. These distributions are needed to calibrate the data catalogues.

To read these files, you can use the read_data() function we defined earlier (the random catalogues are significantly larger and can take a while to read into memory). Now we can combine each data catalogue with its random catalogue, assign the point distributions to a 3D grid and calculate the power spectrum, all using triumvirate.
from astropy import units as u
from astropy.coordinates import Distance
from astropy.cosmology import Planck15

from triumvirate.parameters import fetch_paramset_template
from triumvirate.catalogue import ParticleCatalogue
from triumvirate.twopt import compute_powspec
def measure_pk(cat_data, cat_random, tag=''):
    """
    Calculate the power spectrum monopole P0(k) using Triumvirate.
    Assumes cat_data and cat_random were read with read_data() that:
      - converts columns to native-endian
      - promotes RA/DEC/Z/NZ/WEIGHT_* to float64
    """
    # Parameters
    paramset = fetch_paramset_template('dict')
    paramset['boxsize']['x'] = 5000.0
    paramset['boxsize']['y'] = 10000.0
    paramset['boxsize']['z'] = 5000.0
    paramset['ngrid']['x']   = 512
    paramset['ngrid']['y']   = 1024
    paramset['ngrid']['z']   = 512 
    paramset['tags']['output']  = tag
    paramset['degrees']['ELL']  = 0  
    paramset.update({
        'range': [0.005, 0.295],
        'num_bins': 29,
        'norm_convention': 'particle',
        'statistic_type': 'powspec',
    })

    # Data catalogue: spherical -> Cartesian (Mpc)
    r1_mpc = Planck15.comoving_distance(cat_data['Z']).to_value(u.Mpc)
    r1_mpc_h = r1_mpc * Planck15.h  # now in Mpc/h
    ra1  = np.deg2rad(cat_data['RA'])
    dec1 = np.deg2rad(cat_data['DEC'])

    x1 = r1_mpc_h * np.cos(dec1) * np.cos(ra1)
    y1 = r1_mpc_h * np.cos(dec1) * np.sin(ra1)
    z1 = r1_mpc_h * np.sin(dec1)

    ws1 = cat_data['WEIGHT_SYSTOT'] * (cat_data['WEIGHT_NOZ'] + cat_data['WEIGHT_CP'] - 1.0)
    wc1 = cat_data['WEIGHT_FKP']
    nz1 = cat_data['NZ']

    catalogue_data = ParticleCatalogue(x1, y1, z1, nz=nz1, ws=ws1, wc=wc1)

    # Random catalogue
    r2_mpc = Planck15.comoving_distance(cat_random['Z']).to_value(u.Mpc)
    r2_mpc_h = r2_mpc * Planck15.h  # now in Mpc/h
    ra2  = np.deg2rad(cat_random['RA'])
    dec2 = np.deg2rad(cat_random['DEC'])

    x2 = r2_mpc_h * np.cos(dec2) * np.cos(ra2)
    y2 = r2_mpc_h * np.cos(dec2) * np.sin(ra2)
    z2 = r2_mpc_h * np.sin(dec2)

    wc2 = cat_random['WEIGHT_FKP']
    nz2 = cat_random['NZ']

    catalogue_rand = ParticleCatalogue(x2, y2, z2, nz=nz2, wc=wc2)

    # Compute power spectrum (monopole)
    results = compute_powspec(
        catalogue_data,
        catalogue_rand,
        degree=0,
        paramset=paramset,
        save='.txt'  # or a filename/path you prefer
    )
    return results
Here we use a grid with 512 grid points in two dimensions and 1024 in one dimension. There are also two weightings included in this calculation: a completeness weight built from the 'WEIGHT_SYSTOT', 'WEIGHT_NOZ' and 'WEIGHT_CP' columns, and the 'WEIGHT_FKP' column. The first weight corrects for incompleteness in the data catalogue, while the second weight optimises the signal-to-noise.

If we use this code to measure the BOSS power spectrum for the northern and southern part we get
Figure 3: Power spectrum measurements for the northern (orange)
and southern (blue) parts of the BOSS dataset.

Here we see that the power spectra of the northern and southern parts are slightly different. Whether these differences are significant is not clear though, since we don't have uncertainties on these measurements (yet).

To extract cosmological information from this power spectrum we need to get a model. We use the classy package
from classy import Class
def get_Pk(Omega_m = 0.3):

    #Start by specifying the cosmology
    Omega_b = 0.05
    Omega_cdm = Omega_m - Omega_b
    h = 0.7 #H0/100
    A_s = 2.1e-9
    n_s = 0.96

    #Create a params dictionary
    #Need to specify the max wavenumber
    k_max = 10 #UNITS: 1/Mpc

    params = {
                 'output':'mPk',
                 'non linear':'halofit',
                 'Omega_b':Omega_b,
                 'Omega_cdm':Omega_cdm,
                 'h':h,
                 'A_s':A_s,
                 'n_s':n_s,
                 'P_k_max_1/Mpc':k_max,
                 'z_max_pk':10. #Default value is 10
    }

    #Initialize the cosmology and compute everything
    cosmo = Class()
    cosmo.set(params)
    cosmo.compute()

    #Specify k and z
    k = np.logspace(-5, np.log10(k_max), num=1000) #Mpc^-1
    z = 1.

    #Call these for the nonlinear and linear matter power spectra
    Pnonlin = np.array([cosmo.pk(ki, z) for ki in k])
    Plin = np.array([cosmo.pk_lin(ki, z) for ki in k])

    #NOTE: You will need to convert these to h/Mpc and (Mpc/h)^3
    #to use in the toolkit. To do this you would do:
    k /= h
    Plin *= h**3
    Pnonlin *= h**3
    return k, Plin
We can now plot the power spectrum for different values of cold dark matter density $\Omega_{cdm}$
Figure 4: Power spectrum models with different amounts of dark
matter using the class Boltzmann code.
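A sketch of how such a comparison could be plotted with the get_Pk() function above, looping over a few values of $\Omega_m$ (and therefore of the cold dark matter density):
import matplotlib.pyplot as plt

for Omega_m in [0.2, 0.3, 0.4]:
    k, Plin = get_Pk(Omega_m)
    plt.loglog(k, Plin, label=r"$\Omega_m = %.1f$" % Omega_m)
plt.xlabel("k [h/Mpc]")
plt.ylabel(r"P(k) [(Mpc/h)$^3$]")
plt.legend(loc=0)
plt.show()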

Comparing the model with the data can allow us to determine the amount of dark matter in the Universe.

However, before we can go ahead and constrain cosmological parameters we have to get an estimate of the uncertainty on the measurements. That will be the subject of the next post.
You can find the code for this project on GitHub.
cheers
Florian

Saturday, October 7, 2017

Study the Universe with Python tutorial, part 1 -- the BOSS dataset

In the next few blog posts, I will introduce a dataset of more than 1 million galaxies and I will show how easy it is to use this dataset to test cosmological models using Python. This tutorial is intended for non-experts or anybody who wants to get their hands on a very exciting dataset. If you are looking for an interesting Python project, look no further.

First, we need to download the data. We are going to use the dataset produced by the Baryon Oscillation Spectroscopic Survey (BOSS) collaboration, which represents the largest dataset of its kind to date. You can download all necessary data with the following two wget commands; just provide the destination path at the end. This will download a bit less than 3 GB of zipped data, which turns into 8 GB after you unzip it.
wget -N https://data.sdss.org/sas/dr12/boss/lss/galaxy_DR12v5_CMASSLOWZTOT_North.fits.gz -P path/to/folder/
wget -N https://data.sdss.org/sas/dr12/boss/lss/galaxy_DR12v5_CMASSLOWZTOT_South.fits.gz -P path/to/folder/
For this project, we need the Python package triumvirate. Please follow the install instructions on the linked page.
Now we are ready to go.

We start by reading in the data catalogue using the astropy package
from astropy.io import fits
from astropy.table import Table
def read_data(filename):
    """
    Read a FITS catalogue and:
      - convert all columns to native-endian (NumPy 2.x safe)
      - make RA, DEC, Z, NZ, and WEIGHT_* columns float64
      - ensure columns are contiguous
      - apply the redshift mask 0.5 < Z < 0.75
    Returns:
      filtered_table (Astropy Table), col_names (list)
    """
    with fits.open(filename, memmap=False) as hdul:
        data = hdul[1].data  # FITS_rec
        t = Table(data)      # work in an Astropy Table for easy column ops

    # 1) Make all columns native-endian and contiguous
    for name in t.colnames:
        col = np.asanyarray(t[name])
        dt = col.dtype
        if dt.byteorder not in ('=', '|'):
            # Convert big-endian to native-endian (NumPy 2.x compatible)
            native_dt = dt.newbyteorder('=')
            col = col.byteswap().view(native_dt)
        col = np.ascontiguousarray(col)
        t[name] = col

    # 2) Promote Triumvirate-relevant columns to float64 ("double")
    to_float64 = ['RA', 'DEC', 'Z', 'NZ', 'WEIGHT_SYSTOT', 'WEIGHT_NOZ', 'WEIGHT_CP', 'WEIGHT_FKP']
    for name in to_float64:
        if name in t.colnames:
            t[name] = np.ascontiguousarray(np.asarray(t[name], dtype=np.float64))

    # 3) Apply redshift mask
    if 'Z' not in t.colnames:
        raise KeyError("Column 'Z' not found in file.")
    redshift_mask = (t['Z'] > 0.5) & (t['Z'] < 0.75)
    t = t[redshift_mask]

    print("col_names =", t.colnames)
    print(len(t['Z']))

    return t, t.colnames
This function prints out the column headers, which should look like this

col_names =  ['RA', 'DEC', 'RUN', 'RERUN', 'CAMCOL', 'FIELD', 'ID', 'ICHUNK', 'IPOLY', 'ISECT', 'FRACPSF', 'EXPFLUX', 'DEVFLUX', 'PSFFLUX', 'MODELFLUX', 'FIBER2FLUX', 'R_DEV', 'EXTINCTION', 'PSF_FWHM', 'AIRMASS', 'SKYFLUX', 'EB_MINUS_V', 'IMAGE_DEPTH', 'IMATCH', 'Z', 'WEIGHT_FKP', 'WEIGHT_CP', 'WEIGHT_NOZ', 'WEIGHT_STAR', 'WEIGHT_SEEING', 'WEIGHT_SYSTOT', 'NZ', 'COMP', 'PLATE', 'FIBERID', 'MJD', 'FINALN', 'TILE', 'SPECTILE', 'ICOLLIDED', 'INGROUP', 'MULTGROUP']


In this tutorial, we will use only a subset of the columns given here. Namely, we will only use the positions of the galaxies given by the angles on the sky, 'RA' (right ascension) and 'DEC' (declination), as well as the redshift 'Z'. The redshift describes how much the light spectrum of the galaxy has been shifted in wavelength due to the expansion of the Universe. Most of the galaxies in this dataset are several billion light-years away from us, meaning the light travelled for a significant fraction of the age of the Universe before reaching us. Galaxies which are further away have a larger redshift, and we will use the redshift later, together with the right ascension and declination, to get a 3D position for all galaxies.

Let's explore the dataset a bit before we go any further. We can read in the data
    filename = 'galaxy_DR12v5_CMASSLOWZTOT_South.fits'
    data_south, _ = read_data(filename)
    filename = 'galaxy_DR12v5_CMASSLOWZTOT_North.fits'
    data_north, _ = read_data(filename)
    angular_plot(data_south, data_north)
and plot the angular distribution (right ascension and declination) with
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
def angular_plot(cat1, cat2):
    ''' Plot the catalogues '''
    cat1['RA'] -= 180
    c1 = down_sample([cat1['RA'], cat1['DEC']], 2000)
    cat2['RA'] -= 180
    c2 = down_sample([cat2['RA'], cat2['DEC']], 2000)

    plt.clf()
    plt.subplot(projection="aitoff")
    plt.title("BOSS DR12 survey footprint", y=1.1)
    plt.plot(np.radians(c1[0]), np.radians(c1[1]), '.')
    plt.plot(np.radians(c2[0]), np.radians(c2[1]), '.')
    plt.grid(True)
    plt.show()
    return 
Here we read in the two data catalogues, down-sample these catalogues (scatter plots with 1 million points will be difficult to handle), and convert the angles from degrees to radians. The down_sample() function is defined as
import random
def down_sample(in_list, N):
    n = len(in_list[0])
    if N <= n:
        idx = random.sample(range(n), N)           # no replacement
    else:
        idx = random.choices(range(n), k=N)        # with replacement

    out_list = []
    for cat in in_list:
        out_list.append([cat[i] for i in idx])
    return out_list
We make a plot by projecting the data onto a sphere using Aitoff projection. 
Figure 1: The sky distribution of galaxies in the BOSS galaxy survey using Aitoff projection.
Only 1000 points for each region are plotted.
This plot shows the sky coverage of the dataset. The galaxy survey observed in two patches on the sky, which explains why we have two files. We will call them north and south from now on even though east and west might be more intuitive from this figure. However, this is just because of the Aitoff projection. In fact, the plane of the Milky Way galaxy does go right in between the two patches, so that the southern part (blue) is indeed south of the Milky Way and the northern part (yellow) is north of it.

The northern part has about twice as many galaxies as the southern part and in total this dataset covers about 1 quarter of the sky.

We can also plot the redshift distribution
def redshift_plot(cat1, cat2):
    ''' Plot the catalogues '''
    bins = np.arange(0., 0.9, 0.01)
    weight = cat1['WEIGHT_SYSTOT'] * (cat1['WEIGHT_NOZ'] + cat1['WEIGHT_CP'] - 1.0)
    plt.hist(cat1['Z'], weights=weight, bins=bins, alpha=0.4, label='BOSS South')
    weight = cat2['WEIGHT_SYSTOT'] * (cat2['WEIGHT_NOZ'] + cat2['WEIGHT_CP'] - 1.0)
    plt.hist(cat2['Z'], weights=weight, bins=bins, alpha=0.4, label='BOSS North')
    plt.xlabel("redshift")
    plt.legend(loc=0)
    plt.show()
    return 
which looks like this
Figure 2: Redshift distribution of the BOSS galaxy sample. Redshift
0.7 corresponds to about 6 000 000 000 (6 billion) light-years. So when the light from
these galaxies was sent out, the Universe was only about 7 billion years old, 
while now it is about 13.8 billion years old.
The next step is to turn the RA, DEC, z system into Cartesian coordinates. However, this transformation depends on the cosmological model. Using the current standard model of cosmology (also called $\Lambda$CDM) we can set up a cosmological framework by fixing a small set of parameters
(1) The amount of baryonic matter ($\Omega_b$)
(2) The amount of cold dark matter ($\Omega_{cdm}$)
(3) The Hubble parameter $h$ (or more precisely the Hubble constant divided by 100, $h = H_0/100$)
In principle, there are more free parameters, but these are the ones that matter to this analysis. Instead of fixing those parameters individually, we will use the Planck cosmology published in 2015
from mpl_toolkits import mplot3d
from astropy import units as u
from astropy.coordinates import SkyCoord, Distance
from astropy.cosmology import Planck15
def threeD_plot(cat1, cat2):
    # Get cartesian coordinates using the Planck 2015 cosmology
    c1 = down_sample([cat1['Z'], cat1['RA'], cat1['DEC']], 2000)
    dist1 = Planck15.comoving_distance(c1[0]).to_value(u.Mpc)
    dist1_h = dist1 * Planck15.h  # now in Mpc/h
    catalog1 = {}
    catalog1['x'] = dist1_h * np.cos(np.radians(c1[2])) * np.cos(np.radians(c1[1]))
    catalog1['y'] = dist1_h * np.cos(np.radians(c1[2])) * np.sin(np.radians(c1[1]))
    catalog1['z'] = dist1_h * np.sin(np.radians(c1[2])) 

    c2 = down_sample([cat2['Z'], cat2['RA'], cat2['DEC']], 2000)
    dist2 = Planck15.comoving_distance(c2[0]).to_value(u.Mpc)
    dist2_h = dist2 * Planck15.h  # now in Mpc/h
    catalog2 = {}
    catalog2['x'] = dist2_h * np.cos(np.radians(c2[2])) * np.cos(np.radians(c2[1]))
    catalog2['y'] = dist2_h * np.cos(np.radians(c2[2])) * np.sin(np.radians(c2[1]))
    catalog2['z'] = dist2_h * np.sin(np.radians(c2[2]))

    fig = plt.figure()
    ax = plt.axes(projection='3d')
    ax.scatter(catalog1['x'], catalog1['y'], catalog1['z'])
    ax.scatter(catalog2['x'], catalog2['y'], catalog2['z'])
    plt.show()
    return
The $\Lambda$CDM cosmology here assumes a flat universe with no curvature, which means that the dark energy parameter is automatically fixed through
\[
\Omega_{\Lambda} = 1 - \Omega_{cdm} - \Omega_b.
\]These parameters constrain the expansion history of the Universe, including its size and age. The dataset we just read with our Python script is powerful enough to constrain these parameters and, therefore, teach us something about dark matter and dark energy. We will get to that in a later post.
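For reference, the fiducial parameter values of the Planck 2015 cosmology can be inspected directly as attributes of astropy's Planck15 object:
from astropy.cosmology import Planck15

print("h =", Planck15.h)               # dimensionless Hubble parameter
print("Omega_b =", Planck15.Ob0)       # baryon density today
print("Omega_cdm =", Planck15.Odm0)    # cold dark matter density today
print("Omega_Lambda =", Planck15.Ode0) # dark energy density today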

The code above converts the angles (RA, DEC) together with the redshift to a 3D position for the galaxies producing a plot which should look like this
Figure 3: 3D distribution of BOSS galaxies. Only 1000 randomly selected
galaxies out of more than 1 million in this dataset are included in this plot. 
The Milky Way is located in the middle, and the galaxies in the northern (yellow)
and southern (blue) part are observed in two opposite directions. 
And this concludes this first look at the BOSS dataset. This dataset was released in 2016 and represents cutting edge cosmological research. How to use this dataset to constrain dark matter and dark energy will be the subject of the next blog posts.

The entire Python code for this project is available on GitHub.
cheers
Florian