My statistics blog: Study the Universe with Python tutorial, part 1 -- the BOSS dataset

In the next few blog posts, I will introduce a dataset of more than 1 million galaxies and I will show how easy it is to use this dataset to test cosmological models using Python. This tutorial is intended for non-experts or anybody who wants to get their hands on a very exciting dataset. If you are looking for an interesting Python project, look no further.

First, we need to download the data. We are going to use the dataset produced by the Baryon Oscillation Spectroscopic Survey (BOSS) collaboration which represents the largest dataset of its kind to date. You can download all necessary data with the following 2 wget commands, just provide the destination path at the end. This will download a bit less than 3Gb of zipped data which turns into 8Gb after you unzip it.

wget -N https://data.sdss.org/sas/dr12/boss/lss/galaxy_DR12v5_CMASSLOWZTOT_North.fits.gz -P path/to/folder/
wget -N https://data.sdss.org/sas/dr12/boss/lss/galaxy_DR12v5_CMASSLOWZTOT_South.fits.gz -P path/to/folder/

For this project, we need the Python package nbodykit, which can be installed via conda

conda create --name nbodykit-env python=3
source activate nbodykit-env 
conda install -c bccp nbodykit

or simply use pip

pip install nbodykit

Now we are ready to go.

We start by reading in the data catalog using the fitsio package, which is included in nbodykit

import nbodykit.lab
def read_data(filename):
    ''' Here we read the data catalogues '''
    galaxies_cat = nbodykit.lab.FITSCatalog(os.path.join(base_dir, filename))
    print('galaxies_cat.columns = ', galaxies_cat.columns)
    galaxies_cat = galaxies_cat[(galaxies_cat['Z'] > 0.01) & (galaxies_cat['Z'] < 0.9)]
    galaxies_cat['Weight'] = galaxies_cat['WEIGHT_SYSTOT'] * (galaxies_cat['WEIGHT_NOZ'] + galaxies_cat['WEIGHT_CP'] - 1.0)
    return galaxies_cat

Here we included the nbodykit.lab functions, one of which is the FITSCatalog() function, which reads all columns in the fits file and stores them in a dask array. With galaxies_cat.columns you can print out the column header of the file, which should look like this

galaxies_cat.columns = ['AIRMASS', 'CAMCOL', 'COMP', 'DEC', 'DEVFLUX', 'EB_MINUS_V', 'EXPFLUX', 'EXTINCTION', 'FIBER2FLUX', 'FIBERID', 'FIELD', 'FINALN', 'FRACPSF', 'ICHUNK', 'ICOLLIDED', 'ID', 'IMAGE_DEPTH', 'IMATCH', 'INGROUP', 'IPOLY', 'ISECT', 'MJD', 'MODELFLUX', 'MULTGROUP', 'NZ', 'PLATE', 'PSFFLUX', 'PSF_FWHM', 'RA', 'RERUN', 'RUN', 'R_DEV', 'SKYFLUX', 'SPECTILE', 'Selection', 'TILE', 'Value', 'WEIGHT_CP', 'WEIGHT_FKP', 'WEIGHT_NOZ', 'WEIGHT_SEEING', 'WEIGHT_STAR', 'WEIGHT_SYSTOT', 'Weight', 'Z']

In this tutorial, we will use only a subset of the columns given here. Namely, we will only use the positions of the galaxies given by the angles on the sky in 'RA' (right ascension) and 'DEC' (declination) as well as the redshift 'Z'. The redshift describes how much the light spectrum of the galaxy has been shifted in wavelength due to the expansion of the Universe. Most of the galaxies in this dataset are several billion light-years away from us, meaning the light travelled a significant fraction of the existence of the Universe before reaching us. Galaxies which are further away have a larger redshift and we will use the redshift later together with the right ascension and declination to get a 3D position for all galaxies.

Let's explore the dataset a bit before we go any further. We can read in the data

filename = 'galaxy_DR12v5_CMASSLOWZTOT_South.fits'
data_south = read_data(filename)
filename = 'galaxy_DR12v5_CMASSLOWZTOT_North.fits'
data_north = read_data(filename)
plot_data(data_south, data_north)

and plot the right ascension and declination

def plot_data(cat1, cat2):
    ''' Plot the catalogues '''
    cat1 = down_sample(cat1, 1000)
    cat1['RA'] -= 180
    cat2 = down_sample(cat2, 1000)
    cat2['RA'] -= 180

    plt.clf()
    plt.subplot(projection="aitoff")
    plt.title("BOSS DR12 survey footprint", y=1.1)
    plt.plot(np.radians(cat1['RA'].compute()), np.radians(cat1['DEC'].compute()), '.')
    plt.plot(np.radians(cat2['RA'].compute()), np.radians(cat2['DEC'].compute()), '.')
    plt.grid(True)
    plt.show()

Here we read in the two data catalogues, down-sample these catalogues (scatter plots with 1 million points will be difficult to handle) and convert the angles from degrees to radians. The down_sample() function is defined as

def down_sample(cat, N):
    x = np.random.uniform(size=cat.size)
    return cat[x < float(N) / cat.size]

We make a plot by projecting the data onto a sphere using Aitoff projection.

Figure 1: The sky distribution of galaxies in the BOSS galaxy survey using Aitoff projection.

Only 1000 points for each region are plotted.

This plot shows the sky coverage of the dataset. The galaxy survey observed in two patches on the sky, which explains why we have two files. We will call them north and south from now on even though east and west might be more intuitive from this figure. However, this is just because of the aitoff projection. In fact, the plane of the Milky Way galaxy does go right in between the two patches, so that the southern part (blue) is indeed south of the Milky Way and the northern part (yellow) is north of it.

The northern part has about twice as many galaxies as the southern part and in total this dataset covers about 1 quarter of the sky.

We can also plot the redshift distribution

bins = np.arange(0., 0.9, 0.01)
plt.hist(cat1['Z'].compute(), weights=cat1['Weight'].compute(), bins=bins, alpha=0.4, label='BOSS South')
plt.hist(cat2['Z'].compute(), weights=cat2['Weight'].compute(), bins=bins, alpha=0.4, label='BOSS North')
plt.xlabel("redshift")
plt.legend(loc=0)
plt.show()

which looks like this

Figure 2: Redshift distribution of the BOSS galaxy sample. Redshift

0.7 corresponds to about 6 000 000 000 (6 billion) light-years. So when the light from

these galaxies was sent out the Universe was only about 7 billion years old,
while now it is 13 billion years old.

The next step is to turn the RA, DEC, z system into Cartesian coordinates. However, this transformation depends on the cosmological model. Using the current standard model of cosmology (also called

$\Lambda$ CDM) we can set up a cosmological framework by fixing a small set of parameters
(1) The amount of baryonic matter (Omega0_b)
(2) The amount of cold dark matter (Omega0_cdm)

(3) The Hubble parameter h (or more precisely the Hubble parameter divided by 100, h = H/100)

In principle, there are more free parameters, but these are the ones that matter to this analysis. Using nbodykit we can set a cosmology with

cosmo = nbodykit.lab.cosmology.Cosmology(Omega0_cdm=0.265, Omega0_b=0.045, h=0.7)

Here we also specified that we want to use a flat cosmology, which has no curvature which means that the dark energy parameter is automatically fixed through

$\Omega_{DarkEnergy} = 1 - \Omega_{cdm} + \Omega_b.$ These parameters constrain the expansion history of the Universe, including its size and age. The dataset we just read with our Python script is powerful enough to constrain these parameters and therefore teach us something about dark matter and dark energy. We will get to that in a later post.

Here we want to convert the angles (RA, DEC) together with the redshift to a 3D position for the galaxies. To do this we can just call a function provided by nobodykit

galaxies_cat_1['Position'] = nbodykit.transform.SkyToCartesian(galaxies_cat_1['RA'],\
    galaxies_cat_1['DEC'], galaxies_cat_1['Z'], cosmo=cosmo)
galaxies_cat_2['Position'] = nbodykit.transform.SkyToCartesian(galaxies_cat_2['RA'],\
    galaxies_cat_2['DEC'], galaxies_cat_2['Z'], cosmo=cosmo)

with this, we can now make a 3D plot of all galaxies

Figure 3: 3D distribution of BOSS galaxies. Only 1000 randomly selected

galaxies out of more than 1 million in this dataset are included in this plot.

Milky Way is located in the middle and the galaxies in the northern (yellow)

and southern (blue) part are observed in two opposite directions.

And this concludes this first look at the BOSS dataset. This dataset has only been released last year (2016) and represents cutting edge cosmological research. How to use this dataset to constrain dark matter and dark energy will be the subject of the next blog posts.

The entire Python code for this project is available on GitHub.

cheers
Florian

2 comments:

AnonymousNovember 14, 2017 at 10:01 AM
Just a small typo, "from nbodykit.lab" should probably be "import nbodykit.lab". Also for some strange reason, something in nbodykit is raising strange errors if you try to run this in a Jupyter notebook. Do you understand why?
RafaelDecember 5, 2019 at 8:49 AM
Huge Thanks for this site, it is very useful to the begin the studies with cosmology.

Saturday, October 7, 2017

Study the Universe with Python tutorial, part 1 -- the BOSS dataset

2 comments: