Saturday, October 7, 2017

Study the Universe with Python tutorial, part 1 -- the BOSS dataset

In the next few blog posts, I will introduce a dataset of more than 1 million galaxies and I will show how easy it is to use this dataset to test cosmological models using Python. This tutorial is intended for non-experts or anybody who wants to get their hands on a very exciting dataset. If you are looking for an interesting Python project, look no further.

First, we need to download the data. We are going to use the dataset produced by the Baryon Oscillation Spectroscopic Survey (BOSS) collaboration, which represents the largest dataset of its kind to date. You can download all necessary data with the following two wget commands; just provide the destination path after the -P flag. This will download a bit less than 3 GB of compressed data, which turns into about 8 GB after you unzip it.
wget -N https://data.sdss.org/sas/dr12/boss/lss/galaxy_DR12v5_CMASSLOWZTOT_North.fits.gz -P path/to/folder/
wget -N https://data.sdss.org/sas/dr12/boss/lss/galaxy_DR12v5_CMASSLOWZTOT_South.fits.gz -P path/to/folder/
For this project, we need the Python package triumvirate. Please follow the install instructions on the linked page.
Now we are ready to go.

We start by reading in the data catalogue using the astropy package
import numpy as np
from astropy.io import fits
from astropy.table import Table
def read_data(filename):
    """
    Read a FITS catalogue and:
      - convert all columns to native-endian (NumPy 2.x safe)
      - make RA, DEC, Z, NZ, and WEIGHT_* columns float64
      - ensure columns are contiguous
      - apply the redshift mask 0.5 < Z < 0.75
    Returns:
      filtered_table (Astropy Table), col_names (list)
    """
    with fits.open(filename, memmap=False) as hdul:
        data = hdul[1].data  # FITS_rec
        t = Table(data)      # work in an Astropy Table for easy column ops

    # 1) Make all columns native-endian and contiguous
    for name in t.colnames:
        col = np.asanyarray(t[name])
        dt = col.dtype
        if dt.byteorder not in ('=', '|'):
            # Convert big-endian to native-endian (NumPy 2.x compatible)
            native_dt = dt.newbyteorder('=')
            col = col.byteswap().view(native_dt)
        col = np.ascontiguousarray(col)
        t[name] = col

    # 2) Promote Triumvirate-relevant columns to float64 ("double")
    to_float64 = ['RA', 'DEC', 'Z', 'NZ', 'WEIGHT_SYSTOT', 'WEIGHT_NOZ', 'WEIGHT_CP', 'WEIGHT_FKP']
    for name in to_float64:
        if name in t.colnames:
            t[name] = np.ascontiguousarray(np.asarray(t[name], dtype=np.float64))

    # 3) Apply redshift mask
    if 'Z' not in t.colnames:
        raise KeyError("Column 'Z' not found in file.")
    redshift_mask = (t['Z'] > 0.5) & (t['Z'] < 0.75)
    t = t[redshift_mask]

    print("col_names =", t.colnames)
    print(len(t['Z']))

    return t, t.colnames
This function prints out the column headers, which should look like this

col_names =  ['RA', 'DEC', 'RUN', 'RERUN', 'CAMCOL', 'FIELD', 'ID', 'ICHUNK', 'IPOLY', 'ISECT', 'FRACPSF', 'EXPFLUX', 'DEVFLUX', 'PSFFLUX', 'MODELFLUX', 'FIBER2FLUX', 'R_DEV', 'EXTINCTION', 'PSF_FWHM', 'AIRMASS', 'SKYFLUX', 'EB_MINUS_V', 'IMAGE_DEPTH', 'IMATCH', 'Z', 'WEIGHT_FKP', 'WEIGHT_CP', 'WEIGHT_NOZ', 'WEIGHT_STAR', 'WEIGHT_SEEING', 'WEIGHT_SYSTOT', 'NZ', 'COMP', 'PLATE', 'FIBERID', 'MJD', 'FINALN', 'TILE', 'SPECTILE', 'ICOLLIDED', 'INGROUP', 'MULTGROUP']


In this tutorial, we will use only a subset of the columns given here. Namely, we will only use the positions of the galaxies given by the angles on the sky in 'RA' (right ascension) and 'DEC' (declination), as well as the redshift 'Z'. The redshift describes how much the light spectrum of the galaxy has been shifted in wavelength due to the expansion of the Universe. Most of the galaxies in this dataset are several billion light-years away from us, meaning the light travelled for a significant fraction of the age of the Universe before reaching us. Galaxies which are further away have a larger redshift, and we will use the redshift later, together with the right ascension and declination, to get a 3D position for all galaxies.
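To make the meaning of redshift concrete, here is a small sketch (the specific wavelengths are a hypothetical illustration, not taken from the BOSS catalogue): the redshift z follows directly from comparing an observed wavelength to the rest-frame wavelength of a known emission line.

```python
def redshift_from_wavelength(lam_observed, lam_rest):
    """Redshift z from an observed and a rest-frame wavelength (same units)."""
    return lam_observed / lam_rest - 1.0

# Hypothetical example: the H-alpha line has a rest wavelength of 656.3 nm.
# If we observe it at about 1115.7 nm, the galaxy sits at roughly z = 0.7,
# i.e. near the upper edge of the redshift cut used in read_data().
z = redshift_from_wavelength(1115.7, 656.3)
```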

Let's explore the dataset a bit before we go any further. We can read in the data
filename = 'galaxy_DR12v5_CMASSLOWZTOT_South.fits'
data_south, _ = read_data(filename)
filename = 'galaxy_DR12v5_CMASSLOWZTOT_North.fits'
data_north, _ = read_data(filename)
angular_plot(data_south, data_north)
and plot the angular distribution (right ascension and declination) with
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
def angular_plot(cat1, cat2):
    ''' Plot the catalogues '''
    cat1['RA'] -= 180
    c1 = down_sample([cat1['RA'], cat1['DEC']], 2000)
    cat2['RA'] -= 180
    c2 = down_sample([cat2['RA'], cat2['DEC']], 2000)

    plt.clf()
    plt.subplot(projection="aitoff")
    plt.title("BOSS DR12 survey footprint", y=1.1)
    plt.plot(np.radians(c1[0]), np.radians(c1[1]), '.')
    plt.plot(np.radians(c2[0]), np.radians(c2[1]), '.')
    plt.grid(True)
    plt.show()
    return 
Here we read in the two data catalogues, down-sample each to 2,000 points (a scatter plot with more than a million points would be unwieldy), and convert the angles from degrees to radians. The down_sample() function is defined as
import random
def down_sample(in_list, N):
    n = len(in_list[0])
    if N <= n:
        idx = random.sample(range(n), N)           # no replacement
    else:
        idx = random.choices(range(n), k=N)        # with replacement

    out_list = []
    for cat in in_list:
        out_list.append([cat[i] for i in idx])
    return out_list
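If you prefer NumPy, the same down-sampling can be sketched in a vectorised form with a random generator (down_sample_np is my own name for this variant, not part of any library):

```python
import numpy as np

def down_sample_np(arrays, N, seed=None):
    """Pick N random rows from a list of parallel arrays.

    Samples without replacement when N <= len(arrays[0]),
    with replacement otherwise -- same behaviour as down_sample().
    """
    n = len(arrays[0])
    rng = np.random.default_rng(seed)
    idx = rng.choice(n, size=N, replace=(N > n))
    return [np.asarray(a)[idx] for a in arrays]
```

Using index-based selection keeps the rows of the parallel arrays aligned, so an (RA, DEC) pair always stays together.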
We make a plot by projecting the data onto a sphere using the Aitoff projection.
Figure 1: The sky distribution of galaxies in the BOSS galaxy survey in the Aitoff projection.
Only 2,000 randomly selected points for each region are plotted.
This plot shows the sky coverage of the dataset. The galaxy survey observed two patches of the sky, which explains why we have two files. We will call them north and south from now on, even though east and west might seem more intuitive from this figure; that impression is just an artefact of the Aitoff projection. In fact, the plane of the Milky Way runs right between the two patches, so the southern part (blue) is indeed south of the Galactic plane and the northern part (yellow) is north of it.

The northern part has about twice as many galaxies as the southern part, and in total this dataset covers about one quarter of the sky.
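The "quarter of the sky" statement can be checked with back-of-the-envelope arithmetic (the footprint area of roughly 9,300 square degrees is an approximate value I am assuming here, not something computed from the files):

```python
import math

footprint_deg2 = 9300             # approximate BOSS DR12 footprint (assumed)
full_sky_deg2 = 129600 / math.pi  # 4*pi steradians ~ 41,253 square degrees
fraction = footprint_deg2 / full_sky_deg2
# fraction ~ 0.23, i.e. about one quarter of the sky
```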

We can also plot the redshift distribution
def redshift_plot(cat1, cat2):
    ''' Plot the catalogues '''
    bins = np.arange(0., 0.9, 0.01)
    # standard BOSS completeness weight: w_systot * (w_noz + w_cp - 1)
    weight = cat1['WEIGHT_SYSTOT'] * (cat1['WEIGHT_NOZ'] + cat1['WEIGHT_CP'] - 1.0)
    plt.hist(cat1['Z'], weights=weight, bins=bins, alpha=0.4, label='BOSS South')
    weight = cat2['WEIGHT_SYSTOT'] * (cat2['WEIGHT_NOZ'] + cat2['WEIGHT_CP'] - 1.0)
    plt.hist(cat2['Z'], weights=weight, bins=bins, alpha=0.4, label='BOSS North')
    plt.xlabel("redshift")
    plt.legend(loc=0)
    plt.show()
    return 
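The weight expression combines the standard BOSS correction weights: imaging systematics (WEIGHT_SYSTOT), redshift failures (WEIGHT_NOZ), and fibre collisions (WEIGHT_CP). As a hypothetical illustration (the numbers below are made up, not from the catalogue), a galaxy that absorbed one fibre-collided neighbour counts roughly twice in the histogram:

```python
# Hypothetical values: mild systematic correction, one fibre-collided neighbour
w_systot = 1.02  # imaging-systematics weight
w_noz = 1.0      # no nearby redshift failure
w_cp = 2.0       # this galaxy stands in for one collided neighbour

w_tot = w_systot * (w_noz + w_cp - 1.0)
# w_tot = 1.02 * 2.0 = 2.04: the galaxy contributes about twice
```

For a galaxy with no corrections at all (all weights equal to 1), the expression reduces to 1, so unaffected galaxies are counted exactly once.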
which looks like this
Figure 2: Redshift distribution of the BOSS galaxy sample. Redshift
0.7 corresponds to about 6 000 000 000 (6 billion) light-years. So when the light from
these galaxies was sent out, the Universe was only about 7 billion years old,
while now it is almost 14 billion years old.
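The light-travel time quoted in the caption can be checked with a quick numerical integration of the lookback time in a flat $\Lambda$CDM model. This is a sketch with approximate Planck-2015-like parameter values assumed, not a precision calculation:

```python
import math

H0 = 67.74                    # Hubble constant in km/s/Mpc (Planck-2015-like)
Om, Ode = 0.3089, 0.6911      # matter and dark-energy densities (flat universe)
hubble_time_gyr = 977.8 / H0  # 1/H0 in Gyr (977.8 converts km/s/Mpc to 1/Gyr)

def lookback_time_gyr(z, steps=1000):
    """Lookback time t(z) = (1/H0) * integral_0^z dz' / ((1+z') E(z'))."""
    dz = z / steps
    total = 0.0
    for i in range(steps):
        zi = (i + 0.5) * dz  # midpoint rule
        E = math.sqrt(Om * (1 + zi) ** 3 + Ode)
        total += dz / ((1 + zi) * E)
    return hubble_time_gyr * total
```

Evaluating lookback_time_gyr(0.7) gives roughly 6.5 billion years, so the light from the most distant galaxies in the sample has indeed travelled on the order of 6 billion light-years.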
The next step is to turn the (RA, DEC, Z) coordinates into Cartesian coordinates. However, this transformation depends on the cosmological model. Using the current standard model of cosmology (also called $\Lambda$CDM), we can set up a cosmological framework by fixing a small set of parameters:
(1) The amount of baryonic matter ($\Omega_b$)
(2) The amount of cold dark matter ($\Omega_{cdm}$)
(3) The dimensionless Hubble parameter $h$ (the Hubble constant divided by 100, $h = H_0/100\,\mathrm{km\,s^{-1}\,Mpc^{-1}}$)
In principle, there are more free parameters, but these are the ones that matter for this analysis. Instead of fixing those parameters individually, we will use the Planck cosmology published in 2015, as shipped with astropy (Planck15)
from mpl_toolkits import mplot3d  # registers the '3d' projection
from astropy import units as u
from astropy.cosmology import Planck15
def threeD_plot(cat1, cat2):
    # Get Cartesian coordinates using the Planck 2015 cosmology
    c1 = down_sample([cat1['Z'], cat1['RA'], cat1['DEC']], 2000)
    dist1 = Planck15.comoving_distance(c1[0]).to_value(u.Mpc)
    dist1_h = dist1 * Planck15.h  # now in Mpc/h
    catalog1 = {}
    catalog1['x'] = dist1_h * np.cos(np.radians(c1[2])) * np.cos(np.radians(c1[1]))
    catalog1['y'] = dist1_h * np.cos(np.radians(c1[2])) * np.sin(np.radians(c1[1]))
    catalog1['z'] = dist1_h * np.sin(np.radians(c1[2])) 

    c2 = down_sample([cat2['Z'], cat2['RA'], cat2['DEC']], 2000)
    dist2 = Planck15.comoving_distance(c2[0]).to_value(u.Mpc)
    dist2_h = dist2 * Planck15.h  # now in Mpc/h
    catalog2 = {}
    catalog2['x'] = dist2_h * np.cos(np.radians(c2[2])) * np.cos(np.radians(c2[1]))
    catalog2['y'] = dist2_h * np.cos(np.radians(c2[2])) * np.sin(np.radians(c2[1]))
    catalog2['z'] = dist2_h * np.sin(np.radians(c2[2]))

    fig = plt.figure()
    ax = plt.axes(projection='3d')
    ax.scatter(catalog1['x'], catalog1['y'], catalog1['z'])
    ax.scatter(catalog2['x'], catalog2['y'], catalog2['z'])
    plt.show()
    return
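The coordinate transformation inside threeD_plot is just the usual spherical-to-Cartesian conversion. Isolating it in a small helper (my own sketch, not part of the code above) makes it easy to sanity-check against known directions:

```python
import math

def sky_to_cartesian(dist, ra_deg, dec_deg):
    """Convert (comoving distance, RA in degrees, DEC in degrees) to (x, y, z)."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    x = dist * math.cos(dec) * math.cos(ra)
    y = dist * math.cos(dec) * math.sin(ra)
    z = dist * math.sin(dec)
    return x, y, z

# Sanity checks: RA=0, DEC=0 points along x; RA=90 along y; DEC=90 along z
```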
The $\Lambda$CDM cosmology here assumes a flat universe with no curvature, which means that the dark energy parameter is automatically fixed through
\[
\Omega_{DarkEnergy} = 1 - \Omega_{cdm} - \Omega_b.
\]These parameters constrain the expansion history of the Universe, including its size and age. The dataset we just read with our Python script is powerful enough to constrain these parameters and, therefore, teach us something about dark matter and dark energy. We will get to that in a later post.
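Plugging in approximate Planck 2015 values (the numbers below are assumed, rounded values, close to those bundled with astropy's Planck15) shows how the flatness condition fixes the dark-energy density:

```python
Omega_b = 0.0486    # baryonic matter (approximate Planck 2015 value)
Omega_cdm = 0.2603  # cold dark matter (approximate)

Omega_m = Omega_b + Omega_cdm  # total matter ~ 0.3089
Omega_de = 1.0 - Omega_m       # flatness fixes dark energy ~ 0.6911
```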

The code above converts the angles (RA, DEC) together with the redshift into a 3D position for each galaxy, producing a plot which should look like this
Figure 3: 3D distribution of BOSS galaxies. Only 2,000 randomly selected
galaxies per patch, out of more than 1 million in this dataset, are included in this plot.
The Milky Way is located in the middle, and the galaxies in the northern (yellow)
and southern (blue) patches are observed in two opposite directions.
And this concludes our first look at the BOSS dataset. This dataset was released in 2016 and represents cutting-edge cosmological research. How to use this dataset to constrain dark matter and dark energy will be the subject of the next blog posts.

The entire Python code for this project is available on GitHub.
cheers
Florian
