Wednesday, December 20, 2017

Building a random forest classifier from scratch in Python

A random forest classifier uses decision trees to classify objects. The outcome of the individual decision tree results are counted and the one with the highest score is chosen. The reason for using multiple decision trees is to reduce overfitting, which is often present in decision trees.

We can build a random forest from a training sample of size $M$ by following these 4 steps:

1. Randomly select $k << M$ samples with replacement from the dataset (bagging).
2. Build a decision tree: Pick a feature randomly from the set of $F$ available features and determine the split points with the highest information gain.
3. Continue picking features randomly until the decision tree reaches a maximum depth or has only leaf nodes.
4. Repeat steps 1-3 $B$ times to build a random forest classifier with $B$ trees.

Step 1: Bagging


Bagging or bootstrapping is a term which describes building sub-samples of a larger sample by picking a number of objects $k$ with replacement.
k_indices = np.random.choice(len(train), k)
train[k_indices]
Using bootstrapped training samples is a way to decorrelate the decision trees. If the trees are uncorrelated, increasing the number of trees will always lead to a reduction in the variance of the random forest without increasing the bias.

The uncertainty of a random forest with $B$ uncorrelated decision trees can be calculated as the variance between the different trees
\[\sigma^2 = \frac{1}{B-1}\sum^B_i[f(x') - \overline{f}]^2\]
The training and test error tend to level off after some number of trees have been fit and can be determined using cross-validation.

Step 2: Build a decision tree


We start with the root node. To fully specify this node we need to split the dataset into two daughter nodes. The way in which the dataset is split at each node is the main aspect that categorizes the tree we are building.

In our case, we determine the split point by the maximum information gain. The information gain is defined as
\[IG = I(D_p) - \frac{N_{left}}{N_p}I(D_{left}) - \frac{N_{right}}{N_p}I(D_{right})\]
and $I(D)$ can be the Gini index or Entropy:
\[
\begin{align}
I^{Gini}(D) &= 1 - \sum^{N_{classes}}_{i}p_i^2\\
I^{Entropy}(D) &= - \sum^{N_{classes}}_{i}p_i\log(p_i)
\end{align}
\]
where $p$ is the probability that a random pick from all elements in that node is a member of that class. For the rest of this blog post I will just focus on the Gini index.

Let's go through an example. Assume we want to build a decision tree to decide whether a person is female or male, based on a set of features. Here we have only two classes male and female. Have a look at the following two possible split points:

Given these two possibilities, let's calculate the information gain for split number 1:
\[\begin{align}
I^{Gini}(D_{parent}) &= 1 - \left[\left(\frac{40}{80}\right)^2 + \left(\frac{40}{80}\right)^2\right] = 1 - (0.5^2 + 0.5^2) = 0.5\\
I^{Gini}(D_{left}) &= 1 - \left[\left(\frac{30}{40}\right)^2 + \left(\frac{10}{40}\right)^2\right] = 1 - \left[\left(\frac{3}{4}\right)^2 + \left(\frac{1}{4}\right)^2\right] = \frac{3}{8} = 0.375\\
I^{Gini}(D_{right}) &= 1 - \left[\left(\frac{10}{40}\right)^2 + \left(\frac{30}{40}\right)^2\right] = 1 - \left[\left(\frac{1}{4}\right)^2 + \left(\frac{3}{4}\right)^2\right] = \frac{3}{8} = 0.375\\
IG &= 0.5 - \frac{40}{80}\times 0.375 - \frac{40}{80}\times 0.375 = 0.125
\end{align}\]
So the information gain is $0.125$. Let's do the same for split number 2:
\[\begin{align}
I^{Gini}(D_{parent}) &= 0.5 \text{   (as before)}\\
I^{Gini}(D_{left}) &= 1 - \left[\left(\frac{20}{60}\right)^2 + \left(\frac{40}{60}\right)^2\right] = 1 - \left[\left(\frac{1}{3}\right)^2 + \left(\frac{2}{3}\right)^2\right] = 1 - \frac{5}{9} = 0.44\\
I^{Gini}(D_{right}) &= 1 - \left[\left(\frac{20}{20}\right)^2 + \left(\frac{0}{20}\right)^2\right] = 1 - (1 + 0) = 0\\
IG &= 0.5 - \frac{60}{80}\times 0.44 - 0 = 0.17
\end{align}\]
So the second split is preferred. In general, splits which separate classes always have a large information gain, with a complete split between the classes corresponding to the maximum information gain of $0.5$. An information gain of $0.5$ means that you have all the information at that point, since all classes are separated. Therefore there is no point in growing the tree any deeper.

Here are the functions which calculate the information gain using the Gini index:
def calc_information_gain(groups, list_of_class_ids):
    # count all samples
    Nall = sum([len(group) for group in groups])
    # calculate Gini index of parent node
    all_rows = [row for group in groups for row in group]
    IG = calc_gini(all_rows, list_of_class_ids)
    # calculate Gini index of daughter nodes
    for group in groups:
        IG -= calc_gini(group, list_of_class_ids)*len(group)/Nall
    return IG

def calc_gini(group, list_of_class_ids):
    Ngroup = len(group)
    if Ngroup == 0:
        return 0
    dataset_class_ids = [row[-1] for row in group]
    sum_over_classes = 0.
    for class_id in list_of_class_ids:
        prob = dataset_class_ids.count(class_id)/Ngroup
        sum_over_classes += prob**2
    return 1. - sum_over_classes
Note that building a tree like this is highly representative of the training dataset and hence is susceptible to overfitting. This is the reason why random forest classifiers build multiple trees with random subsets of the training dataset.

Splitting the dataset: Splitting a dataset involves iterating over all rows in the dataset, checking if the feature value is below or above the split value and assigning it to the left or right group, respectively.
def split_node(index, value, dataset):
    ''' Split the dataset into two using a feature index and 
    feature value '''
    left = []
    right = []
    for row in dataset:
        if row[index] < value:
            left.append(row)
        else:
            right.append(row)
    return [left, right]
The feature on which to split is chosen randomly when building a random forest. To find the value on which to split, we have to go through all possible splits and evaluate the information gain. This can be a very expensive calculation.

Our approach is to pick all possible values of the feature within the training dataset and evaluate the information gain for all of them. In that case, the cost will scale with the number of rows in the training dataset.

With the Gini function above and the test split function, we now have everything we need to evaluate all possible splits.
def get_split(dataset, index):
    ''' Evaluate all possible splits given the dataset and the index of 
    the feature on which to split '''
    list_of_class_ids = list(set(row[-1] for row in dataset))
    split_value, max_IG, split_groups = 0., -1., None
    for row in dataset:
        groups = split_node(index, row[index], dataset)
        IG = calc_information_gain(groups, list_of_class_ids)
        if IG > max_IG:
            split_value, max_IG, split_groups = row[index], IG, groups
    return {'index': index,'split_value': split_value,'groups': groups }
To build the tree we also need to decide how deep the tree should be. There are two possibilities: We can set a maximum tree depth from the root node or we set a minimum number of rows in a node as a stopping criteria. Here we will take both as input values for our tree building function.

Step 3: Building the decision tree


To build the decision tree we need to implement the following steps:

1. Randomly determine the first splitting feature.
2. Split the original dataset according to the first optimal split using the get_split() function.
3. Recursively repeat step 1 and 2 on the daughter nodes until the tree has reached the maximum depth or either left or right group of rows is smaller than the minimum number of rows.

And here is the implementation
def build_tree(train, max_depth, min_size):
    # randomly determine the feature index
    feature_index = int( np.random.random()*(len(right[0]) - 1) )
    root = get_split(train, feature_index)
    split(root, max_depth, min_size, 1)
    return root

def to_terminal(group):
    # Create a terminal node value
    list_of_classes = [row[-1] for row in group]
    return max(set(list_of_classes), key=list_of_classes.count)

def split(node, max_depth, min_size, depth):
    left, right = node['groups']
    del(node['groups'])
    # check for a no split
    if not left or not right:
        node['left'] = node['right'] = to_terminal(left + right)
        return
    # check for max depth
    if depth >= max_depth:
        node['left'] = to_terminal(left)
        node['right'] = to_terminal(right)
        return
    # process left child
    if len(left) <= min_size:
        node['left'] = to_terminal(left)
    else:
        feature_index = int( np.random.random()*(len(right[0]) - 1) )
        node['left'] = get_split(left, feature_index)
        split(node['left'], max_depth, min_size, depth+1)
    # process right child
    if len(right) <= min_size:
        node['right'] = to_terminal(right)
    else:
        feature_index = int( np.random.random()*(len(right[0]) - 1) )
        node['right'] = get_split(right, feature_index)
        split(node['right'], max_depth, min_size, depth+1)

Step 4: Building the forest


Building the forest just means building multiple trees, so we call the build_tree() function multiple times with bootstrapped training samples
def build_forest(train, k, N_trees):
    max_depth = 4
    min_size = 2
    forest = []
    for i in range(0, N_trees):
        # bootstrap training dataset
        k_indices = np.random.choice(len(train), k)
        forest.append(build_tree(train[k_indices], max_depth, min_size))
    return forest
And that is it. We now have a random forest classifier in the form of a list of dictionaries. To use this classifier to classify a new object we just need to loop over the decison trees, store the tree results and the majority decision is what gives the classification.

Here is an implementation together with a test example:
def main():
    train = np.array([[3.77,4.19,0],
    [4.77,1.169761413,0],
    [-5.,2.81281357,0],
    [3.1,2.61995032,0],
    [3.6,2.209014212,0],
    [1.2,-3.162953546,1],
    [2.3,-3.339047188,1],
    [5.6,0.476683375,1],
    [-1.3,-3.234550982,1],
    [2.1,-3.319983761,1]])
    forest = build_forest(train, k=10, N_trees=100)
    for row in train:
        prediction = make_prediction(forest, row)
        print('truth = %d : prediction = %d' % (row[-1], prediction))
    return

def traverse_tree(node, row):
    if row[node['index']] < node['split_value']:
        if isinstance(node['left'], dict):
            return traverse_tree(node['left'], row)
        else:
            return node['left']
    else:
        if isinstance(node['right'], dict):
            return traverse_tree(node['right'], row)
        else:
            return node['right']

def make_prediction(forest, row):
    list_of_classes = []
    for tree_root in forest:
        list_of_classes.append(traverse_tree(tree_root, row))
    return max(set(list_of_classes), key=list_of_classes.count)
There are some hyper-parameters in the implementation above, for example what is the correct tree depth or minimum node size? Is it really correct to pick one feature randomly or should we split by multiple features? How many trees should the forest contain? All of these questions can be addressed with the dataset itself and do not actually need to be determined by a fitting procedure. If there is interest, I can go into details in a future blog post.

If you ever want to use a random forest classifier you should, of course, consider just using the sklearn learn implementation, which reduces the entire code above to 3 lines:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=100)
forest.fit(dataset_features, dataset_classes)
I hope this exercise gave you some insight into what is hidden behind the sklearn black box. The entire code is also available on GitHub.
best
Florian

Useful resources:
Blog post by Jason Brownlee on building decision trees: https://machinelearningmastery.com/implement-decision-tree-algorithm-scratch-python

Monday, December 18, 2017

Web scraping with Python tutorial, part 2 -- scrapy.py

This is the second part of this tutorial about web scraping with Python. In the first part we looked at scraping static content from a particular website and we used the syntax of that website to access the desired information. In this part, we will look at web scrapers that potentially could search the entire internet.

It would be quite simple to extend the scraper we already developed to search the entire internet rather than just one website. All that we need to do is search for links on that website and after processing one site we follow one of the links found on that site. If we build up a list of links that way it is very unlikely that we will ever run out of links.

Rather than building such a scraper ourselves, we are going to make use of scrapy.py, a Python package exactly for the purpose of widespread web scraping. Install scrapy with

    pip install scrapy

and create a new project with

    scrapy startproject email_scraper

Now there should be a folder called email_scraper. Under email_scraper/spiders we have to write our scraping commands. I wrote a short spider called email_spider.py
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import re
# scrapy crawl email_spider
def write_to_file(emails):
    with open("list_of_emails.csv", "a") as myfile:
        for email in emails:
            myfile.write("%s\n" % email)
    myfile.close
    return 

class PureEvil(CrawlSpider):
    name = "email_spider"
    download_delay = 1.0
    allowed_domains = ['stackoverflow.com'] 
    start_urls = ['https://stackoverflow.com/']    rules = [Rule(
        LinkExtractor(allow=".*"),
        callback='parse_page', 
        follow=True)
    ]
    def parse_page(self, response):
        emails = re.findall(r'[\w\.-]+@[\w\.-]+', response.body.decode('utf-8'))
        if emails:
            write_to_file(emails)
        return
This spider is named email_spider and if you go through the lines of code you can see why. It searches for email addresses in the source code and stores them in a CSV file. It is limited to domains within the site stackoverflow.com and will start at the main site. It will automatically store all the links on that site and will programmatically follow these links until it crawled the entire domain space.

You can see how easy it is to write a scraper with scrapy. Just a few lines of code give us a very powerful tool. Let me know if you have comments/questions below.
cheers
Florian

Thursday, November 30, 2017

Web scraping with Python tutorial, part 1 -- BeautifulSoup

In this tutorial, I will explain how to scrape content from a website using Python. In this first part, we will scrape the content of a static page.

First, you should check whether scraping is really the best way to get the data you want. If there is an API available, that is usually a much more robust way to get the data. But if there is no API, web scraping is a very powerful tool to do the job.

We will use the package BeautifulSoup to handle the HTML. You can install this package with

    pip install beautifulsoup4

and include it in the Python script with
from bs4 import BeautifulSoup
As an example, I will scrape all job advertisements from the HEP job list. We can access a list of all job advertisements with
start_link = "https://inspirehep.net/search?ln=en&cc=Jobs&rg=0"
Here we will use urllib to obtain the raw HTML of this site and feed it into BeautifulSoup
html = urlopen(link)
BeautifulSoup(html.read(), "html.parser")
Web scraping relies on components which are outside our control, for example, we rely on the website being accessible (online) and we rely on the HTML structure not having changed too much.  For that reason, it is a good idea to write the scraper with very extensive exception handling. The two lines above can be re-written as
def get_soup(link):
    try:
        html = urlopen(link)
    except HTTPError as e: # The page is not found on the server
        print("ERROR: Internal server error!", e)
        return None 
    except URLError as e:
        print("ERROR: The server could not be found!", e)
        return None 
    else:
        return BeautifulSoup(html.read(), "html.parser")
Here we re-wrote the whole procedure as a function and catch exceptions in which the page or the server can not be found.

Now that we have the HTML code as a BeautifulSoup object we can access its components in several ways. The most important are .find() and .findAll(), which return the first matching HTML object or all HTML objects, respectively.

However, how do we know what HTML objects to search for? For that, we need to look at the HTML code. A good tool for that is the Chrome inspector tool (View > Developer > Developer Tools, then click on the inspector tool at the top left corner). You can move the mouse over an object on the page and you can directly see the associated HTML object on the left
Figure 1: Google Chrome Developer tools inspector. You can see the HTML
on the right with the object highlighted, which corresponds to the part
the mouse is currently pointing to on the left.
With this, you can find the HTML tags you want to access. Here we want to access the list of job ads and each job ad is embedded in a div tag with class="record_body". So let's find all div objects with that class.
list_of_jobs = soup.findAll("div", {"class": "record_body"})
Now we have a list, which contains all the div objects for the different jobs on that site. We can loop through that list and extract some more information
for job in list_of_jobs:
    job_info = {'posting_date': '', 'record_link': ''}
    strong_tag = job.find("strong")
    if strong_tag:
        job_info['posting_date'] = strong_tag.text
    else:
        print("WARNING: No posting date found?")

    a_tags = job.findAll("a")
    if a_tags[1]:
        job_info['institute'] = a_tags[1].text
        job_info['position'] = a_tags[1].findNext("span").text
    else:
        print("WARNING: No institute found?")
This website is an example of very badly written HTML, which makes scraping quite difficult. If there is structure in the site, with lots of attributes, like ids or classes, we can easily identify the objects we want. Here this is not the case. For example, the loop above assumes that the first strong tag is the posting date. Let's hope that the admin does not change the HTML by including more strong tags since that would break this scraper. However, I could not find a more reliable way to access the posting date.

We can also access the link to the detailed job advertisement with
job_info['record_link'] = a_tags[0].attrs['href']
So to get even more information about the job ads we can now visit the individual links
for i, record in enumerate(records):
    time.sleep(1)
    soup = get_soup(record['record_link'])
    if soup == None:
        print('ERROR: No soup object received for record %d' % i)
    else:
        record_details = soup.find("div", {"class": "detailed_record_info"})
        job_description_identifier = record_details.find("strong",
                                                         text="Job description: ")
        description = job_description_identifier.parent
        if description:
            record['description'] = str(description)\
                          .replace('<strong>Job description: </strong><br/>', '')
        else:
            print('WARNING: No description found')
Here we loop over all job ads, stored in the records list, obtain the HTML code with the get_soup() function defined above and scrape the job description. The scraper waits for one second within the loop, otherwise we could put some heavy load on the server and disrupt the website performance. Again the job description has to be scraped by searching for title headings rather than HTML attributes because this website does not offer much structure.

One thing to consider when scraping a site is the robots.txt file which you can access on the root of the site (e.g. https://inspirehep.net/robots.txt). In this file, the owner of the website specifies which content should not be accessed by scrapers (robots). You don't have to follow what is written in this file, but it is a good idea to look at it, to see whether what you want to do goes against the owner's intention. If you look at the file linked above you will find that while there are some sites excluded from robot access, the jobs site on HEP is not, so we are good.

You can get the entire code discussed in this post from my GitHub account. Let me know if you have comments/questions below.
cheers
Florian

Thursday, October 12, 2017

Study the Universe with Python tutorial, part 5 -- Monte Carlo Markov Chain

This is the fifth blog post in this series which discusses the Baryon Oscillation Spectroscopic dataset (BOSS). In the last 4 posts we downloaded the data, calculated the power spectrum and covariance matrix and isolated the BAO feature. Now we want to apply this procedure to the data and extract cosmological information.

In the last post we used the scipy minimize function to find the best fitting model. However the best fitting model is not very useful, if we don't know the uncertainties attached to it.

What we really need is a probability for each possible model which could describe our Universe. The probability for model $Y$ with parameters x should reflect the probability that we would observe the BOSS dataset if the Universe were described by $Y(x)$. Such a probability distribution is called a likelihood and a very powerful way to get such a likelihood is a Monte Carlo Markov Chain (MCMC).

What is an MCMC? An MCMC is a chain of parameter sets which has the nice property that the appearance of each parameter in the chain also represents its likelihood distribution. This means one can take the chain, produce a histogram of parameter x which directly provides the (marginalised) likelihood for parameter x (marginalised just means that you integrate over the likelihood of all other parameters).

There are many algorithms which can produce such a chains, but the most basic and well known is the Metropolis-Hastings algorithm. Here we will just focus on the python implementation of such a procedure, but if you have never encountered MCMC it is certainly worth learning about it.

We will use the MCMC emcee package which can be installed with
>> pip install emcee
and included with
import emcee
We can call this MCMC sampler like
nwalkers = 20
dim = len(start)
expected_error = [0.1, 1., 1., 1., 1., 1., 0.05, 0.1]
sampler = emcee.EnsembleSampler(nwalkers=nwalkers, dim=dim, lnpostfn=get_loglike,\
 args=(boss_data, { 'noBAO': Pk_without_BAO, 'os_model': os_model }, get_shifted_model))
start_positions = [(result['x'] + (2.*np.random.random_sample((dim,)) - 1)*expected_error) for i in range(nwalkers)]
sampler.run_mcmc(start_positions, 1000)
Here we have to specify the number of walkers (nwalkers) since this MCMC package will produce several MCMC chains in parallel (each walker produces one chain). The separate chains produced by each walker are not really independent, since they communicate with each other (Goodman & Weare, 2010). We also need to specify the starting values for the different walkers. To do that we use the results from the minimisation procedure discussed in the last post and add a random variation around it. Different to our minimisation function in the last blog post, this algorithm considers high values a good match. Therefore we have to write a small function to turn our $\chi^2$ into a log-likelihood, which is defined as
def get_loglike(parameters, data, templates, func):
    return -0.5*calc_chi2(parameters, data, templates, func)
The run_mcmc() method produces a chain with 1000 rows times the number of walkers.

The problem with MCMC is that it requires a certain time to converge. This means the number of appearances of a parameter only reflects the likelihood if the chain has converged. Determining the convergence is very tricky, and there is no procedure which is guaranteed to detect the correct convergence. Here we will use the Gelman & Rubin criterium.

The Gelman & Rubin criterium compares several independent chains and tries to determine whether the variances within these chains are similar. Here we will not discuss the details of the implementation, but you can read more about it in this interesting blog post.

To use the Gelman & Rubin criterium we need to use several MCMC chains and compare them. The chains are independent, thus usually ran in parallel to save time. This means we have to re-write the MCMC procedure
def runMCMC(start, data, templates):
    ''' Perform MCMC '''
    dim = len(start)
    Nchains = 4
    nwalkers = 20
    ichaincheck = 400
    minlength = 2000
    epsilon = 0.01

    labels = ['b', 'A1', 'A2', 'A3', 'A4', 'A5', 'alpha', 'sigmaNL']
    expected_error = [0.1, 1., 1., 1., 1., 1., 0.05, 0.1]
    # Set up the sampler.
    pos=[]
    list_of_samplers=[]
    for jj in range(0, Nchains):
        pos.append([start + (2.*np.random.random_sample((dim,)) - 1.)*expected_error for i in range(nwalkers)])
        list_of_samplers.append(emcee.EnsembleSampler(nwalkers=nwalkers, dim=dim, lnpostfn=get_loglike,\
         args=(data, templates, get_shifted_model)))

    # Start MCMC
    print("Running MCMC... ")
    within_chain_var = np.zeros((Nchains, dim))
    mean_chain = np.zeros((Nchains, dim))
    scalereduction = np.arange(dim, dtype=np.float)
    scalereduction.fill(2.)

    itercounter = 0
    chainstep = minlength
    while any(abs(1. - scalereduction) > epsilon):
        itercounter += chainstep
        for jj in range(0, Nchains):
            for result in list_of_samplers[jj].sample(pos[jj], iterations=chainstep, storechain=True):
                pos[jj] = result[0]
            # we do the convergence test on the second half of the current chain (itercounter/2)
            within_chain_var[jj], mean_chain[jj] = prep_gelman_rubin(list_of_samplers[jj])
        scalereduction = gelman_rubin_convergence(within_chain_var, mean_chain, int(itercounter/2))
        print("scalereduction = ", scalereduction)
        chainstep = ichaincheck
    # Investigate the chain and print out some metrics
    inspect_chain(list_of_samplers, labels)
    return 
This function basically does the same thing as the function we had before, but it now runs within a loop, which is only broken when the Gelman & Rubin parameters in the scalereduction array are within a certain range of unity.

The inspect_chain() function provides several diagnostics, with which we can further test whether the chain has actually converged. For example we can print out the timeline of the sampling range for each parameter
fig, axes = plt.subplots(dim, 1, sharex=True, figsize=(8, 9))
for i in range(0, dim):
    for jj in range(0, Nchains):
        axes[i].plot(list_of_samplers[jj].chain[:, :, i].T, alpha=0.4)
    #axes[i].yaxis.set_major_locator(MaxNLocator(5))
    axes[i].set_ylabel(labels[i])
fig.tight_layout(h_pad=0.0)
fig.savefig("%stime_series.png" % output_path)
which gives
Figure 1: Variance in the different parameters. The convergence
has been reached if the variance is constant with time. In the early
stages of the chain the variance changes, indicating the the
convergence has not been reached.
This figure shows that the sampling range varies in the beginning, which indicates that the chain has not converged yet. The second half of the chain has an almost constant variance which is a good indication of convergence.

If we convinced ourselves that the chain has indeed converged we can plot the likelihood. Here we use corner.py which can be installed with
>> pip install corner
and included with
import corner
We can then feed our chain into this plotting package as
fig = corner.corner(mergedsamples, quantiles=[0.16, 0.5, 0.84], plot_density=False,\
    show_titles=True, title_fmt=".3f", labels=labels)
fig.savefig("%scorner.png" % output_path)
which gives
Figure 2: 2D likelihood distributions for the 7 parameters in our fit.
The 1D marginalised likelihood is given on the right together with the
$68\%$ confidence levels. 
Finally we can also print out the constraints with
def get_percentiles(chain, labels):
    ''' Calculate constraints and uncertainties from MCMC chain '''
    per = np.percentile(chain, [50., 15.86555, 84.13445, 2.2775, 97.7225], axis=0)
    per = [per[0], per[0]-per[1], per[2]-per[0], per[0]-per[3], per[4]-per[0]]
    for i in range(0, len(per[0])):
        print("%s = %f +%f -%f +%f -%f" % (labels[i], per[0][i], per[1][i], per[2][i], per[3][i], per[4][i]))
    return per
and with the BOSS data we should get something like
>> b = 2.295306 +0.314054 -0.312351 +0.584276 -0.610636
A1 = -0.151418 +0.111546 -0.111835 +0.223292 -0.232080
A2 = 41.676085 +16.889086 -17.505383 +30.571086 -37.981033
A3 = -3306.605503 +1642.375097 -1467.782597 +3487.705208 -2538.709703
A4 = 14835.170765 +5912.070481 -6608.236429 +10122.305967 -14237.721237
A5 = -21141.535952 +9150.798840 -8221.692166 +20105.713945 -14085.031087
alpha = 0.990088 +0.032634 -0.027241 +0.077669 -0.059837
sigmaNL = 11.378695 +2.975023 -4.083891 +5.036540 -7.682274
Now what does that mean in terms of cosmological parameters? Well as mentioned before, the only parameter of interest here is the scaling parameter $\alpha$. This parameter is defined as
\[
\alpha = \frac{D_V}{D^{\rm fid}_V}
\]where the the distance $D_V$ is defined as
\[
D_V = \left[(1+z)^2D^2_A\frac{cz}{H(z)}\right]^{1/3}
\]The superscribed $^{\rm fid}$ just means that this is the same quantity in the fiducial cosmology, which is the one we defined in the cosmo class in nbodykit.

The distance $D_V(z)$ is an average of the angular diameter distance $D_A(z)$ and the Hubble parameter $H(z)$
\[
\begin{align}
D_A(z) &= \frac{c}{1+z}\int^z_0\frac{dz'}{H(z')}\\
H(z) &= H_0\left[(\Omega_cdm+\Omega_b)(1+z)^3 + \Omega_{\Lambda}\right]^{1/2}
\end{align}
\]With this we can now turn the constraint on $\alpha$ into a constraint on $D_V$
Hz = cosmo.efunc(zmean)*cosmo.H0*cosmo.h
DC = (1.+zmean)*cosmo.angular_diameter_distance(zmean)/cosmo.h
c_km_s = (speed_of_light/1000.)
DVfid = ( DC**2*(zmean*c_km_s/Hz) )**(1./3.)
DV = per[:,6]*DVfid
print("DV = %f +%f -%f +%f -%f\n" % (DV[0], DV[1], DV[2], DV[3], DV[4]))
Which should show something like
>> DV = 2152.179924 +69.092697 -57.551602 +164.343396 -121.202709
The next step would be to combine this constraint with measurements from the Cosmic Microwave Background (CMB), to get constraints on actual cosmological parameters. But for now I will leave it here.

Thanks a lot for making it to the end of this series. If you are interested in going further, you should consider a few extensions to this analysis:
  • Have a look at density field reconstruction
  • Consider an anisotropic analysis to constrain $D_A(z)$ and $H(z)$ independently
  • Do you think the polynomial model we use is the best choice? Can you come up with a better model?
  • Use the simulated BOSS catalogues to test whether our procedure is robust
The final code for this analysis is available on GitHub.
cheers
Florian

Study the Universe with Python tutorial, part 4 -- minimizing $\chi^2$

This is the fourth blog post about how to use the Baryon Oscillation Spectroscopic (BOSS) dataset to constrain cosmology. In the first three posts, we downloaded the data, calculated the power spectrum and obtained a covariance matrix.

The power spectrum we measured carries a huge amount of information about cosmological parameters, but most of that information cannot be accessed easily. The reason is that we cannot calculate reliable models for the power spectrum, to compare to the data. In a previous post, we calculated models using the class code, but these models are only accurate on very large scales (small wavenumbers).

For that reason, we will focus here on one particular signal, which can be described with a very simple model. This signal is called Baryon Acoustic Oscillations (BAO). BAO in the galaxy distribution were detected for the first time in 2005 and since that first detection, BAO have become one of the most powerful cosmological probes. In fact, the BAO measurement is one of the main reasons why the BOSS dataset has been assembled.

The BAO signal shows up as an oscillation signal on top of the power spectrum. In a previous plot, we saw model power spectra for different amounts of dark matter
Figure 1: Model power spectrum for different amounts
of dark matter.
You can see the BAO signal in both power spectra as an oscillation. This is the signal we are after. The BAO signal is much more robust (easier to model) compared to the rest of the power spectrum and therefore we will use a model for the power spectrum where we get rid of all but the oscillation signal.

We can do that by fitting a power spectrum without a BAO signal to a power spectrum with BAO signal. Fitting just means that we test many different power spectra and pick the one which matches the best.

The nbodykit package provides power spectra without BAO signal, all that we have to do is call the LinearPower() function with the transfer parameter 'NoWiggleEisensteinHu'.
Pk = nbodykit.lab.cosmology.power.linear.LinearPower(cosmo, redshift=0, transfer='NoWiggleEisensteinHu')
And if we plot such a power spectrum we get
Figure 2: Same power spectra as in Figure 1, but calculated
with the transfer parameter 'NoWiggleEisensteinHu', which
calculates the power spectrum without BAO signal. You can see
that the BAO wiggles have been removed.
Comparing the power spectra in Figure 1 and Figure 2 shows that these power spectra are the same except for the BAO wiggles, which have been removed.

Our model uses the noBAO power spectrum together with 5 additional polynomial terms. These parameters have no physical motivation, but their whole purpose is to make this fit independent of the broadband and focus on the oscillation signature.
\[
P^{model}(k) = p^2_0P^{noBAO}(k) + \frac{p_1}{k^3} + \frac{p_2}{k^2}+ \frac{p_3}{k} + p_4 + p_5k.
\]
and in Python
def get_smooth_model(parameters, x, templates):
    ''' Combine a noBAO model with polynomials and linear bias '''
    polynomials = parameters[1]/x**3 + parameters[2]/x**2 + parameters[3]/x + parameters[4] + parameters[5]*x
    return parameters[0]*parameters[0]*templates['noBAO'](x) + polynomials
With this, we can now extract the oscillation signature from any power spectrum model
def get_oscillation(krange, Pk_class, Pk_without_BAO):
    ''' Get an oscillation only power spectrum '''
    cov_inv = np.identity(len(krange))
    start = [1., 0., 0., 0., 0., 0.]
    result = op.minimize(calc_chi2, start, args=( [{ 'k': krange, 'pk': Pk_class(krange), 'cov_inv': cov_inv }],\
     { 'noBAO': Pk_without_BAO }, get_smooth_model))
    yolin = Pk_class(krange)/get_smooth_model(result["x"], krange, { 'noBAO': Pk_without_BAO })
    return interp.interp1d(krange, yolin)
Here we use the minimize() function from the scipy.optimize package
import scipy.optimize as op
We also need to define a metric, meaning we need to specify how we distinguish good models from bad models. Here we use the $\chi^2$, which we can calculate as
def calc_chi2(parameters, data, templates, func):
    ''' Compares the model with the data '''
    if within_priors(parameters):
        chi2 = 0.
        # Loop over all datasets which are fit together
        for dataset in data:
            model = func(parameters, dataset['k'], templates)
            diff = (model - dataset['pk'])
            chi2 += np.dot(diff,np.dot(dataset['cov_inv'],diff))
        return chi2
    else:
        return 100000.
The within_priors() function tests whether the parameters are within a certain range (prior), otherwise it returns a very large $\chi^2$.

Here is a potential output of the get_oscillation() function
Figure 3: The galaxy power spectrum divided by the best fitting
no-BAO power spectrum. With this technique, we can isolate the
BAO feature.
Our procedure removed the entire broadband power spectrum and isolated the oscillation signature. Before we apply this to the data, we have to extend the model
def get_shifted_model(parameters, x, templates):
    ''' Calculate a model including a scaling parameter '''
    model = get_smooth_model(parameters, x, templates)
    return model*(1. + (templates['os_model'](x/parameters[6]) - 1.)*np.exp(-0.5*x**2*parameters[7]**2))
Here we get the smooth model we discussed above, but modify it by introducing a smoothing scale (parameters[7]) and a scaling parameter (parameters[6]). The scaling parameter is the parameter we are actually interested in, since it carries the wavelength of the oscillation signature and is the only parameter here which we will exploit for cosmological constraints.

Now let's apply this to the data.
Pk_without_BAO = nbodykit.lab.cosmology.power.linear.LinearPower(cosmo, redshift=0, transfer='NoWiggleEisensteinHu')
Pk_class = nbodykit.lab.cosmology.power.linear.LinearPower(cosmo, redshift=0, transfer='CLASS')
krange = np.arange(0.001, 0.5, 0.001)
os_model = get_oscillation(krange, Pk_class, Pk_without_BAO)
start = [2.37, -0.076, 38., -3547., 15760., -22622., 1., 9.41]
result = op.minimize(calc_chi2, start, args=(boss_data, { 'noBAO': Pk_without_BAO, 'os_model': os_model },\
get_shifted_model))
print("result['x'] = ", result['x'])
And with this, you should get something like this
Figure 4: BOSS measurements (data points) together with the
best fitting model.
And the best fitting parameters are
>> result['x'] =  [  2.35228409e+00  -1.40192579e-01   4.30619066e+01  -3.57546008e+03

   1.58788229e+04  -2.25314741e+04   9.96151333e-01   9.47806970e+00]
The problem is that we have no uncertainties attached to these parameters; we only have the best fitting values. Without uncertainties, we cannot use this to obtain cosmological constraints. So the next blog post will extend this by not just finding the best fitting parameters, but also getting the uncertainties.

The code for this project is available on GitHub. Let me know if you have any questions/comments below.
cheers
Florian