Wednesday, December 20, 2017

Building a random forest classifier from scratch in Python

A random forest classifier uses an ensemble of decision trees to classify objects. Each tree votes for a class, and the class with the most votes is chosen. The reason for using multiple decision trees is to reduce overfitting, to which individual decision trees are prone.

We can build a random forest from a training sample of size $M$ by following these 4 steps:

1. Randomly select $k \ll M$ samples with replacement from the dataset (bagging).
2. Build a decision tree: pick a feature at random from the set of $F$ available features and determine the split point with the highest information gain.
3. Continue picking features at random until the decision tree reaches a maximum depth or consists only of leaf nodes.
4. Repeat steps 1-3 $B$ times to build a random forest classifier with $B$ trees.

Step 1: Bagging


Bagging (bootstrap aggregating) describes building sub-samples of a larger sample by picking $k$ objects with replacement:
import numpy as np

# draw k row indices with replacement (the default for np.random.choice)
k_indices = np.random.choice(len(train), k)
bootstrap_sample = train[k_indices]
Using bootstrapped training samples is a way to decorrelate the decision trees. If the trees are uncorrelated, increasing the number of trees will always lead to a reduction in the variance of the random forest without increasing the bias.

The uncertainty of a random forest with $B$ uncorrelated decision trees can be estimated as the variance between the different trees
\[\sigma^2 = \frac{1}{B-1}\sum^{B}_{b=1}\left[f_b(x') - \overline{f}\right]^2\]
where $f_b(x')$ is the prediction of tree $b$ at the point $x'$ and $\overline{f}$ is the mean prediction of all $B$ trees.
The training and test error tend to level off after some number of trees have been fit; this number can be determined using cross-validation.
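As a quick illustration, this variance can be estimated directly from the individual tree outputs at a query point $x'$; the per-tree values below are made up purely for illustration:
import numpy as np

# hypothetical per-tree outputs f_b(x') for a forest with B = 5 trees
tree_predictions = np.array([0.8, 0.6, 0.9, 0.7, 0.75])
# np.var with ddof=1 implements 1/(B-1) * sum_b (f_b(x') - mean)^2
sigma2 = np.var(tree_predictions, ddof=1)
print(sigma2)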

Step 2: Build a decision tree


We start with the root node. To fully specify this node we need to split the dataset into two daughter nodes. The way in which the dataset is split at each node is the main aspect that characterizes the tree we are building.

In our case, we determine the split point by the maximum information gain. The information gain is defined as
\[IG = I(D_p) - \frac{N_{left}}{N_p}I(D_{left}) - \frac{N_{right}}{N_p}I(D_{right})\]
and $I(D)$ can be the Gini index or Entropy:
\[
\begin{align}
I^{Gini}(D) &= 1 - \sum^{N_{classes}}_{i}p_i^2\\
I^{Entropy}(D) &= - \sum^{N_{classes}}_{i}p_i\log(p_i)
\end{align}
\]
where $p_i$ is the probability that a random pick from all elements in that node is a member of class $i$. For the rest of this blog post I will just focus on the Gini index.
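Before moving on, here is a small numerical illustration of both impurity measures; the class probabilities are arbitrary and the entropy uses a base-2 logarithm:
import numpy as np

p = np.array([0.75, 0.25])          # class probabilities within a node
gini = 1. - np.sum(p**2)            # 1 - sum_i p_i^2       -> 0.375
entropy = -np.sum(p * np.log2(p))   # -sum_i p_i log2(p_i)  -> 0.811
print(gini, entropy)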

Let's go through an example. Assume we want to build a decision tree that decides whether a person is female or male, based on a set of features; so we have only two classes, male and female. Have a look at the following two possible split points:

Given these two possibilities (split number 1 divides the 80 people into daughter nodes with 30/10 and 10/30 members of the two classes, while split number 2 produces daughter nodes with 20/40 and 20/0), let's calculate the information gain for split number 1:
\[\begin{align}
I^{Gini}(D_{parent}) &= 1 - \left[\left(\frac{40}{80}\right)^2 + \left(\frac{40}{80}\right)^2\right] = 1 - (0.5^2 + 0.5^2) = 0.5\\
I^{Gini}(D_{left}) &= 1 - \left[\left(\frac{30}{40}\right)^2 + \left(\frac{10}{40}\right)^2\right] = 1 - \left[\left(\frac{3}{4}\right)^2 + \left(\frac{1}{4}\right)^2\right] = \frac{3}{8} = 0.375\\
I^{Gini}(D_{right}) &= 1 - \left[\left(\frac{10}{40}\right)^2 + \left(\frac{30}{40}\right)^2\right] = 1 - \left[\left(\frac{1}{4}\right)^2 + \left(\frac{3}{4}\right)^2\right] = \frac{3}{8} = 0.375\\
IG &= 0.5 - \frac{40}{80}\times 0.375 - \frac{40}{80}\times 0.375 = 0.125
\end{align}\]
So the information gain is $0.125$. Let's do the same for split number 2:
\[\begin{align}
I^{Gini}(D_{parent}) &= 0.5 \text{   (as before)}\\
I^{Gini}(D_{left}) &= 1 - \left[\left(\frac{20}{60}\right)^2 + \left(\frac{40}{60}\right)^2\right] = 1 - \left[\left(\frac{1}{3}\right)^2 + \left(\frac{2}{3}\right)^2\right] = 1 - \frac{5}{9} = 0.44\\
I^{Gini}(D_{right}) &= 1 - \left[\left(\frac{20}{20}\right)^2 + \left(\frac{0}{20}\right)^2\right] = 1 - (1 + 0) = 0\\
IG &= 0.5 - \frac{60}{80}\times 0.44 - 0 = 0.17
\end{align}\]
So the second split is preferred. In general, splits which cleanly separate the classes have a large information gain; in this two-class example with equally sized classes, a complete separation corresponds to the maximum information gain of $0.5$, which is the Gini index of the parent node. An information gain of $0.5$ means that the daughter nodes are pure, so there is no point in growing that branch of the tree any deeper.

Here are the functions which calculate the information gain using the Gini index:
def calc_information_gain(groups, list_of_class_ids):
    # count all samples
    Nall = sum([len(group) for group in groups])
    # calculate Gini index of parent node
    all_rows = [row for group in groups for row in group]
    IG = calc_gini(all_rows, list_of_class_ids)
    # calculate Gini index of daughter nodes
    for group in groups:
        IG -= calc_gini(group, list_of_class_ids)*len(group)/Nall
    return IG

def calc_gini(group, list_of_class_ids):
    Ngroup = len(group)
    if Ngroup == 0:
        return 0
    dataset_class_ids = [row[-1] for row in group]
    sum_over_classes = 0.
    for class_id in list_of_class_ids:
        prob = dataset_class_ids.count(class_id)/Ngroup
        sum_over_classes += prob**2
    return 1. - sum_over_classes
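As a quick check, these functions reproduce the hand calculation for split number 1 above. The rows below only carry the class id in their last (and only) column, which is all the Gini calculation looks at:
# split number 1: daughter nodes with 30/10 and 10/30 members of the two classes
left = [[0]]*30 + [[1]]*10
right = [[0]]*10 + [[1]]*30
print(calc_information_gain([left, right], [0, 1]))   # -> 0.125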
Note that a tree built like this closely mirrors the training dataset and is hence susceptible to overfitting. This is the reason why random forest classifiers build multiple trees, each on a random subset of the training dataset.

Splitting the dataset: Splitting a dataset involves iterating over all rows in the dataset, checking whether the feature value is below the split value, and assigning the row to the left group if so or to the right group otherwise.
def split_node(index, value, dataset):
    ''' Split the dataset into two using a feature index and 
    feature value '''
    left = []
    right = []
    for row in dataset:
        if row[index] < value:
            left.append(row)
        else:
            right.append(row)
    return [left, right]
The feature on which to split is chosen randomly when building a random forest. To find the value on which to split, we have to go through all possible splits and evaluate the information gain. This can be a very expensive calculation.

Our approach is to try every value of the chosen feature that occurs in the training dataset as a candidate split value and evaluate the information gain for each. Since every candidate requires a pass over the whole dataset, the cost of finding a split grows quadratically with the number of rows.

With the Gini function above and the split_node() function, we now have everything we need to evaluate all possible splits.
def get_split(dataset, index):
    ''' Evaluate all possible splits given the dataset and the index of 
    the feature on which to split '''
    list_of_class_ids = list(set(row[-1] for row in dataset))
    split_value, max_IG, split_groups = 0., -1., None
    for row in dataset:
        groups = split_node(index, row[index], dataset)
        IG = calc_information_gain(groups, list_of_class_ids)
        if IG > max_IG:
            split_value, max_IG, split_groups = row[index], IG, groups
    return {'index': index, 'split_value': split_value, 'groups': split_groups}
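For example, on a tiny one-feature dataset (the numbers below are made up) the function recovers the split value that separates the two classes:
# last column is the class id; feature 0 separates the classes at 7.0
toy = [[1.0, 0], [1.5, 0], [2.0, 0], [7.0, 1], [8.0, 1], [9.5, 1]]
node = get_split(toy, index=0)
print(node['split_value'])   # -> 7.0, i.e. rows with feature 0 < 7.0 go left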
To build the tree we also need to decide how deep it should be. There are two possibilities: we can set a maximum tree depth from the root node, or we can set a minimum number of rows in a node as a stopping criterion. Here we will take both as input values for our tree-building function.

Step 3: Building the decision tree


To build the decision tree we need to implement the following steps:

1. Randomly determine the splitting feature.
2. Split the dataset at the optimal split point for that feature using the get_split() function.
3. Recursively repeat steps 1 and 2 on the daughter nodes until the tree reaches the maximum depth or until the left or right group of rows is smaller than the minimum number of rows.

And here is the implementation:
def build_tree(train, max_depth, min_size):
    # randomly determine the feature index (the last column holds the class id)
    feature_index = int( np.random.random()*(len(train[0]) - 1) )
    root = get_split(train, feature_index)
    split(root, max_depth, min_size, 1)
    return root

def to_terminal(group):
    # Create a terminal node value
    list_of_classes = [row[-1] for row in group]
    return max(set(list_of_classes), key=list_of_classes.count)

def split(node, max_depth, min_size, depth):
    left, right = node['groups']
    del(node['groups'])
    # check for a no split
    if not left or not right:
        node['left'] = node['right'] = to_terminal(left + right)
        return
    # check for max depth
    if depth >= max_depth:
        node['left'] = to_terminal(left)
        node['right'] = to_terminal(right)
        return
    # process left child
    if len(left) <= min_size:
        node['left'] = to_terminal(left)
    else:
        feature_index = int( np.random.random()*(len(left[0]) - 1) )
        node['left'] = get_split(left, feature_index)
        split(node['left'], max_depth, min_size, depth+1)
    # process right child
    if len(right) <= min_size:
        node['right'] = to_terminal(right)
    else:
        feature_index = int( np.random.random()*(len(right[0]) - 1) )
        node['right'] = get_split(right, feature_index)
        split(node['right'], max_depth, min_size, depth+1)
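As a small sketch of how these functions fit together, we can grow a single tree on the toy dataset from the get_split() example above (with only one feature, the random feature pick always lands on index 0):
tree = build_tree(toy, max_depth=2, min_size=1)
print(tree['index'], tree['split_value'])   # -> 0 7.0 for this toy data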

Step 4: Building the forest


Building the forest just means building multiple trees, so we call the build_tree() function multiple times with bootstrapped training samples:
def build_forest(train, k, N_trees):
    max_depth = 4
    min_size = 2
    forest = []
    for i in range(0, N_trees):
        # bootstrap training dataset
        k_indices = np.random.choice(len(train), k)
        forest.append(build_tree(train[k_indices], max_depth, min_size))
    return forest
And that is it. We now have a random forest classifier in the form of a list of dictionaries. To classify a new object we loop over the decision trees, store each tree's result, and take the majority decision as the classification.

Here is an implementation together with a test example:
def main():
    train = np.array([[3.77, 4.19, 0],
                      [4.77, 1.169761413, 0],
                      [-5., 2.81281357, 0],
                      [3.1, 2.61995032, 0],
                      [3.6, 2.209014212, 0],
                      [1.2, -3.162953546, 1],
                      [2.3, -3.339047188, 1],
                      [5.6, 0.476683375, 1],
                      [-1.3, -3.234550982, 1],
                      [2.1, -3.319983761, 1]])
    forest = build_forest(train, k=10, N_trees=100)
    for row in train:
        prediction = make_prediction(forest, row)
        print('truth = %d : prediction = %d' % (row[-1], prediction))
    return

def traverse_tree(node, row):
    if row[node['index']] < node['split_value']:
        if isinstance(node['left'], dict):
            return traverse_tree(node['left'], row)
        else:
            return node['left']
    else:
        if isinstance(node['right'], dict):
            return traverse_tree(node['right'], row)
        else:
            return node['right']

def make_prediction(forest, row):
    list_of_classes = []
    for tree_root in forest:
        list_of_classes.append(traverse_tree(tree_root, row))
    return max(set(list_of_classes), key=list_of_classes.count)
There are some hyper-parameters in the implementation above, for example the tree depth and the minimum node size. Is it really correct to pick one feature randomly, or should we consider splits on multiple features at each node? How many trees should the forest contain? All of these questions can be addressed with the dataset itself, for example through cross-validation, and do not need to be determined by a separate fitting procedure. If there is interest, I can go into details in a future blog post.

If you ever want to use a random forest classifier in practice you should, of course, consider just using the scikit-learn implementation, which reduces the entire code above to 3 lines:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=100)
forest.fit(dataset_features, dataset_classes)
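Predicting is then one more line; here new_features stands for an array with the same columns as dataset_features:
predictions = forest.predict(new_features)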
I hope this exercise gave you some insight into what is hidden behind the sklearn black box. The entire code is also available on GitHub.
best
Florian

Useful resources:
Blog post by Jason Brownlee on building decision trees: https://machinelearningmastery.com/implement-decision-tree-algorithm-scratch-python

Monday, December 18, 2017

Web scraping with Python tutorial, part 2 -- scrapy.py

This is the second part of this tutorial about web scraping with Python. In the first part we looked at scraping static content from a particular website, using the structure of that website's HTML to access the desired information. In this part, we will look at web scrapers that could potentially search the entire internet.

It would be quite simple to extend the scraper we already developed to search the entire internet rather than just one website. All we need to do is search each page for links and, after processing a page, follow one of the links found on it. If we build up a list of links this way, it is very unlikely that we will ever run out of pages to visit.
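To make the idea concrete, here is a minimal sketch of such a crawler using requests and a naive regular expression for links; the start URL, the page limit and the href pattern are illustrative assumptions, not what we will use below:
import re
import requests

# keep a queue of URLs, fetch each page, and push any links found back onto the queue
to_visit = ['https://stackoverflow.com/']
visited = set()
while to_visit and len(visited) < 10:      # small limit, just for the example
    url = to_visit.pop(0)
    if url in visited:
        continue
    visited.add(url)
    try:
        html = requests.get(url, timeout=5).text
    except requests.RequestException:
        continue
    to_visit += re.findall(r'href="(https?://[^"]+)"', html)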

Rather than building such a scraper ourselves, we are going to make use of Scrapy, a Python package built exactly for the purpose of large-scale web scraping. Install Scrapy with

    pip install scrapy

and create a new project with

    scrapy startproject email_scraper

Now there should be a folder called email_scraper. The scraping commands go into email_scraper/spiders, where I wrote a short spider called email_spider.py:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import re
# scrapy crawl email_spider
def write_to_file(emails):
    # append the found addresses to a CSV file; the with-block closes the file
    with open("list_of_emails.csv", "a") as myfile:
        for email in emails:
            myfile.write("%s\n" % email)
    return

class PureEvil(CrawlSpider):
    name = "email_spider"
    download_delay = 1.0
    allowed_domains = ['stackoverflow.com'] 
    start_urls = ['https://stackoverflow.com/']
    rules = [Rule(
        LinkExtractor(allow=".*"),
        callback='parse_page', 
        follow=True)
    ]
    def parse_page(self, response):
        emails = re.findall(r'[\w\.-]+@[\w\.-]+', response.body.decode('utf-8'))
        if emails:
            write_to_file(emails)
        return
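To run the spider, change into the email_scraper project directory and execute

    scrapy crawl email_spider

The collected email addresses end up in list_of_emails.csv.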
This spider is named email_spider, and if you go through the lines of code you can see why: it searches the page source for email addresses and stores them in a CSV file. It is restricted to the stackoverflow.com domain and starts at the main page. It automatically collects all the links on every page it visits and follows them until it has crawled the entire allowed domain.

You can see how easy it is to write a scraper with scrapy. Just a few lines of code give us a very powerful tool. Let me know if you have comments/questions below.
cheers
Florian