We can build a random forest from a training sample of size $M$ by following these four steps:
1. Randomly select $k \ll M$ samples with replacement from the dataset (bagging).
2. Build a decision tree: pick a feature at random from the set of $F$ available features and determine the split point with the highest information gain.
3. Continue picking features at random and splitting until the decision tree reaches a maximum depth or consists only of leaf nodes.
4. Repeat steps 1-3 $B$ times to build a random forest classifier with $B$ trees.
Step 1: Bagging
Bagging, or bootstrapping, describes building sub-samples of a larger sample by picking $k$ objects with replacement.
import numpy as np

# draw k row indices with replacement (np.random.choice samples with replacement by default)
k_indices = np.random.choice(len(train), k)
bootstrap_sample = train[k_indices]
Using bootstrapped training samples is a way to decorrelate the decision trees. If the trees are uncorrelated, increasing the number of trees will always lead to a reduction in the variance of the random forest without increasing the bias. The uncertainty of a random forest with $B$ uncorrelated decision trees can be calculated as the variance between the different trees:
\[\sigma^2 = \frac{1}{B-1}\sum^B_{i=1}\left[f_i(x') - \overline{f}(x')\right]^2\]
The training and test errors tend to level off after a certain number of trees have been fit; this number can be determined using cross-validation.
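As a small illustration of this formula in code, here is a sketch that computes the variance of the individual tree predictions for one input row. It relies on the list of trees forest and the traverse_tree() helper that are built later in this post:

import numpy as np

def forest_variance(forest, row):
    # Per-tree predictions f_i(x') for the input row x'
    # (traverse_tree() is the per-tree prediction helper defined later in this post)
    tree_predictions = np.array([traverse_tree(tree_root, row) for tree_root in forest])
    B = len(tree_predictions)
    # Empirical variance across the B trees, with the 1/(B-1) normalisation from the formula above
    return np.sum((tree_predictions - tree_predictions.mean())**2) / (B - 1)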
Step 2: Build a decision tree
We start with the root node. To fully specify this node we need to split the dataset into two daughter nodes. The way in which the dataset is split at each node is the main aspect that characterizes the tree we are building.
In our case, we determine the split point by the maximum information gain. The information gain is defined as
\[IG = I(D_p) - \frac{N_{left}}{N_p}I(D_{left}) - \frac{N_{right}}{N_p}I(D_{right})\]
and $I(D)$ can be either the Gini index or the entropy:
\[
\begin{align}
I^{Gini}(D) &= 1 - \sum^{N_{classes}}_{i=1}p_i^2\\
I^{Entropy}(D) &= - \sum^{N_{classes}}_{i=1}p_i\log(p_i)
\end{align}
\]
where $p_i$ is the probability that a randomly picked element from that node is a member of class $i$. For the rest of this blog post I will focus on the Gini index.
Let's go through an example. Assume we want to build a decision tree to decide whether a person is female or male, based on a set of features, so we have only two classes: female and male. Consider a parent node containing 80 people, 40 from each class, and the following two possible split points: split 1 sends 30 members of one class and 10 of the other to the left daughter node (and the remaining 10 and 30 to the right one), while split 2 sends 20 and 40 to the left daughter node and the remaining 20 members of the first class to the right one.
Given these two possibilities, let's calculate the information gain for split number 1:
\[\begin{align}
I^{Gini}(D_{parent}) &= 1 - \left[\left(\frac{40}{80}\right)^2 + \left(\frac{40}{80}\right)^2\right] = 1 - (0.5^2 + 0.5^2) = 0.5\\
I^{Gini}(D_{left}) &= 1 - \left[\left(\frac{30}{40}\right)^2 + \left(\frac{10}{40}\right)^2\right] = 1 - \left[\left(\frac{3}{4}\right)^2 + \left(\frac{1}{4}\right)^2\right] = \frac{3}{8} = 0.375\\
I^{Gini}(D_{right}) &= 1 - \left[\left(\frac{10}{40}\right)^2 + \left(\frac{30}{40}\right)^2\right] = 1 - \left[\left(\frac{1}{4}\right)^2 + \left(\frac{3}{4}\right)^2\right] = \frac{3}{8} = 0.375\\
IG &= 0.5 - \frac{40}{80}\times 0.375 - \frac{40}{80}\times 0.375 = 0.125
\end{align}\]
So the information gain is $0.125$. Let's do the same for split number 2:
\[\begin{align}
I^{Gini}(D_{parent}) &= 0.5 \text{ (as before)}\\
I^{Gini}(D_{left}) &= 1 - \left[\left(\frac{20}{60}\right)^2 + \left(\frac{40}{60}\right)^2\right] = 1 - \left[\left(\frac{1}{3}\right)^2 + \left(\frac{2}{3}\right)^2\right] = 1 - \frac{5}{9} = 0.44\\
I^{Gini}(D_{right}) &= 1 - \left[\left(\frac{20}{20}\right)^2 + \left(\frac{0}{20}\right)^2\right] = 1 - (1 + 0) = 0\\
IG &= 0.5 - \frac{60}{80}\times 0.44 - \frac{20}{80}\times 0 = 0.17
\end{align}\]
So the second split is preferred. In general, splits which separate the classes well have a large information gain, and a complete separation of the classes corresponds to the maximum possible information gain, which for two equally populated classes is $0.5$. An information gain of $0.5$ therefore means that the daughter nodes are pure: all classes are separated, so there is no point in growing the tree any deeper below that node.
Here are the functions which calculate the information gain using the Gini index:
def calc_information_gain(groups, list_of_class_ids):
    # count all samples
    Nall = sum([len(group) for group in groups])
    # calculate Gini index of parent node
    all_rows = [row for group in groups for row in group]
    IG = calc_gini(all_rows, list_of_class_ids)
    # calculate Gini index of daughter nodes
    for group in groups:
        IG -= calc_gini(group, list_of_class_ids)*len(group)/Nall
    return IG

def calc_gini(group, list_of_class_ids):
    Ngroup = len(group)
    if Ngroup == 0:
        return 0
    dataset_class_ids = [row[-1] for row in group]
    sum_over_classes = 0.
    for class_id in list_of_class_ids:
        prob = dataset_class_ids.count(class_id)/Ngroup
        sum_over_classes += prob**2
    return 1. - sum_over_classes
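As a quick sanity check, here is a small sketch that reproduces the worked example above, using hypothetical rows of the form [feature, class_id] with class ids 0 and 1 standing in for the two classes:

# Split 1: left node with 30 and 10 rows of the two classes, right node with 10 and 30
left = [[0., 0]]*30 + [[0., 1]]*10
right = [[0., 0]]*10 + [[0., 1]]*30
print(calc_information_gain([left, right], [0, 1]))  # 0.125

# Split 2: left node with 20 and 40 rows, right node with 20 rows of a single class
left = [[0., 0]]*20 + [[0., 1]]*40
right = [[0., 0]]*20
print(calc_information_gain([left, right], [0, 1]))  # 0.1666...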
Note that a tree built like this closely follows the training dataset and hence is susceptible to overfitting. This is the reason why random forest classifiers build multiple trees on random subsets of the training dataset.
Splitting the dataset: splitting a dataset involves iterating over all rows, checking whether the feature value is below or above the split value, and assigning the row to the left or right group, respectively.
def split_node(index, value, dataset):
    '''
    Split the dataset into two using a feature index and feature value
    '''
    left = []
    right = []
    for row in dataset:
        if row[index] < value:
            left.append(row)
        else:
            right.append(row)
    return [left, right]
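For example, here is a small usage sketch with a made-up toy dataset of rows of the form [feature_0, feature_1, class_id]:

# Toy rows: the first two belong to class 0, the last two to class 1
toy_dataset = [[1.0, 0.5, 0], [2.0, 1.5, 0], [3.0, -0.5, 1], [4.0, -1.5, 1]]
left, right = split_node(0, 2.5, toy_dataset)
print(left)   # rows whose feature 0 is below 2.5
print(right)  # rows whose feature 0 is 2.5 or above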
The feature on which to split is chosen randomly when building a random forest. To find the value at which to split, we have to go through all possible splits and evaluate the information gain, which can be an expensive calculation. Our approach is to take every value of that feature which occurs in the training dataset as a candidate split and evaluate the information gain for each of them; the cost therefore scales with the number of rows in the training dataset.
With the Gini and information-gain functions above and the split_node() function, we now have everything we need to evaluate all possible splits.
def get_split(dataset, index):
    '''
    Evaluate all possible splits given the dataset and the index of
    the feature on which to split
    '''
    list_of_class_ids = list(set(row[-1] for row in dataset))
    split_value, max_IG, split_groups = 0., -1., None
    for row in dataset:
        groups = split_node(index, row[index], dataset)
        IG = calc_information_gain(groups, list_of_class_ids)
        if IG > max_IG:
            split_value, max_IG, split_groups = row[index], IG, groups
    # return the best split found (split_groups, not the last groups evaluated)
    return {'index': index, 'split_value': split_value, 'groups': split_groups}
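Continuing with the toy rows from the split_node() sketch above, a quick usage example:

# Find the best split value on feature 0 for the toy dataset
node = get_split(toy_dataset, index=0)
print(node['split_value'])                             # 3.0 for this toy data
print(len(node['groups'][0]), len(node['groups'][1]))  # 2 2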
Step 3: Building the decision tree
To build the tree we also need to decide how deep it should be. There are two possible stopping criteria: a maximum tree depth measured from the root node, and a minimum number of rows in a node. Here we will take both as input values for our tree-building function. To build the decision tree we then need to implement the following steps:
1. Randomly determine the first splitting feature.
2. Split the original dataset according to the first optimal split using the get_split() function.
3. Recursively repeat steps 1 and 2 on the daughter nodes until the tree has reached the maximum depth or either the left or the right group of rows is smaller than the minimum number of rows.
And here is the implementation:
def build_tree(train, max_depth, min_size):
    # randomly determine the feature index
    feature_index = int(np.random.random()*(len(train[0]) - 1))
    root = get_split(train, feature_index)
    split(root, max_depth, min_size, 1)
    return root

def to_terminal(group):
    # Create a terminal node value: the most common class in the group
    list_of_classes = [row[-1] for row in group]
    return max(set(list_of_classes), key=list_of_classes.count)

def split(node, max_depth, min_size, depth):
    left, right = node['groups']
    del(node['groups'])
    # check for a no split
    if not left or not right:
        node['left'] = node['right'] = to_terminal(left + right)
        return
    # check for max depth
    if depth >= max_depth:
        node['left'] = to_terminal(left)
        node['right'] = to_terminal(right)
        return
    # process left child
    if len(left) <= min_size:
        node['left'] = to_terminal(left)
    else:
        feature_index = int(np.random.random()*(len(left[0]) - 1))
        node['left'] = get_split(left, feature_index)
        split(node['left'], max_depth, min_size, depth+1)
    # process right child
    if len(right) <= min_size:
        node['right'] = to_terminal(right)
    else:
        feature_index = int(np.random.random()*(len(right[0]) - 1))
        node['right'] = get_split(right, feature_index)
        split(node['right'], max_depth, min_size, depth+1)
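To get a feel for the nested dictionaries this produces, here is a small, hypothetical helper (not part of the original code) that prints the structure of a single tree:

def print_tree(node, depth=0):
    # Internal nodes are dictionaries with 'index', 'split_value', 'left' and 'right';
    # leaf nodes are plain class values created by to_terminal()
    if isinstance(node, dict):
        print('%s[feature %d < %.3f]' % ('  '*depth, node['index'], node['split_value']))
        print_tree(node['left'], depth + 1)
        print_tree(node['right'], depth + 1)
    else:
        print('%s[class %s]' % ('  '*depth, str(node)))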
Step 4: Building the forest
Building the forest just means building multiple trees, so we call the build_tree() function multiple times with bootstrapped training samples:
def build_forest(train, k, N_trees):
    max_depth = 4
    min_size = 2
    forest = []
    for i in range(0, N_trees):
        # bootstrap training dataset
        k_indices = np.random.choice(len(train), k)
        forest.append(build_tree(train[k_indices], max_depth, min_size))
    return forest
And that is it. We now have a random forest classifier in the form of a list of dictionaries. To classify a new object we simply loop over the decision trees, collect each tree's prediction, and take the majority vote as the classification.
There are some hyper-parameters in the implementation above, for example what is the correct tree depth or minimum node size? Is it really correct to pick one feature randomly or should we split by multiple features? How many trees should the forest contain? All of these questions can be addressed with the dataset itself and do not actually need to be determined by a fitting procedure. If there is interest, I can go into details in a future blog post.
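For illustration, here is a minimal sketch of how one such hyper-parameter (the maximum tree depth) could be checked against a held-out part of the data. It assumes build_forest() is extended to accept max_depth as an argument, that make_prediction() from the full implementation below is available, and that train_rows and validation_rows are a hypothetical split of the dataset:

def accuracy(forest, rows):
    # Fraction of rows whose majority vote matches the true class in the last column
    correct = sum(1 for row in rows if make_prediction(forest, row) == row[-1])
    return correct / len(rows)

for max_depth in [2, 4, 6, 8]:
    forest = build_forest(train_rows, k=len(train_rows), N_trees=50, max_depth=max_depth)
    print(max_depth, accuracy(forest, validation_rows))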
Here is an implementation together with a test example:
def main():
    train = np.array([[3.77, 4.19, 0],
                      [4.77, 1.169761413, 0],
                      [-5., 2.81281357, 0],
                      [3.1, 2.61995032, 0],
                      [3.6, 2.209014212, 0],
                      [1.2, -3.162953546, 1],
                      [2.3, -3.339047188, 1],
                      [5.6, 0.476683375, 1],
                      [-1.3, -3.234550982, 1],
                      [2.1, -3.319983761, 1]])
    forest = build_forest(train, k=10, N_trees=100)
    for row in train:
        prediction = make_prediction(forest, row)
        print('truth = %d : prediction = %d' % (row[-1], prediction))
    return

def traverse_tree(node, row):
    # Follow the splits down the tree until a terminal (non-dictionary) node is reached
    if row[node['index']] < node['split_value']:
        if isinstance(node['left'], dict):
            return traverse_tree(node['left'], row)
        else:
            return node['left']
    else:
        if isinstance(node['right'], dict):
            return traverse_tree(node['right'], row)
        else:
            return node['right']

def make_prediction(forest, row):
    # Majority vote over the predictions of all trees in the forest
    list_of_classes = []
    for tree_root in forest:
        list_of_classes.append(traverse_tree(tree_root, row))
    return max(set(list_of_classes), key=list_of_classes.count)
If you ever want to use a random forest classifier in practice you should, of course, consider just using the scikit-learn implementation, which reduces the entire code above to three lines:
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100)
forest.fit(dataset_features, dataset_classes)
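Predictions for new rows then come from the classifier's predict() method (new_features here is a placeholder for an array with the same feature columns as the training data):

predictions = forest.predict(new_features)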
Best,
Florian
Useful resources:
Blog post by Jason Brownlee on building decision trees: https://machinelearningmastery.com/implement-decision-tree-algorithm-scratch-python