
Wednesday, December 20, 2017

Building a random forest classifier from scratch in Python

A random forest classifier uses an ensemble of decision trees to classify objects. Each tree casts a vote, the votes are counted, and the class with the most votes wins. The reason for using multiple decision trees is to reduce the overfitting that individual decision trees are prone to.

We can build a random forest from a training sample of size $M$ by following these 4 steps:

1. Randomly select $k \ll M$ samples with replacement from the dataset (bagging).
2. Build a decision tree: Pick a feature randomly from the set of $F$ available features and determine the split points with the highest information gain.
3. Continue picking features randomly until the decision tree reaches a maximum depth or has only leaf nodes.
4. Repeat steps 1-3 $B$ times to build a random forest classifier with $B$ trees.

Step 1: Bagging


Bagging (bootstrap aggregating) describes building a sub-sample of a larger sample by picking $k$ objects with replacement:
import numpy as np

# draw k row indices with replacement (replacement is the default for np.random.choice)
k_indices = np.random.choice(len(train), k)
train[k_indices]
Using bootstrapped training samples is a way to decorrelate the decision trees. If the trees are uncorrelated, increasing the number of trees will always lead to a reduction in the variance of the random forest without increasing the bias.

The uncertainty of a random forest with $B$ uncorrelated decision trees can be estimated from the variance of the individual tree predictions around their mean
\[\sigma^2 = \frac{1}{B-1}\sum^B_{b=1}\left[f_b(x') - \overline{f}(x')\right]^2\]
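As a small numerical illustration of this formula (the per-tree predictions below are made up for this sketch), the spread of the individual tree predictions for a single test point $x'$ can be estimated like this:
import numpy as np

# hypothetical per-tree predictions f_b(x') for one test point x'
tree_predictions = np.array([0.80, 0.65, 0.90, 0.70, 0.75])

# sample variance across the B trees, i.e. the formula above with the 1/(B-1) normalization
sigma_squared = np.var(tree_predictions, ddof=1)
print(sigma_squared)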
The training and test errors tend to level off after some number of trees has been fit; this number can be determined using cross-validation.

Step 2: Build a decision tree


We start with the root node. To fully specify this node we need to split the dataset into two daughter nodes. The way in which the dataset is split at each node is the main design choice in building the tree.

In our case, we determine the split point by the maximum information gain. The information gain is defined as
\[IG = I(D_p) - \frac{N_{left}}{N_p}I(D_{left}) - \frac{N_{right}}{N_p}I(D_{right})\]
where $N_p$, $N_{left}$ and $N_{right}$ are the number of samples in the parent and the two daughter nodes, and the impurity $I(D)$ can be the Gini index or the entropy:
\[
\begin{align}
I^{Gini}(D) &= 1 - \sum^{N_{classes}}_{i}p_i^2\\
I^{Entropy}(D) &= - \sum^{N_{classes}}_{i}p_i\log(p_i)
\end{align}
\]
where $p_i$ is the probability that a randomly picked element of the node belongs to class $i$. For the rest of this blog post I will just focus on the Gini index.
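To get a feeling for the two impurity measures, here is a small sketch (separate from the classifier code below) comparing them for a two-class node as a function of the class probability $p$:
import numpy as np

def gini(p):
    # two-class Gini index: 1 - (p^2 + (1-p)^2)
    return 1. - (p**2 + (1. - p)**2)

def entropy(p):
    # two-class entropy: -p*log(p) - (1-p)*log(1-p)
    return -p*np.log(p) - (1. - p)*np.log(1. - p)

# both measures are smallest for pure nodes and largest for p = 0.5
for p in [0.1, 0.25, 0.5]:
    print("p = %.2f: Gini = %.3f, entropy = %.3f" % (p, gini(p), entropy(p)))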

Let's go through an example. Assume we want to build a decision tree to decide whether a person is female or male, based on a set of features, so we have only two classes. Consider a parent node with 40 members of each class and the following two possible split points: split 1 produces two daughter nodes of 40 people each, one containing 30 females and 10 males and the other 10 females and 30 males; split 2 produces a left node with 20 females and 40 males and a right node with the remaining 20 females.

Given these two possibilities, let's calculate the information gain for split number 1:
\[\begin{align}
I^{Gini}(D_{parent}) &= 1 - \left[\left(\frac{40}{80}\right)^2 + \left(\frac{40}{80}\right)^2\right] = 1 - (0.5^2 + 0.5^2) = 0.5\\
I^{Gini}(D_{left}) &= 1 - \left[\left(\frac{30}{40}\right)^2 + \left(\frac{10}{40}\right)^2\right] = 1 - \left[\left(\frac{3}{4}\right)^2 + \left(\frac{1}{4}\right)^2\right] = \frac{3}{8} = 0.375\\
I^{Gini}(D_{right}) &= 1 - \left[\left(\frac{10}{40}\right)^2 + \left(\frac{30}{40}\right)^2\right] = 1 - \left[\left(\frac{1}{4}\right)^2 + \left(\frac{3}{4}\right)^2\right] = \frac{3}{8} = 0.375\\
IG &= 0.5 - \frac{40}{80}\times 0.375 - \frac{40}{80}\times 0.375 = 0.125
\end{align}\]
So the information gain is $0.125$. Let's do the same for split number 2:
\[\begin{align}
I^{Gini}(D_{parent}) &= 0.5 \text{   (as before)}\\
I^{Gini}(D_{left}) &= 1 - \left[\left(\frac{20}{60}\right)^2 + \left(\frac{40}{60}\right)^2\right] = 1 - \left[\left(\frac{1}{3}\right)^2 + \left(\frac{2}{3}\right)^2\right] = 1 - \frac{5}{9} = 0.44\\
I^{Gini}(D_{right}) &= 1 - \left[\left(\frac{20}{20}\right)^2 + \left(\frac{0}{20}\right)^2\right] = 1 - (1 + 0) = 0\\
IG &= 0.5 - \frac{60}{80}\times 0.44 - 0 = 0.17
\end{align}\]
So the second split is preferred. In general, splits which separate the classes well have a large information gain, and a complete separation of two equally sized classes corresponds to the maximum information gain of $0.5$ (with the Gini index). An information gain of $0.5$ means the split recovers all of the class information, so there is no point in growing the tree any deeper below such a node.

Here are the functions which calculate the information gain using the Gini index:
def calc_information_gain(groups, list_of_class_ids):
    # count all samples
    Nall = sum([len(group) for group in groups])
    # calculate Gini index of parent node
    all_rows = [row for group in groups for row in group]
    IG = calc_gini(all_rows, list_of_class_ids)
    # calculate Gini index of daughter nodes
    for group in groups:
        IG -= calc_gini(group, list_of_class_ids)*len(group)/Nall
    return IG

def calc_gini(group, list_of_class_ids):
    Ngroup = len(group)
    if Ngroup == 0:
        return 0
    dataset_class_ids = [row[-1] for row in group]
    sum_over_classes = 0.
    for class_id in list_of_class_ids:
        prob = dataset_class_ids.count(class_id)/Ngroup
        sum_over_classes += prob**2
    return 1. - sum_over_classes
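As a quick check (a small test I added, encoding female as class 0 and male as class 1), these functions reproduce the information gains calculated by hand above:
# split 1: left node with (30, 10), right node with (10, 30)
left = [[0]]*30 + [[1]]*10
right = [[0]]*10 + [[1]]*30
print(calc_information_gain([left, right], [0, 1]))  # -> 0.125

# split 2: left node with (20, 40), right node with (20, 0)
left = [[0]]*20 + [[1]]*40
right = [[0]]*20
print(calc_information_gain([left, right], [0, 1]))  # -> ~0.167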
Note that a tree built like this closely follows the training dataset and is hence susceptible to overfitting. This is why random forest classifiers build multiple trees with random subsets of the training dataset.

Splitting the dataset: Splitting a dataset involves iterating over all rows in the dataset, checking if the feature value is below or above the split value and assigning it to the left or right group, respectively.
def split_node(index, value, dataset):
    ''' Split the dataset into two using a feature index and 
    feature value '''
    left = []
    right = []
    for row in dataset:
        if row[index] < value:
            left.append(row)
        else:
            right.append(row)
    return [left, right]
The feature on which to split is chosen randomly when building a random forest. To find the value on which to split, we have to go through all possible splits and evaluate the information gain. This can be a very expensive calculation.

Our approach is to pick all possible values of the feature within the training dataset and evaluate the information gain for all of them. In that case, the cost will scale with the number of rows in the training dataset.

With the Gini function above and the test split function, we now have everything we need to evaluate all possible splits.
def get_split(dataset, index):
    ''' Evaluate all possible splits given the dataset and the index of 
    the feature on which to split '''
    list_of_class_ids = list(set(row[-1] for row in dataset))
    split_value, max_IG, split_groups = 0., -1., None
    for row in dataset:
        groups = split_node(index, row[index], dataset)
        IG = calc_information_gain(groups, list_of_class_ids)
        if IG > max_IG:
            split_value, max_IG, split_groups = row[index], IG, groups
    # return the split with the highest information gain
    return {'index': index, 'split_value': split_value, 'groups': split_groups}
To build the tree we also need to decide how deep it should be. There are two common stopping criteria: a maximum tree depth measured from the root node and a minimum number of rows in a node. Here we will take both as input values for our tree-building function.

Step 3: Building the decision tree


To build the decision tree we need to implement the following steps:

1. Randomly determine the first splitting feature.
2. Split the original dataset according to the first optimal split using the get_split() function.
3. Recursively repeat steps 1 and 2 on the daughter nodes until the tree reaches the maximum depth or the left or right group of rows is smaller than the minimum number of rows.

And here is the implementation:
def build_tree(train, max_depth, min_size):
    # randomly determine the index of the feature to split on
    # (the last column holds the class label, hence the -1)
    feature_index = int( np.random.random()*(len(train[0]) - 1) )
    root = get_split(train, feature_index)
    split(root, max_depth, min_size, 1)
    return root

def to_terminal(group):
    # Create a terminal node value
    list_of_classes = [row[-1] for row in group]
    return max(set(list_of_classes), key=list_of_classes.count)

def split(node, max_depth, min_size, depth):
    left, right = node['groups']
    del(node['groups'])
    # check for a no split
    if not left or not right:
        node['left'] = node['right'] = to_terminal(left + right)
        return
    # check for max depth
    if depth >= max_depth:
        node['left'] = to_terminal(left)
        node['right'] = to_terminal(right)
        return
    # process left child
    if len(left) <= min_size:
        node['left'] = to_terminal(left)
    else:
        feature_index = int( np.random.random()*(len(right[0]) - 1) )
        node['left'] = get_split(left, feature_index)
        split(node['left'], max_depth, min_size, depth+1)
    # process right child
    if len(right) <= min_size:
        node['right'] = to_terminal(right)
    else:
        feature_index = int( np.random.random()*(len(right[0]) - 1) )
        node['right'] = get_split(right, feature_index)
        split(node['right'], max_depth, min_size, depth+1)

Step 4: Building the forest


Building the forest just means building multiple trees, so we call the build_tree() function multiple times with bootstrapped training samples:
def build_forest(train, k, N_trees):
    max_depth = 4
    min_size = 2
    forest = []
    for i in range(0, N_trees):
        # bootstrap training dataset
        k_indices = np.random.choice(len(train), k)
        forest.append(build_tree(train[k_indices], max_depth, min_size))
    return forest
And that is it. We now have a random forest classifier in the form of a list of dictionaries. To classify a new object we just loop over the decision trees, store each tree's result, and take the majority decision as the classification.

Here is an implementation together with a test example:
def main():
    train = np.array([[3.77, 4.19, 0],
                      [4.77, 1.169761413, 0],
                      [-5., 2.81281357, 0],
                      [3.1, 2.61995032, 0],
                      [3.6, 2.209014212, 0],
                      [1.2, -3.162953546, 1],
                      [2.3, -3.339047188, 1],
                      [5.6, 0.476683375, 1],
                      [-1.3, -3.234550982, 1],
                      [2.1, -3.319983761, 1]])
    forest = build_forest(train, k=10, N_trees=100)
    for row in train:
        prediction = make_prediction(forest, row)
        print('truth = %d : prediction = %d' % (row[-1], prediction))
    return

def traverse_tree(node, row):
    if row[node['index']] < node['split_value']:
        if isinstance(node['left'], dict):
            return traverse_tree(node['left'], row)
        else:
            return node['left']
    else:
        if isinstance(node['right'], dict):
            return traverse_tree(node['right'], row)
        else:
            return node['right']

def make_prediction(forest, row):
    list_of_classes = []
    for tree_root in forest:
        list_of_classes.append(traverse_tree(tree_root, row))
    return max(set(list_of_classes), key=list_of_classes.count)
There are some hyper-parameters in the implementation above: What is the correct tree depth or minimum node size? Is it really enough to pick one feature randomly at each node, or should we consider several features? How many trees should the forest contain? All of these questions can be addressed with the dataset itself, for example through cross-validation, and I can go into details in a future blog post if there is interest. A small sketch of such a scan follows below.
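One possible way to explore these choices (a sketch using scikit-learn's cross-validation with its own RandomForestClassifier, not the code above; dataset_features and dataset_classes are the same arrays used in the scikit-learn example below):
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# scan the maximum tree depth and keep the value with the best cross-validated score
for max_depth in [2, 4, 8, None]:
    forest = RandomForestClassifier(n_estimators=100, max_depth=max_depth)
    scores = cross_val_score(forest, dataset_features, dataset_classes, cv=5)
    print(max_depth, np.mean(scores))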

If you ever want to use a random forest classifier you should, of course, consider just using the scikit-learn implementation, which reduces the entire code above to 3 lines:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=100)
forest.fit(dataset_features, dataset_classes)
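Predictions for new objects then come from the fitted classifier (new_features below is just a placeholder for the feature matrix of the objects to classify):
# predicted class labels for new objects
predicted_classes = forest.predict(new_features)
# class probabilities, averaged over the trees in the forest
predicted_probabilities = forest.predict_proba(new_features)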
I hope this exercise gave you some insight into what is hidden behind the sklearn black box. The entire code is also available on GitHub.
best
Florian

Useful resources:
Blog post by Jason Brownlee on building decision trees: https://machinelearningmastery.com/implement-decision-tree-algorithm-scratch-python

Tuesday, June 6, 2017

Building a recommendation engine for arXiv papers using Python, sklearn and NLP

In this blog post, I will describe how I built a recommendation engine for arXiv papers for the web application www.benty-fields.com.

benty-fields is used by researchers in physics, mathematics and computer science to follow the latest publications and organize journal clubs. The website collects data on how users interact with publications, for example whether they vote for a paper or put it in their personal library. Here I will use those data to build a recommendation engine.

There are two different use cases for this recommendation engine:

(1) When a user visits the website to read the latest publications, I want that list of publications ordered, so that the most interesting publications for this user are at the top.
(2) I want to present a list of publications to the user with the title 'papers you might have missed'. This list should include papers from the last 3 months, which the user has not shown any interest in yet (has not voted for or put it in her/his library), but which might be of interest.

We will use exactly the same model for both cases. So let's get started.

First, we have to think about what features we want to use to train a model. The data we have available are the papers the user liked in the past. These papers have abstracts, titles, authors, publication dates and a lot more information we could use. Here we will restrict the analysis to the abstract, title and author list. I tested other features but the most useful feature seems to be the abstract, with the title and author list adding marginal information.

First we turn the list of abstracts, titles and authors into a feature vector:
from sklearn.feature_extraction.text import TfidfVectorizer

# We calculate the tf-idf for each feature
vectorizer = TfidfVectorizer(analyzer="word", tokenizer=None,
                             stop_words="english", max_features=5000)
# fit_transform() serves two functions: First, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list 
# of strings.
train_data_features1 = vectorizer.fit_transform(list_of_abstracts)
train_data_features2 = vectorizer.fit_transform(list_of_titles)
train_data_features3 = vectorizer.fit_transform(list_of_authors)
I use the TfidfVectorizer() from the scikit-learn package to turn the text into a matrix of papers vs. term frequency-inverse document frequency (tf-idf) scores. tf-idf weighs the frequency of a term in a document against how often the term appears in the other documents. A term that appears in every document gets a very small tf-idf score, since it has little distinctive power, while a term that only appears in a few documents is very useful for separating papers into categories and therefore gets a large tf-idf score.
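As a toy illustration of this weighting (a sketch, independent of the benty-fields code), a word that occurs in every document gets a lower weight than a rarer word:
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["dark matter and dark energy",
        "dark photon searches",
        "galaxy surveys and dark energy"]
vectorizer = TfidfVectorizer(analyzer="word", stop_words="english")
tfidf = vectorizer.fit_transform(docs)

# 'dark' appears in every document and therefore gets a lower idf weight
# than e.g. 'photon', which appears in only one document
for term in ["dark", "photon"]:
    idx = vectorizer.vocabulary_[term]
    print(term, vectorizer.idf_[idx])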

Before I feed the abstract, title and author list into the vectorizer I pre-process the data:
import re

import nltk
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    # remove numbers
    text_wo_numbers = re.sub(r'\d+', '', text)
    # remove single letters
    text_wo_single_letters = re.sub(r'\b\w\b', ' ', text_wo_numbers)
    # remove punctuation 
    text_wo_punctuation = ' '.join(re.findall(r'\w+',
                                              text_wo_single_letters))
    tokens = nltk.word_tokenize(text_wo_punctuation)
    stems = stem_tokens(tokens, stemmer)
    return stems

def pre_process_text(list_of_texts):
    output_list_of_texts = []
    for text in list_of_texts:
        output_list_of_texts.append(' '.join( tokenize(text) ) )
    return output_list_of_texts
This code uses the nltk library to remove numbers, punctuation and one-letter words from the abstract and title. It also reduces words to their stems (e.g. equations -> equat), which ensures that e.g. equation and equations map to the same token. I leave the removal of stop words to TfidfVectorizer(). The author list is not processed in this way, since it does not make much sense to turn names into stems.
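For example, running a short sentence through this pre-processing gives roughly the following (the exact stems depend on the nltk version):
print(tokenize("We solve 3 nonlinear equations"))
# -> something like ['we', 'solv', 'nonlinear', 'equat']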

Next, we can use these features to train a model. I am using a random forest model which I train for each user on the papers that user liked, together with an equally sized set of negative papers (papers the user does not seem to be interested in). The negative papers are picked from the same time period as the positive papers, and I made sure that none of them appears in the positive paper set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Now we get the negative paper examples
negative_papers = get_negative_arxiv_papers(user, len(input_papers))

paper_sample = input_papers + negative_papers
y = [1]*len(input_papers) + [0]*len(negative_papers)
y = np.array(y)
        
X1, X2, X3 = prepare_data(paper_sample, user, write=True)
X_train = np.c_[X1, X2, X3]

forest = RandomForestClassifier(n_estimators=100)
# forest now stores the fit information used later for the prediction
forest.fit(X_train, y)
The function get_negative_arxiv_papers() is very specific to how I store the data, and I will not discuss its details here. If you are interested in the details, I posted the complete code on GitHub. The function prepare_data() contains the data cleaning and vectorization steps discussed above.

However, it is important to pick the correct set of negative papers. In my case I could, for example, pick just any paper in the arXiv library. The problem is that users follow only certain arXiv categories, so the fact that a user who follows the category 'Statistics' has never shown any interest in papers from the category 'Optics' is not very surprising since the user has probably never seen these papers.

The model developed here is supposed to order the papers in the category the user is looking at, so it has to distinguish between papers in the user specific category. Therefore I pick the negative papers also from this category. It might be interesting to train a more general model and suggest papers outside the user's specified field of interest, but this is not the purpose of this model. Note that these choices impact the performance of the model significantly, so I did a lot of tests to make sure to pick a representative negative dataset. It is important to pick the training data according to the questions you are trying to answer.

Ok, now that we have a model, we can test it. One question we need to answer is when to use the model at all. For many users the available training data are very limited, simply because many users are not very active. If the model has to be trained on a very small dataset, it might be better to put it aside and instead recommend the most popular papers.

So when should we use the model to derive recommendations and when is it better to instead use the most popular papers? Here is a plot which shows the precision of the random forest model as a function of the size of the training dataset.

Figure 1: Precision reached by the Random Forest model as a function of the size of the
training dataset for each user. The precision is calculated as the average over 10 cross-validation steps.
Figure 1 shows that the error bars on the precision grow as the training dataset gets smaller, and as soon as the error bar includes a precision of 0.5 the model basically behaves like a random selection. From this plot I determined that fewer than 50 training papers do not result in a predictive model. I use this threshold to distinguish active users, for whom the model provides paper recommendations, from less active users, for whom I recommend the most popular papers instead; a sketch of this logic is shown below.
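A minimal sketch of this decision (get_recommendations() and get_most_popular_papers() are hypothetical helpers I use only for illustration; paper_click_data() and get_tf_idf_ranking() are the functions from my code, the latter shown below):
def get_recommendations(user, candidate_papers):
    # only users with enough training data get the personalized model
    if len(paper_click_data(user)) < 50:
        # hypothetical fallback: recommend the currently most popular papers instead
        return get_most_popular_papers(candidate_papers)
    # otherwise rank the candidates with the per-user random forest model
    scores = get_tf_idf_ranking(candidate_papers, user)
    return [candidate_papers[i] for i in np.argsort(scores)[::-1]]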

So here is the function which uses the random forest model to rank papers by predicted user interest:
def get_tf_idf_ranking(papers, user):
    ''' This function is used to rank papers by user interest '''
    X1, X2, X3 = prepare_data(papers, user, read=True)
    X_pred = np.c_[X1, X2, X3]

    model_file = ("%d/recommendation_model_%d.pickle" % 
                  (app.config['ML_FOLDER'], user.id))
    forest = pickle.load( open( model_file, "rb" ) )

    # Use the model for prediction
    y_proba = forest.predict_proba(X_pred)

    return y_proba[:,1]
I just load the random forest model and use its predict_proba() function. The preparation of the data in prepare_data() is very similar to what I did above, but now I read in the feature list instead of creating a new one.
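For use case (2), the same probabilities can be used to select papers the user might have missed. A small sketch (recent_papers is a placeholder for the papers of the last 3 months the user has not interacted with, and the 0.5 cut is an arbitrary choice):
probabilities = get_tf_idf_ranking(recent_papers, user)
missed_papers = [paper for paper, prob in zip(recent_papers, probabilities)
                 if prob > 0.5]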

My code to test the recommendation model is as follows:
import math
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (precision_recall_curve, roc_curve, auc,
                             average_precision_score)

def get_auc_through_cross_validation(user):
    ''' 
    This function calculates the area under the curve (AUC)
    for the paper prediction model
    '''
    input_papers = paper_click_data(user)
    negative_papers = get_negative_arxiv_papers(user, len(input_papers))

    paper_sample = input_papers + negative_papers
    y = [1]*len(input_papers) + [0]*len(negative_papers)
    y = np.array(y)

    X1, X2, X3 = prepare_data(paper_sample, user, write=False)
    X = np.c_[X1, X2, X3]

    metrics = { 
        'num_cases': len(X),   
        'curves': [],
        'aucs': [],
        'pr': []
    }
    cross_validation_steps = 10
    kf = KFold(n_splits=cross_validation_steps, shuffle=True)
    forest = RandomForestClassifier(n_estimators=100)

    for train, test in kf.split(X):
        X_train, X_test = X[train], X[test]
        y_train, y_test = y[train], y[test]

        forest.fit(X_train, y_train)
        probabilities = forest.predict_proba(X_test)[:,1]
        precision, recall, thresholds = precision_recall_curve(y_test,
                                                               probabilities,
                                                               pos_label=1)
        thresholds = np.append(thresholds, np.array([1]))

        false_positive_rate, true_positive_rate, thresholds2 = roc_curve(y_test,
                                                                         probabilities,
                                                                         pos_label=1)
        roc_auc = auc(false_positive_rate, true_positive_rate)
        print(roc_auc)
        if not math.isnan(roc_auc):
            av_pr = average_precision_score(y_test, probabilities)

            # fraction of papers which would be recommended at each threshold
            case_rate = []
            for threshold in thresholds:
                case_rate.append(np.mean(probabilities >= threshold))

            curves = {
                'thresholds': thresholds,
                'precision': precision,
                'recall': recall,
                'case_rate': case_rate,
                'fpr': false_positive_rate,
                'tpr': true_positive_rate
            }
            metrics['curves'].append(curves)
            metrics['aucs'].append(roc_auc)
            metrics['pr'].append(av_pr)

    plot_cross_validation_result(user, metrics)
    return metrics
This code calculates a large set of evaluation metrics, including Receiver Operating Characteristic (ROC) and precision-recall curves for each model.

Figure 2: Precision-Recall curve and ROC curve for the Random Forest model of the most active user on benty-fields.com.

The different lines correspond to the 10 cross-validation steps. The area under the curve (AUC) tells us the predictive power of the model, with 1 being perfect and 0.5 representing a random selection. Our AUC of about 0.8 is pretty good, and since we have the cross-validation results we can also calculate an error on this number.

The code to produce these plots is:
import numpy as np
import matplotlib.pyplot as plt

def plot_cross_validation_result(user, results):
    # ROC curve plot
    # average values for the legend
    auc_av = "{0:.2f}".format(np.mean(results['aucs']))
    auc_sd = "{0:.2f}".format(np.std(results['aucs']))

    plt.clf()
    plt.figure(2)
    # plot each individual ROC curve
    for chart in results['curves']:
        plt.plot(chart['fpr'], chart['tpr'], color='b', alpha=0.5)
    plt.plot([0,1],[0,1],'b--')
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.title('Cross-validated ROC for user %d (sample size %d)' % 
              (user.id, len(user.paper_click_objects)))
    plt.text(0.6, 0.1, r'AUC = {av} $\pm$ {sd}'.format(av=auc_av, sd=auc_sd))
    plt.savefig("%s/user_like_roc_curve2_%d.png" % 
                (app.config['STATISTICS_FOLDER'], user.id))
    # Precision-recall plot
    pr_av = "{0:.2f}".format(np.mean(results['pr']))
    pr_sd = "{0:.2f}".format(np.std(results['pr']))

    plt.clf()
    plt.figure(4)
    for chart in results['curves']:
        plt.plot(chart['recall'], chart['precision'], 
                 color='b', alpha=0.5)
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title('Cross-validated prec/rec for user %d (sample size %d)' %
              (user.id, len(user.paper_click_objects)))
    plt.text(0.6, 0.9, r'AUC = {av} $\pm$ {sd}'.format(av=pr_av, sd=pr_sd))
    plt.savefig("%s/user_like_precision_recall_%d.png" % 
                (app.config['STATISTICS_FOLDER'], user.id))
    return 
Ok, that's it. A lot of code in this post, but let me know if you have any comments. I also posted the entire code, including some functions I skipped here, on GitHub.
cheers
Florian

Monday, April 10, 2017

Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) -- finding useful features

In this blog post I will discuss how to use Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) to determine the number of useful features in a feature vector. Whenever tackling a machine learning problem with a large number of available features, PCA is a useful first step, since in many cases reducing the dimensionality of the problem can speed up the project significantly.

Ok, let's assume we have a table of data on properties, with the age of the house, the size in m$^2$, the size in feet$^2$ (for all the Americans out there) and the price:

Age [years]   Size [in m$^{2}$]   Size [in feet$^{2}$]   Price [in US$]
     50              80                   861                1 000 000
     45              95                  1023                  850 000
      5             115                  1238                  400 000
     12             125                  1345                  600 000
     25             105                  1130                  800 000
      2             120                  1292                  300 000
     33             110                  1184                1 000 000
     43              75                   807                1 200 000
     42             145                  1561                  700 000
     21              80                   861                  400 000
     51              80                   861                1 300 000
    110              80                   861                1 000 000
     12              80                   861                  300 000
      9              80                   861                  650 000

Of course this dataset is too small to learn anything useful, but it's big enough for me to show the use of SVD and PCA.

Assume we use the table above to train a machine learning model with the hope that we can predict prices for other houses given the age and size. That would be quite useful to have if you want to buy a house in the same area as the houses in the table, or if you run a real estate business.

Let's use all columns above as features to train our model. We start by preparing the data:
import numpy as np

X = np.matrix([[50., 80., 861., 1000000], 
               [45., 95., 1023., 850000], 
               [5., 115., 1238., 400000],
               [12., 125., 1345., 600000],
               [25., 105., 1130., 800000],
               [2., 120., 1292., 300000],
               [33., 110., 1184., 1000000],
               [43., 75., 807., 1200000],
               [42., 145., 1561., 700000],
               [21., 80., 861., 400000],
               [51., 80., 861., 1300000],
               [110., 80., 861., 1000000],
               [12., 80., 861., 300000],
               [9., 80., 861., 650000]])
# Let's subtract the mean of each column and divide by the range (feature scaling)
for col in range(0, X.shape[1]):
    X[:, col] = (X[:, col] - np.mean(X[:, col]))/(np.max(X[:, col]) - np.min(X[:, col]))
Here we subtracted the mean of each column and divided by the range. This is important because otherwise the price would dominate the variance, since its values are, in absolute terms, much larger than those in the age or size columns.

Ok, now we have our feature matrix $X$ on which we want to train a model. Let's assume the best-fitting model is determined by some maximum likelihood procedure, minimizing the $\chi^2$
\[
\chi^2 = \sum_{ij} (x_i - y_i)C_{ij}^{-1} (x_j - y_j).
\]As you can see, to calculate this $\chi^2$ we need the covariance matrix $C_{ij}$, which is defined as
\[
C_{ij} = \frac{1}{N-1}\sum^N_k(x^k_i - \mu_i)(x^k_j - \mu_j),
\]where $N$ is the number of rows in our table, $x_i^k$ is the $i$-th element in row number $k$ and $\mu_j$ is the mean value of all entries in column $j$ (we use the sample normalization $N-1$, which is also what numpy's cov() uses).

Given that we already transformed our data by subtracting the mean
\[
x_i \rightarrow x_i - \mu_i,
\]we can write the equation for the covariance matrix simply as
\[
C = \frac{1}{N-1}X^TX,
\]where $X$ is the matrix representation of our table above, with $14$ rows and $4$ columns.

The covariance matrix can be calculated quite easily using numpy:
C = np.cov(X, rowvar=False)
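As a quick sanity check (a sketch I added), this should match the explicit formula above, since the columns of X are already mean-centered by the feature scaling:
# np.cov normalizes by N-1, so compare against X^T X / (N-1)
Xa = np.asarray(X)
C_manual = Xa.T.dot(Xa) / (len(Xa) - 1)
print(np.allclose(C, C_manual))  # -> True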
To calculate $\chi^2$ we need the inverse of that covariance matrix. A matrix can only be inverted if it is non-singular, or in other words the rank of the matrix needs to be equal to its dimension:
from numpy.linalg import matrix_rank
print("rank of C = %d, while shape is %s" % (matrix_rank(C), C.shape))
which gives the output

>> rank of C = 3, while shape is (4, 4)

As you might have expected the $4\times 4$ covariance matrix has only rank $3$ and hence cannot be inverted. The reason why this happened is that our original feature data has two columns that provide identical information. The size in m$^2$ and the size in feet$^2$ represent the same property of the house and are $100\%$ correlated.

Another way of saying this is that the rank of a matrix is given by the number of linearly independent columns, or equivalently the number of orthogonal eigenvectors with non-zero eigenvalues. Given that in our matrix two columns are just scaled versions of each other, they only contribute one independent direction.

For that reason it is important that we only include independent features in our data vector. However, usually the picture is not quite as black and white as shown above. What if the correlation between two features is not quite $100\%$ but $99\%$ or $90\%$? Should we include the feature or exclude it? What if we have $10\,000$ features? Do we really want to go through all of them to check whether they are correlated?

This is where Principal Component Analysis (PCA) can help. PCA detects the components with the largest variance in the data and thus concentrates much of the signal into the first few principal components. Reducing your analysis to those components can make life much easier.

This always has to be balanced against the potential loss of information when removing certain components. We can calculate the principal components using Singular Value Decomposition (SVD). Our covariance matrix $C$ is square, with 4 rows and 4 columns, and for a square, symmetric matrix the SVD reduces to an eigenvalue decomposition. It can be shown that $C$ can be written in terms of two other matrices $V$ and $S$ as:
\[
C = VSV^T,
\]where $V$ is a matrix containing the eigenvectors of $C$, and $S$ is a diagonal matrix containing the eigenvalues of $C$. The eigenvectors point along the principal axes and form an orthogonal basis. Projections of the data onto the principal axes are called principal components.

The $j$-th principal component is given by the $j$-th column of $XV$. The SVD of the original data matrix is given by
\[
X = USV^T
and that implies for the covariance matrix
\[
C = \frac{X^TX}{N-1} = \frac{VSU^TUSV^T}{N-1} = \frac{VS^2V^T}{N-1}.
\]The principal components can be calculated as $XV = USV^TV = US$:
# note that np.linalg.svd returns V^T (not V) as its third output
U, s, Vt = np.linalg.svd(X, full_matrices=False)
S = np.diag(s)
print("US = ", U.dot(S))
which gives

>> US =  [[  4.64105174e-01  -4.58797902e-02   1.87296123e-02   4.84869440e-16]
 [  1.29659668e-01  -8.97596094e-02  -3.32400592e-02   7.15735005e-16]
 [ -5.29410729e-01   1.69369612e-01   4.41330595e-03   5.95354396e-16]
 [ -5.86630610e-01  -1.09955994e-01   6.54082083e-02   5.51960575e-16]
 [ -1.21483834e-01  -7.94351556e-02   8.68858184e-02   4.53123083e-16]
 [ -6.67336389e-01   1.94703324e-01  -3.17861692e-02   4.41462214e-16]
 [ -9.16426053e-02  -3.07119687e-01   1.41294505e-01   5.11747418e-16]
 [  6.16479613e-01  -9.51629654e-02   1.88312161e-01   5.36814447e-16]
 [ -7.86901776e-01  -5.34895939e-01  -1.07054260e-01   4.73497220e-16]
 [  1.07328864e-01   4.91088137e-01  -1.09638239e-01   5.12658847e-16]
 [  6.01525587e-01  -2.54710265e-01   1.84835593e-01   4.60299469e-16]
 [  6.46184026e-01  -3.11006408e-01  -4.34257301e-01   4.94555877e-16]
 [  3.52217810e-02   5.98994362e-01  -9.95754561e-02   4.87155668e-16]
 [  1.82901232e-01   3.73770379e-01   1.25672280e-01   4.96024184e-16]]

To reduce the number of components from 4 to 3 we just have to pick the first 3 columns of $U$ and the $3\times 3$ upper left part of $S$. Their product $X_3 = U_3S_3$ is the required $14\times 3$ matrix containing the $3$ principal components with the highest variance:
k = 3
print("k = %d" % k)
print("US_k = ", U[:, 0:k].dot(S[0:k, 0:k]))
>> US_k =  [[ 0.46410517 -0.04587979  0.01872961]
 [ 0.12965967 -0.08975961 -0.03324006]
 [-0.52941073  0.16936961  0.00441331]
 [-0.58663061 -0.10995599  0.06540821]
 [-0.12148383 -0.07943516  0.08688582]
 [-0.66733639  0.19470332 -0.03178617]
 [-0.09164261 -0.30711969  0.14129451]
 [ 0.61647961 -0.09516297  0.18831216]
 [-0.78690178 -0.53489594 -0.10705426]
 [ 0.10732886  0.49108814 -0.10963824]
 [ 0.60152559 -0.25471026  0.18483559]
 [ 0.64618403 -0.31100641 -0.4342573 ]
 [ 0.03522178  0.59899436 -0.09957546]
 [ 0.18290123  0.37377038  0.12567228]]


But how can we now quantify the amount of information lost through the removed components?

$X_k$ is an approximation to $X$, and since $X$ contains all the information we want to keep, the difference between $X$ and $X_k$ should be small. We can estimate the information retained in the first $k$ principal components directly from the singular values returned by the SVD. The singular values measure the weight of the respective principal components (their squares are proportional to the component variances), so we can calculate
\[
\frac{\sum^k_i s_{ii}}{\sum^N_i s_{ii}},
\]which gives a measure of the variance retained in the first $k$ principal components.
print("the retained variance is = ", sum(s[:k])/sum(s))
and with $k=3$ we find

>> the retained variance is =  1.0

In our case we do not lose any information, since the component we removed did not carry any independent information. If we were to exclude the two smallest principal components ($k=2$) we would get
k = 2
print("k = %d" % k)
print("US_k = ", U[:, 0:k].dot(S[0:k, 0:k]))
print("the retained variance is = ", sum(s[:k])/sum(s))
>> k = 2
US_k =  [[ 0.46410517 -0.04587979]
 [ 0.12965967 -0.08975961]
 [-0.52941073  0.16936961]
 [-0.58663061 -0.10995599]
 [-0.12148383 -0.07943516]
 [-0.66733639  0.19470332]
 [-0.09164261 -0.30711969]
 [ 0.61647961 -0.09516297]
 [-0.78690178 -0.53489594]
 [ 0.10732886  0.49108814]
 [ 0.60152559 -0.25471026]
 [ 0.64618403 -0.31100641]
 [ 0.03522178  0.59899436]
 [ 0.18290123  0.37377038]]
the retained variance is =  0.834720270224

and hence we would lose a significant amount of information.

It makes sense to remove features if they add only a small amount of information, but generally you want to keep this fraction high, e.g. $0.9$, which means that the $k$ highest principal components you keep should contain $90\%$ of the information.
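In code, one could pick the smallest $k$ that retains a target fraction, for example (a sketch using the singular values s from the SVD above and the same ratio as before):
# cumulative retained fraction for k = 1, 2, ...
target = 0.9
retained = np.cumsum(s) / np.sum(s)
# index of the first k whose retained fraction reaches the target
k_min = int(np.argmax(retained >= target)) + 1
print("smallest k retaining at least %d%% of the information: %d" % (int(100*target), k_min))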

I posted the entire Python code on Github. Let me know if you have comments/questions below.
cheers
Florian