Monday, April 10, 2017

Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) -- finding useful features

In this blog post I will discuss how to use Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) to determine the number of useful features in a feature vector. Whenever you tackle a machine learning problem with a large number of available features, PCA is a good first step, since in many cases reducing the dimensionality of the problem can speed up the project significantly.

Ok, let's assume we have a table of data on properties, with the age of the house, the size in m$^2$, the size in feet$^2$ (for all the Americans out there) and the price:

Age [years]   size [in m$^{2}$]   size [in feet$^{2}$]   price [in US$]
     50              80                  861               1 000 000
     45              95                 1023                 850 000
      5             115                 1238                 400 000
     12             125                 1345                 600 000
     25             105                 1130                 800 000
      2             120                 1292                 300 000
     33             110                 1184               1 000 000
     43              75                  807               1 200 000
     42             145                 1561                 700 000
     21              80                  861                 400 000
     51              80                  861               1 300 000
    110              80                  861               1 000 000
     12              80                  861                 300 000
      9              80                  861                 650 000

Of course this dataset is too small to learn anything useful, but it's big enough for me to show the use of SVD and PCA.

Assume we use the table above to train a machine learning model with the hope that we can predict prices for other houses given the age and size. That would be quite useful to have if you want to buy a house in the same area as the houses in the table, or if you run a real estate business.

Let's use all columns above as features to train our model. We start preparing the data:
import numpy as np

X = np.matrix([[50., 80., 861., 1000000], 
               [45., 95., 1023., 850000], 
               [5., 115., 1238., 400000],
               [12., 125., 1345., 600000],
               [25., 105., 1130., 800000],
               [2., 120., 1292., 300000],
               [33., 110., 1184., 1000000],
               [43., 75., 807., 1200000],
               [42., 145., 1561., 700000],
               [21., 80., 861., 400000],
               [51., 80., 861., 1300000],
               [110., 80., 861., 1000000],
               [12., 80., 861., 300000],
               [9., 80., 861., 650000]])
# Let's subtract the mean of each column and divide by the range (feature scaling)
for col in range(0,X.shape[1]):
  X[:,col] = ( X[:,col] - np.mean(X[:,col]) )/( np.max(X[:,col]) - np.min(X[:,col]) )
Here we subtracted the mean of each column and divided by the range. This step is important: without it the price would dominate the variance, since its values are in absolute terms much larger than those in the age or size columns.
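As a side note, the loop above can also be written as a single vectorized expression. A small sketch (applied to the raw, unscaled matrix it gives the same result as the loop; I use a new name here to avoid scaling X twice):
# Vectorized feature scaling: subtract the column means and divide by the column ranges
# (equivalent to the explicit loop when applied to the unscaled matrix)
X_scaled = (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))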

OK, now we have the feature matrix X with which we want to train our model. Let's assume the best-fitting model is determined by some maximum likelihood procedure, minimising the $\chi^2$
\[
\chi^2 = \sum_{ij} (x_i - y_i)C_{ij}^{-1} (x_j - y_j).
\]As you can see, to calculate this $\chi^2$ we need the covariance matrix $C_{ij}$, which is defined as
\[
C_{ij} = \frac{1}{N-1}\sum^N_k(x^k_i - \mu_i)(x^k_j - \mu_j),
\]where $N$ is the number of rows in our table, $x^k_i$ is the $i$-th element in row number $k$ and $\mu_j$ is the mean value of all entries in column $j$ (the $1/(N-1)$ normalisation is the unbiased sample covariance, which is also what numpy's np.cov uses by default).

Given that we already transformed our data by subtracting the mean
\[
x_i \rightarrow x_i - \mu_i,
\]we can write the equation for the covariance matrix simply as
\[
C = \frac{1}{N-1}X^TX,
\]where $X$ is the matrix representation of our table above, with $14$ rows and $4$ columns.

The covariance matrix can be calculated quite easily using numpy:
C = np.cov(X, rowvar=False)
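As a quick cross-check (a small sketch using the scaled matrix X from above), the formula $C = X^TX/(N-1)$ gives the same result, since the columns of X already have zero mean and np.cov uses the $1/(N-1)$ normalisation by default:
N = X.shape[0]
C_manual = X.T.dot(X) / (N - 1)   # covariance from the formula above (columns of X have zero mean)
print(np.allclose(C_manual, C))   # should print True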
To calculate $\chi^2$ we need the inverse of that covariance matrix. A matrix can only be inverted if it is non-singular, or in other words the rank of the matrix needs to be equal to its dimension:
from numpy.linalg import matrix_rank
print "rank of C = %d, while shape is %s" % (matrix_rank(C), C.shape)
which gives the output

>> rank of C = 3, while shape is (4, 4)

As you might have expected, the $4\times 4$ covariance matrix only has rank $3$ and hence cannot be inverted. The reason is that our original feature data contains two columns that provide identical information: the size in m$^2$ and the size in feet$^2$ describe the same property of the house and are $100\%$ correlated.
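We can check this directly. A minimal sketch (using the scaled feature matrix X from above; the scaling does not change correlations):
# Pearson correlation between the two size columns (columns 1 and 2 of X)
size_m2 = np.asarray(X[:, 1]).ravel()
size_ft2 = np.asarray(X[:, 2]).ravel()
print(np.corrcoef(size_m2, size_ft2)[0, 1])   # prints (essentially) 1.0: both columns carry the same information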

Another way of saying this is that the rank of a matrix equals the number of non-zero eigenvalues (or, more generally, non-zero singular values) one can find for it. Given that two columns of our matrix are just scaled versions of each other, they are linearly dependent and contribute only one independent direction, so one eigenvalue of the covariance matrix is zero.

For that reason it is important that we only include independent features in our data vector. However, usually the picture is not quite as black and white as shown above. What if the correlation between two features is not quite $100\%$ but $99\%$ or $90\%$? Should we include the feature or exclude it? What if we have $10\,000$ features? Do we really want to go through all of them to check whether they are correlated?

This is where Principal Component Analysis (PCA) can help. PCA detects the directions with the largest variance in the data and thereby concentrates much of the signal into the first few principal components. Reducing your analysis to those components can make life much easier.

This always has to be balanced against the potential loss of information when removing certain components. We can calculate the principal components using Singular Value Decomposition (SVD). Our covariance matrix $C$ is square, with 4 rows and 4 columns, and symmetric; for a symmetric, positive semi-definite matrix like $C$ the SVD coincides with the eigenvalue decomposition. It can be shown that $C$ can be written in terms of two other matrices $V$ and $S$ as:
\[
C = VSV^T,
\]where $V$ is a matrix containing the eigenvectors of $C$, and $S$ is a diagonal matrix containing the eigenvalues of $C$. The eigenvectors describe the directions of the principal axes and form an orthogonal basis. Projections of the data onto these axes are called principal components.
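In numpy this decomposition of the symmetric matrix $C$ is available directly. A small sketch (using the covariance matrix C from above and np.linalg.eigh, which is intended for symmetric matrices):
# Eigenvalue decomposition of the symmetric covariance matrix C
eigenvalues, eigenvectors = np.linalg.eigh(C)   # eigenvalues are returned in ascending order
print(eigenvalues)   # the smallest eigenvalue is (essentially) zero, reflecting the redundant size column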

The $j$-th principal component is given by the $j$-th column of $XV$. The SVD of the original data matrix is given by
\[
X = USV^T
\]and that implies for the covariance matrix
\[
C = \frac{X^TX}{N-1} = \frac{VSU^TUSV^T}{N-1} = \frac{VS^2V^T}{N-1}.
\]The principal components can therefore be calculated as $XV = USV^TV = US$:
U, s, V = np.linalg.svd(X, full_matrices=False)   # note: the V returned by np.linalg.svd is actually V^T
S = np.diag(s)
print("US = ", U.dot(S))
which gives

>> US =  [[  4.64105174e-01  -4.58797902e-02   1.87296123e-02   4.84869440e-16]
 [  1.29659668e-01  -8.97596094e-02  -3.32400592e-02   7.15735005e-16]
 [ -5.29410729e-01   1.69369612e-01   4.41330595e-03   5.95354396e-16]
 [ -5.86630610e-01  -1.09955994e-01   6.54082083e-02   5.51960575e-16]
 [ -1.21483834e-01  -7.94351556e-02   8.68858184e-02   4.53123083e-16]
 [ -6.67336389e-01   1.94703324e-01  -3.17861692e-02   4.41462214e-16]
 [ -9.16426053e-02  -3.07119687e-01   1.41294505e-01   5.11747418e-16]
 [  6.16479613e-01  -9.51629654e-02   1.88312161e-01   5.36814447e-16]
 [ -7.86901776e-01  -5.34895939e-01  -1.07054260e-01   4.73497220e-16]
 [  1.07328864e-01   4.91088137e-01  -1.09638239e-01   5.12658847e-16]
 [  6.01525587e-01  -2.54710265e-01   1.84835593e-01   4.60299469e-16]
 [  6.46184026e-01  -3.11006408e-01  -4.34257301e-01   4.94555877e-16]
 [  3.52217810e-02   5.98994362e-01  -9.95754561e-02   4.87155668e-16]
 [  1.82901232e-01   3.73770379e-01   1.25672280e-01   4.96024184e-16]]
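One thing to keep in mind: np.linalg.svd returns $V^T$ (stored here in the variable V), not $V$ itself. A quick sketch to verify the identity $XV = US$ with the variables from the code above:
# V holds V^T, so V.T is the matrix of right singular vectors
print(np.allclose(X.dot(V.T), U.dot(S)))   # should print True: the principal components XV equal US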

To reduce the number of components from 4 to 3 we just have to pick the first 3 columns of $U$ and the $3\times 3$ upper left part of $S$. Their product $X_3 = U_3S_3$ is the required $14\times 3$ matrix containing the $3$ principal components with the highest variance:
k = 3
print "k = %d" % k
print "US_k = ", U[:, 0:k].dot(S[0:k, 0:k])
>> US_k =  [[ 0.46410517 -0.04587979  0.01872961]
 [ 0.12965967 -0.08975961 -0.03324006]
 [-0.52941073  0.16936961  0.00441331]
 [-0.58663061 -0.10995599  0.06540821]
 [-0.12148383 -0.07943516  0.08688582]
 [-0.66733639  0.19470332 -0.03178617]
 [-0.09164261 -0.30711969  0.14129451]
 [ 0.61647961 -0.09516297  0.18831216]
 [-0.78690178 -0.53489594 -0.10705426]
 [ 0.10732886  0.49108814 -0.10963824]
 [ 0.60152559 -0.25471026  0.18483559]
 [ 0.64618403 -0.31100641 -0.4342573 ]
 [ 0.03522178  0.59899436 -0.09957546]
 [ 0.18290123  0.37377038  0.12567228]]


But how can we now quantify the amount of information loss due to the removed components?


$X_k = U_kS_kV_k^T$, the data reconstructed from only the first $k$ components, is an approximation to $X$, and since $X$ contains all the information we want to keep, the difference between $X$ and $X_k$ should be small.
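One direct way to see how good this approximation is would be to reconstruct $X$ from the first $k$ components and look at the size of the residual. A small sketch (using the SVD variables from above; remember that the variable V holds $V^T$):
# Reconstruct X from the first k components and measure the residual (Frobenius norm)
k = 3
X_k = U[:, 0:k].dot(S[0:k, 0:k]).dot(V[0:k, :])
print(np.linalg.norm(X - X_k))   # essentially zero for k = 3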

More conveniently, we can read the retained information off the singular values returned by the SVD. The variance of the $i$-th principal component is $s_{ii}^2/(N-1)$, so the exact variance fraction kept by the first $k$ components is $\sum^k_i s^2_{ii}/\sum^N_i s^2_{ii}$; a simpler, commonly used measure of the retained signal is the ratio of the singular values themselves,
\[
\frac{\sum^k_i s_{ii}}{\sum^N_i s_{ii}},
\]which is the quantity computed here:
print("the retained variance is = ", sum(s[:k])/sum(s))
and with $k=3$ we find

>> the retained variance is =  1.0

In our case we do not lose any information, since the component we removed did not carry any information to begin with. If we were to exclude the two smallest principal components ($k=2$) we would get
k = 2
print("k = %d" % k)
print("US_k = ", U[:, 0:k].dot(S[0:k, 0:k]))
print("the retained variance is = ", sum(s[:k])/sum(s))
>> k = 2
US_k =  [[ 0.46410517 -0.04587979]
 [ 0.12965967 -0.08975961]
 [-0.52941073  0.16936961]
 [-0.58663061 -0.10995599]
 [-0.12148383 -0.07943516]
 [-0.66733639  0.19470332]
 [-0.09164261 -0.30711969]
 [ 0.61647961 -0.09516297]
 [-0.78690178 -0.53489594]
 [ 0.10732886  0.49108814]
 [ 0.60152559 -0.25471026]
 [ 0.64618403 -0.31100641]
 [ 0.03522178  0.59899436]
 [ 0.18290123  0.37377038]]
the retained variance is =  0.834720270224

and hence we would lose a significant amount of information.

It makes sense to remove components if they add only a small amount of information, but generally you want to keep this fraction high, e.g. $0.9$, meaning that the $k$ largest principal components you keep should contain $90\%$ of the information.
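If you want to automate that choice, here is a small sketch (using the singular values s from above and the same singular-value ratio as in the code; the $0.9$ threshold is just an example):
# Pick the smallest k that retains at least 90% of the (singular value) spectrum
threshold = 0.9
retained = np.cumsum(s) / np.sum(s)                    # cumulative retained fraction for k = 1, 2, ...
k_min = int(np.searchsorted(retained, threshold)) + 1
print("smallest k retaining %d%% of the spectrum: %d" % (threshold * 100, k_min))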

I posted the entire Python code on GitHub. Let me know if you have comments/questions below.
cheers
Florian