Ok, let's assume we have a table of data on properties, with the age of the house, the size in m$^2$, the size in feet$^2$ (for all the Americans out there) and the price:
| Age [years] | Size [in m$^{2}$] | Size [in feet$^{2}$] | Price [in US$] |
|---|---|---|---|
| 50 | 80 | 861 | 1 000 000 |
| 45 | 95 | 1023 | 850 000 |
| 5 | 115 | 1238 | 400 000 |
| 12 | 125 | 1345 | 600 000 |
| 25 | 105 | 1130 | 800 000 |
| 2 | 120 | 1292 | 300 000 |
| 33 | 110 | 1184 | 1 000 000 |
| 43 | 75 | 807 | 1 200 000 |
| 42 | 145 | 1561 | 700 000 |
| 21 | 80 | 861 | 400 000 |
| 51 | 80 | 861 | 1 300 000 |
| 110 | 80 | 861 | 1 000 000 |
| 12 | 80 | 861 | 300 000 |
| 9 | 80 | 861 | 650 000 |
Of course this dataset is too small to learn anything useful, but it's big enough for me to show the use of SVD and PCA.
Assume we use the table above to train a machine learning model with the hope that we can predict prices for other houses given the age and size. That would be quite useful to have if you want to buy a house in the same area as the houses in the table, or if you run a real estate business.
Let's use all columns above as features to train our model. We start preparing the data:
import numpy as np

X = np.matrix([[50., 80., 861., 1000000], [45., 95., 1023., 850000],
               [5., 115., 1238., 400000], [12., 125., 1345., 600000],
               [25., 105., 1130., 800000], [2., 120., 1292., 300000],
               [33., 110., 1184., 1000000], [43., 75., 807., 1200000],
               [42., 145., 1561., 700000], [21., 80., 861., 400000],
               [51., 80., 861., 1300000], [110., 80., 861., 1000000],
               [12., 80., 861., 300000], [9., 80., 861., 650000]])

# Let's subtract the mean of each column and divide by the range (feature scaling)
for col in range(0, X.shape[1]):
    X[:, col] = (X[:, col] - np.mean(X[:, col])) / (np.max(X[:, col]) - np.min(X[:, col]))
Ok, now we have our feature matrix $X$ with which we want to train our model. Let's assume the best-fitting model is determined by some maximum likelihood procedure, minimising the $\chi^2$
\[
\chi^2 = \sum_{ij} (x_i - y_i)C_{ij}^{-1} (x_j - y_j).
\]where $y_i$ denotes the corresponding model prediction. As you can see, to calculate this $\chi^2$ we need to invert the covariance matrix $C_{ij}$, which is defined as
\[
C_{ij} = \frac{1}{N}\sum^N_k(x^k_i - \mu_i)(x^k_j - \mu_j),
\]where $N$ is the number of rows in our table, $x_i^k$ is the $i$-th element in row number $k$ and $\mu_j$ is the mean value of all entries in column $j$.
Given that we already transformed our data by subtracting the mean
\[
x_i \rightarrow x_i - \mu_i,
\]we can write the equation for the covariance matrix simply as
\[
C = \frac{1}{N}X^TX,
\]where $X$ is the matrix representation of our table above, with $14$ rows and $4$ columns. (Strictly speaking, numpy's np.cov below uses the unbiased normalisation $1/(N-1)$ instead of $1/N$, but that difference matters neither for the rank nor for the eigenvectors.)
The covariance matrix can be calculated quite easily using numpy:
C = np.cov(X, rowvar=False)

from numpy.linalg import matrix_rank
print("rank of C = %d, while shape is %s" % (matrix_rank(C), C.shape))
>> rank of C = 3, while shape is (4, 4)
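As a quick cross-check, the same matrix follows from the matrix formula above (keeping in mind the $1/(N-1)$ normalisation used by np.cov):
# Cross-check: for mean-subtracted data np.cov(X, rowvar=False) equals X^T X / (N - 1)
N = X.shape[0]
C_manual = X.T.dot(X) / (N - 1)
print("np.cov matches X^T X/(N-1):", np.allclose(C, C_manual))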
As you might have expected the $4\times 4$ covariance matrix has only rank $3$ and hence cannot be inverted. The reason why this happened is that our original feature data has two columns that provide identical information. The size in m$^2$ and the size in feet$^2$ represent the same property of the house and are $100\%$ correlated.
Another way of saying this is that the rank of a matrix equals the number of linearly independent columns or, for a symmetric matrix like $C$, the number of non-zero eigenvalues. Given that two of our columns are just scaled versions of each other, one eigenvalue of $C$ vanishes.
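We can make this explicit by looking at the eigenvalues of $C$: one of them is numerically zero.
# The rank deficiency shows up as one (numerically) zero eigenvalue of C
eigenvalues = np.linalg.eigvalsh(C)   # eigenvalues in ascending order
print("eigenvalues of C = ", eigenvalues)
print("number of non-zero eigenvalues = %d" % np.sum(eigenvalues > 1e-12))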
For that reason it is important that we only include independent features in our data vector. However, usually the picture is not quite as black and white as shown above. What if the correlation between two features is not quite $100\%$ but $99\%$ or $90\%$? Should we include the feature or exclude it? What if we have $10\,000$ features? Do we really want to go through all of them to check whether they are correlated?
This is where Principal Component Analysis (PCA) can help. PCA detects the directions with the largest variance in the data and thereby concentrates much of the signal into the first few principal components. Reducing your analysis to those components can make life much easier.
This always has to be balanced against the potential loss of information when removing certain components. We can calculate the principal components using Singular Value Decomposition (SVD). Our covariance matrix $C$ is square and symmetric, with 4 rows and 4 columns; for such a matrix the SVD coincides with the eigenvalue decomposition. It can be shown that $C$ can be represented by two other matrices $V$ and $S$ as:
\[
C = VSV^T,
\]where $V$ is a matrix whose columns are the eigenvectors of $C$, and $S$ is a diagonal matrix containing the eigenvalues of $C$. The eigenvectors define the directions of the principal axes and form an orthogonal basis. Projections of the data onto the principal axes are called principal components.
The $j$-th principal component is given by the $j$-th column of $XV$. The SVD of the original data matrix is given by
\[
X = USV^T
\]and that implies for the covariance matrix
\[
C = \frac{X^TX}{N-1} = \frac{VSU^TUSV^T}{N-1} = \frac{VS^2V^T}{N-1},
\]since $U^TU$ is the identity matrix. Note that the $S$ in the SVD of $X$ contains the singular values, whose squares divided by $N-1$ are the eigenvalues of $C$. The principal components can be calculated as $XV = USV^TV = US$:
U, s, Vt = np.linalg.svd(X, full_matrices=False)  # note: numpy returns V transposed (V^T)
S = np.diag(s)
print("US = ", U.dot(S))
>> US = [[ 4.64105174e-01 -4.58797902e-02 1.87296123e-02 4.84869440e-16]
[ 1.29659668e-01 -8.97596094e-02 -3.32400592e-02 7.15735005e-16]
[ -5.29410729e-01 1.69369612e-01 4.41330595e-03 5.95354396e-16]
[ -5.86630610e-01 -1.09955994e-01 6.54082083e-02 5.51960575e-16]
[ -1.21483834e-01 -7.94351556e-02 8.68858184e-02 4.53123083e-16]
[ -6.67336389e-01 1.94703324e-01 -3.17861692e-02 4.41462214e-16]
[ -9.16426053e-02 -3.07119687e-01 1.41294505e-01 5.11747418e-16]
[ 6.16479613e-01 -9.51629654e-02 1.88312161e-01 5.36814447e-16]
[ -7.86901776e-01 -5.34895939e-01 -1.07054260e-01 4.73497220e-16]
[ 1.07328864e-01 4.91088137e-01 -1.09638239e-01 5.12658847e-16]
[ 6.01525587e-01 -2.54710265e-01 1.84835593e-01 4.60299469e-16]
[ 6.46184026e-01 -3.11006408e-01 -4.34257301e-01 4.94555877e-16]
[ 3.52217810e-02 5.98994362e-01 -9.95754561e-02 4.87155668e-16]
[ 1.82901232e-01 3.73770379e-01 1.25672280e-01 4.96024184e-16]]
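As a cross-check, projecting the data onto the eigenvectors of $C$ gives the same principal components, up to the sign of each column:
# Cross-check: project the data onto the eigenvectors of C.
# np.linalg.eigh returns eigenvalues in ascending order, so reverse the columns
# of the eigenvector matrix to get the largest-variance directions first
w, V_c = np.linalg.eigh(C)
V_c = V_c[:, ::-1]
print("XV matches US up to sign:", np.allclose(np.abs(X.dot(V_c)), np.abs(U.dot(S))))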
To reduce the number of components from 4 to 3 we just have to pick the first 3 columns of $U$ and the upper-left $3\times 3$ part of $S$. Their product $X_3 = U_3S_3$ is the required $14\times 3$ matrix containing the $3$ principal components with the largest variance:
k = 3
print("k = %d" % k)
print("US_k = ", U[:, 0:k].dot(S[0:k, 0:k]))
>> US_k = [[ 0.46410517 -0.04587979 0.01872961]
[ 0.12965967 -0.08975961 -0.03324006]
[-0.52941073 0.16936961 0.00441331]
[-0.58663061 -0.10995599 0.06540821]
[-0.12148383 -0.07943516 0.08688582]
[-0.66733639 0.19470332 -0.03178617]
[-0.09164261 -0.30711969 0.14129451]
[ 0.61647961 -0.09516297 0.18831216]
[-0.78690178 -0.53489594 -0.10705426]
[ 0.10732886 0.49108814 -0.10963824]
[ 0.60152559 -0.25471026 0.18483559]
[ 0.64618403 -0.31100641 -0.4342573 ]
[ 0.03522178 0.59899436 -0.09957546]
[ 0.18290123 0.37377038 0.12567228]]
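If you have scikit-learn installed, you can verify this against its PCA implementation, which also uses an SVD internally; the scores should agree with $U_3S_3$ up to the sign of each column:
from sklearn.decomposition import PCA

# Cross-check with scikit-learn: PCA centres the data (ours already is) and
# computes the component scores via an SVD, so they match US_k up to sign
pca = PCA(n_components=3)
scores = pca.fit_transform(np.asarray(X))
print("matches US_k up to sign:", np.allclose(np.abs(scores), np.abs(U[:, 0:3].dot(S[0:3, 0:3]))))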
But how can we quantify the amount of information lost by removing components?
The eigenvalues of the covariance matrix, i.e. the squared singular values, give the variances of the respective principal components, and hence we can calculate
\[
\frac{\sum^k_{i} s_{ii}^2}{\sum_{i} s_{ii}^2},
\]which gives the fraction of the variance retained in the first $k$ principal components (the normalisation $1/(N-1)$ cancels in the ratio).
print("the retained variance is = ", sum(s[:k])/sum(s))
>> the retained variance is = 1.0
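Indeed, the full (scaled) data matrix can be recovered from the first three components alone:
# Sanity check: reconstruct X from only the first 3 components; since the 4th
# singular value is numerically zero, nothing is lost
X_3 = U[:, 0:3].dot(S[0:3, 0:3])
print("X recovered from 3 components:", np.allclose(X, X_3.dot(Vt[0:3, :])))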
In our case we do not lose any information, since the component we dropped carries no variance at all. If we were to exclude the two smallest principal components ($k=2$) we would get
k = 2
print("k = %d" % k)
print("US_k = ", U[:, 0:k].dot(S[0:k, 0:k]))
print("the retained variance is = ", sum(s[:k]**2)/sum(s**2))
>> k = 2
US_k = [[ 0.46410517 -0.04587979]
[ 0.12965967 -0.08975961]
[-0.52941073 0.16936961]
[-0.58663061 -0.10995599]
[-0.12148383 -0.07943516]
[-0.66733639 0.19470332]
[-0.09164261 -0.30711969]
[ 0.61647961 -0.09516297]
[-0.78690178 -0.53489594]
[ 0.10732886 0.49108814]
[ 0.60152559 -0.25471026]
[ 0.64618403 -0.31100641]
[ 0.03522178 0.59899436]
[ 0.18290123 0.37377038]]
and hence we would lose a significant amount of information.
It makes sense to remove components if they only add a small amount of information, but generally you want to keep this fraction high, e.g. $0.9$, which means that the $k$ principal components you keep should contain $90\%$ of the variance.
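In practice you can let the code pick $k$ for you, by taking the smallest $k$ whose cumulative retained variance exceeds your chosen threshold. A minimal sketch:
# Choose the smallest k whose components retain at least `threshold` of the
# total variance (the variances are proportional to the squared singular values)
threshold = 0.9
explained = s**2 / np.sum(s**2)
k_min = int(np.searchsorted(np.cumsum(explained), threshold)) + 1
print("smallest k retaining at least %.0f%% of the variance: %d" % (100*threshold, k_min))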
I posted the entire Python code on Github. Let me know if you have comments/questions below.
cheers
Florian