In the real world, we analyze highly complex data, i.e. multi-dimensional data.
Sometimes we plot the data to find patterns in it, or use it to train machine learning models. One way to think about dimensions: suppose you have a data point x. If we treat this data point as a physical object, then the dimensions are merely bases of view, such as where the data is located when it is observed along the horizontal axis or the vertical axis.
As the number of dimensions increases, the data becomes harder to visualize, and the cost of performing computations on it rises drastically.
So, what is the best way to reduce the dimensions of the data?
- Remove the redundant dimensions
- Only keep the most important dimensions
Hence, before we start working with PCA (Principal Component Analysis), let us clarify some basic concepts.
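- Variance: It is a measure of how spread out the data set is.
Mathematically, it is the average squared deviation from the mean. For n values $x_1, \dots, x_n$ with mean $\bar{x}$, we use the following formula to compute the variance var(x):
$$\mathrm{var}(x) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$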
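- Covariance: It is a measure of the extent to which corresponding elements from two sets of ordered data move in the same direction.
$$\mathrm{cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$
Here, $x_i$ and $y_i$ are the i-th values of x and y, n is the number of paired values, and $\bar{x}$ and $\bar{y}$ denote the corresponding mean values.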
A positive covariance means X and Y are positively related to each other, i.e. as X increases, Y also tends to increase. A negative covariance depicts the opposite relation. Zero covariance means X and Y have no linear relationship; they may still be related in a non-linear way.
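To make the two definitions concrete, here is a minimal sketch in plain Python. The function names and the toy data are hypothetical, and both functions divide by n (the "average" form described above). The last case is chosen so that y is fully determined by x and yet the covariance is zero, because covariance only measures linear association.

```python
def mean(values):
    return sum(values) / len(values)

def variance(x):
    """Average squared deviation from the mean (divides by n)."""
    x_bar = mean(x)
    return sum((xi - x_bar) ** 2 for xi in x) / len(x)

def covariance(x, y):
    """Average product of paired deviations from the means (divides by n)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    x_bar, y_bar = mean(x), mean(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / len(x)

# Hypothetical toy data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = [2.0, 4.0, 6.0, 8.0, 10.0]     # rises as x rises
y_down = [10.0, 8.0, 6.0, 4.0, 2.0]   # falls as x rises
y_sq = [4.0, 1.0, 0.0, 1.0, 4.0]      # (x - 3)^2: depends on x, but not linearly

print(variance(x))            # 2.0
print(covariance(x, y_up))    # 4.0
print(covariance(x, y_down))  # -4.0
print(covariance(x, y_sq))    # 0.0
```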
What is Principal Component Analysis?
Now let's think about what data analysis requires. We try to find patterns in the data, so we want the data to be spread out along each dimension. We also want the dimensions to be uncorrelated with each other. So if the data has high covariance among some n dimensions, we replace those dimensions with a linear combination of them. The data then depends only on that linear combination of the n related dimensions.
PCA is a technique that finds a new set of dimensions (or a set of bases of view) such that all the dimensions are orthogonal (perpendicular to each other, and hence linearly independent) and ranked according to the variance of the data along them.
This means the more important principal axes come first.
(high importance = high variance/more spread out data)
Working of PCA
- Calculate the covariance matrix of the data points X.
- Calculate the eigenvectors and corresponding eigenvalues of the covariance matrix.
- Sort the eigenvectors according to their eigenvalues in decreasing order.
- Choose the first k eigenvectors; these will be the new k dimensions.
- Transform the original n-dimensional data points into k dimensions.
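The five steps above can be sketched with NumPy roughly as follows. This is an illustrative sketch, not a reference implementation: the function name `pca`, the toy data, and the layout (rows are dimensions, columns are data points) are assumptions, and the covariance matrix is normalized by the number of points.

```python
import numpy as np

def pca(X, k):
    """Project the columns of X (dimensions x points) onto the top-k principal axes."""
    # Step 1: covariance matrix of the mean-centred data.
    X_centered = X - X.mean(axis=1, keepdims=True)
    n_points = X.shape[1]
    cov = X_centered @ X_centered.T / n_points

    # Step 2: eigenvalues and eigenvectors. eigh is intended for symmetric
    # matrices and returns the eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # Step 3: sort by eigenvalue in decreasing order.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]

    # Step 4: keep the first k eigenvectors (stored as columns).
    components = eigenvectors[:, :k]

    # Step 5: transform the centred data points into k dimensions.
    X_reduced = components.T @ X_centered
    return X_reduced, components, eigenvalues

# Hypothetical toy data: 3 dimensions, 200 points, where the second
# dimension is almost a multiple of the first (i.e. largely redundant).
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = np.vstack([a, 2 * a + 0.1 * rng.normal(size=200), rng.normal(size=200)])

X_reduced, components, eigenvalues = pca(X, k=2)
print(X_reduced.shape)   # (2, 200)
print(eigenvalues)       # variance along each principal axis, largest first
```

With this toy data the smallest eigenvalue should come out close to zero, which reflects the nearly redundant second dimension.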
We'll see some principal component analysis examples below. To get a better understanding of eigenvalues and eigenvectors, kindly look at the video below.
Assuming we have the knowledge of variance and covariance, let’s look into what a Covariance matrix is.
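Consider the covariance matrix of a data set in 4 dimensions a, b, c, d:
$$\begin{pmatrix} V_a & C_{a,b} & C_{a,c} & C_{a,d} \\ C_{a,b} & V_b & C_{b,c} & C_{b,d} \\ C_{a,c} & C_{b,c} & V_c & C_{c,d} \\ C_{a,d} & C_{b,d} & C_{c,d} & V_d \end{pmatrix}$$
$V_a$ : variance along dimension a
$C_{a,b}$ : covariance between dimensions a and b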
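If we have a matrix X of dimension m×n that holds n data points of m dimensions, then the covariance matrix can be calculated as
$$C_X = \frac{1}{n}(X - \bar{X})(X - \bar{X})^T$$
where $\bar{X}$ holds the mean of each dimension (each row of X), repeated across the n columns. The result is an m×m matrix. Dividing by n − 1 instead of n gives the unbiased sample estimate.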
It is important to note that the covariance matrix contains:
- The variance of dimensions as the main diagonal elements.
- The covariance of dimensions as the off-diagonal elements.
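As a small illustration of this structure, the sketch below builds the covariance matrix for a hypothetical data set with m = 2 dimensions (rows) and n = 5 data points (columns), dividing by n, and then checks what sits on and off the main diagonal.

```python
import numpy as np

# Hypothetical data: m = 2 dimensions (rows), n = 5 data points (columns).
X = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 10.0],
])

n = X.shape[1]
X_bar = X.mean(axis=1, keepdims=True)   # mean of each dimension
C = (X - X_bar) @ (X - X_bar).T / n     # m x m covariance matrix

print(C)
# Main diagonal: the variance of each dimension.
print(np.diag(C), X.var(axis=1))
# Off-diagonal: the covariance between the two dimensions (C is symmetric).
print(C[0, 1], C[1, 0])
```

NumPy's `np.cov(X, bias=True)` computes the same matrix; by default `np.cov` divides by n − 1 instead of n.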
As discussed earlier, we want the data to be spread out, i.e. it should have high variance along each dimension. We also want to remove correlated dimensions, i.e. the covariance between dimensions should be zero (they should be uncorrelated). Therefore, our covariance matrix should have:
- Large numbers as the main diagonal elements.
- Zero values as the off-diagonal elements.
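A matrix of this form is called a diagonal matrix, so PCA looks for new dimensions in which the covariance matrix of the data is diagonal.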
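One way to see this in practice is to re-express some correlated data along the eigenvectors of its covariance matrix and compute the covariance matrix again. The sketch below does that with NumPy; the helper `covariance_matrix` and the toy data are hypothetical, with rows as dimensions and columns as data points.

```python
import numpy as np

def covariance_matrix(data):
    """Covariance matrix of data laid out as dimensions x points (divides by n)."""
    centered = data - data.mean(axis=1, keepdims=True)
    return centered @ centered.T / data.shape[1]

# Hypothetical correlated 2-D data.
rng = np.random.default_rng(42)
a = rng.normal(size=500)
X = np.vstack([a, 0.8 * a + 0.3 * rng.normal(size=500)])

C = covariance_matrix(X)
print(C)   # the off-diagonal entries are clearly non-zero

# Re-express the centred data along the eigenvectors of C.
eigenvalues, eigenvectors = np.linalg.eigh(C)
Y = eigenvectors.T @ (X - X.mean(axis=1, keepdims=True))

C_new = covariance_matrix(Y)
print(np.round(C_new, 6))   # the off-diagonal entries vanish up to rounding
```

The diagonal of the new matrix holds the eigenvalues, i.e. the variance of the data along each new dimension.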
Hence, the goals of PCA are to:
- Find linearly independent dimensions (or bases of view) that can represent the data points without loss.
- Ensure that those newly found dimensions allow us to predict/reconstruct the original dimensions, with the reconstruction/projection error minimized.
Let's try to understand what I mean by projection error. Suppose we have to transform a 2-dimensional representation of data points into a 1-dimensional representation. We will basically try to find a straight line and project the data points onto it.
There are many ways to choose such a straight line. Let's look at two such possibilities:
Hence, we can conclude that the magenta line will be our new dimension.
The red lines connect each blue point to its projection on the magenta line. The length of each red line, i.e. the perpendicular distance of the data point from the straight line, is the projection error of that point. The sum of the errors of all data points is the total projection error in our case.
Our new data points will be the projections (red points) of the original blue data points. As we can see, we have transformed 2-dimensional data points into 1-dimensional data points by projecting them onto a 1-dimensional space, i.e. a straight line.
That magenta straight line is called the principal axis. Since we are projecting to a single dimension, we have only one principal axis.
The benefits of the principal axis are:
- The projection error is less than that in the first case.
- The newly projected red points are more widely spread out than in the first case, i.e. they have more variance.
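Both benefits can be checked numerically. The sketch below uses hypothetical 2-D points (each row is one point) and compares an arbitrary candidate line, here the horizontal axis, with the principal axis. The helper `project` is an assumption made for illustration. Strictly speaking, PCA minimizes the sum of squared perpendicular distances; for this data the plain sum of distances described above is smaller as well.

```python
import numpy as np

# Hypothetical 2-D points lying roughly along the line y = x (one point per row).
points = np.array([
    [1.0, 1.2], [2.0, 1.9], [3.0, 3.1], [4.0, 3.8], [5.0, 5.0],
])
centered = points - points.mean(axis=0)

def project(centered_points, direction):
    """Return (total projection error, variance of the projected 1-D coordinates)."""
    u = np.asarray(direction, dtype=float)
    u = u / np.linalg.norm(u)                    # unit vector along the line
    coords = centered_points @ u                 # 1-D coordinate of each projection
    residuals = centered_points - np.outer(coords, u)
    error = np.linalg.norm(residuals, axis=1).sum()   # sum of perpendicular distances
    return error, coords.var()

# Candidate 1: an arbitrary line, here the horizontal axis.
err_axis, var_axis = project(centered, [1.0, 0.0])

# Candidate 2: the principal axis, i.e. the eigenvector of the covariance
# matrix that has the largest eigenvalue.
cov = centered.T @ centered / len(centered)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
principal_axis = eigenvectors[:, -1]             # eigh sorts eigenvalues ascending
err_pca, var_pca = project(centered, principal_axis)

print(err_axis, var_axis)   # larger error, smaller variance
print(err_pca, var_pca)     # smaller error, larger variance
```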