Principal Component Analysis with Illustration in Python

Binil
Nov 7, 2018

PCA, or principal component analysis, is a key technique in Machine Learning, used for dimensionality reduction. A machine learning problem often comes with a large number of features or dimensions, and it is rarely obvious which of them are important and which are not. PCA is a principled way of finding the important components (fewer dimensions) to consider for Machine Learning, thereby helping to reduce the curse of dimensionality.

Reducing dimensions through PCA sacrifices very little useful information.
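As a quick illustration of this claim, here is a minimal sketch on synthetic data (the shapes, seed, and variable names are all illustrative, not from the article): we project correlated data onto fewer components and reconstruct it, and the small reconstruction error shows how little information was lost.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
latent = rng.normal(size=(200, 3))        #3 underlying factors
X = latent @ rng.normal(size=(3, 10))     #10 correlated observed features
X += 0.01 * rng.normal(size=X.shape)      #a little measurement noise

pca = PCA(n_components=3)                 #keep only 3 of the 10 dimensions
X_back = pca.inverse_transform(pca.fit_transform(X))
print(np.mean((X - X_back) ** 2))         #tiny mean squared reconstruction error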

Some of the problems with high dimensionality are the following:

1. Not all features are useful for prediction. Some are more useful than others, and more features mean more computing and processing time.

2. More features make it harder for the model to generalize the given problem, which can result in overfitting.

Hence, for high-dimensional data sets it is vital to reduce the complexity of the problem with feature-reduction techniques like PCA. PCA removes collinearity by discarding redundant directions in the data, so the resulting components are uncorrelated with each other, as the sketch below illustrates.
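A small sketch of that property (synthetic data and illustrative names, not from the article): even when the inputs are collinear, the correlation matrix of the PCA-transformed data is essentially the identity.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(1)
X = rng.normal(size=(500, 4))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=500)   #make one feature collinear
Z = PCA().fit_transform(X)
print(np.round(np.corrcoef(Z.T), 3))             #off-diagonal entries ~ 0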

PCA results in a new set of dimensions or axes, so interpretability can be an issue: the components are combinations of the original features rather than the features themselves.

How PCA works:

The data is typically spread across several dimensions, and each dimension can be treated as an axis in a coordinate system. PCA looks for the new axis along which the spread of the data is greatest.

That axis becomes the first principal component. Each subsequent principal axis is the direction of maximum remaining variance that is orthogonal to all the previous ones. Because the axes are mutually orthogonal, the resulting components are uncorrelated.

PCA is based on the mathematical concepts of eigenvalues and eigenvectors. Further reading on eigenvalues and eigenvectors is available at the link below:

http://www.math.union.edu/~jaureguj/PCA.pdf
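For intuition, here is a from-scratch sketch of the same idea (the function and variable names are mine, not from the article): centre the data, eigendecompose its covariance matrix, and project onto the eigenvectors with the largest eigenvalues.

import numpy as np

def pca_from_scratch(X, k):
    Xc = X - X.mean(axis=0)                  #centre each feature
    cov = np.cov(Xc, rowvar=False)           #covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)   #eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        #re-sort by variance, descending
    top_k = eigvecs[:, order[:k]]            #the k leading principal axes
    return Xc @ top_k                        #data projected onto those axes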

Selecting the number of components:

We can either fix the number of principal components arbitrarily, or make the choice more principled by looking at the percentage of variance explained by each component. We can set a threshold for the percentage of variance to be explained by PCA and select the number of components accordingly. The method is shown in the code below.
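scikit-learn can also do this selection for us: passing a float between 0 and 1 as n_components tells PCA to keep just enough components to explain that fraction of the variance. A minimal sketch (the 0.80 threshold and the random data here are illustrative):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(2)
X = rng.normal(size=(300, 13))
pca = PCA(n_components=0.80)   #keep enough components to explain 80% of variance
pca.fit(X)
print(pca.n_components_)       #number of components PCA actually kept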

PCA implementation in Python:

Let us use scikit-learn's built-in Boston housing data set to illustrate PCA. (Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this walkthrough requires an older version.)

#Loading required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.decomposition import PCA

#Loading the data set
boston = load_boston()

#Creating a pandas DataFrame from the data set
df = pd.DataFrame(boston['data'])
df.shape

(506, 13) #the DataFrame has 506 rows and 13 columns, or features

df.head()
[Output: the first five rows of the DataFrame]
#We need to standardize the DataFrame, as different variables
#are on different scales
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df)
scaled_data = scaler.transform(df)
pd.DataFrame(scaled_data).head()   #scaled_data is a NumPy array, so wrap it first
[Output: a sample of the scaled data]
#Applying PCA to the scaled data. First we set n_components to the
#maximum (13, the number of input variables) to understand the relative
#importance of each of the transformed axes
pca = PCA(n_components=13)
pca.fit(scaled_data)
x_pca = pca.transform(scaled_data)
pd.DataFrame(x_pca).head()
[Output: a sample of the transformed data after applying PCA]
#Understanding the variance explained by each axis
var = pca.explained_variance_ratio_
var1 = np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4 )*100)
plt.plot(var1)
[Plot: cumulative variance explained by the 13 axes/components]
var
[Output: proportion of variance explained by each component, in descending order]

var1
[Output: cumulative variance by component]

Once we understand the variance explained by each of the transformed axes, we can decide how many components to use in our modelling.

For example, the first principal component alone explains 47% of the variance in the data, and the first 5 components together account for 80%. Hence, based on how much variance needs to be explained, we can select the number of components.
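Continuing the walkthrough (this sketch assumes the scaled_data array prepared above; pca5 and x_reduced are illustrative names), we can refit with the chosen 5 components and use the reduced data for modelling:

pca5 = PCA(n_components=5)                     #keep the 5 leading components
x_reduced = pca5.fit_transform(scaled_data)
x_reduced.shape                                #(506, 5): 13 features reduced to 5
print(pca5.explained_variance_ratio_.sum())    #~0.80 of the variance retained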

I hope this article was helpful. Please post your questions in the comments section.
