You may not know this about PCA
In this article, we are going to contrast two dimensionality reduction techniques: Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA).
By: Daksh Bhatnagar
Over time, the size of data has grown substantially. Businesses nowadays want to consider each and every aspect before making a decision, which translates to higher dimensionality in real-life data.
High-dimensional data is usually considered problematic in data science, and the problem has been given the name Curse of Dimensionality. The Curse of Dimensionality refers to a set of problems that arise when working with high-dimensional data.
Some of the difficulties that come with high dimensional data manifest during analyzing or visualizing the data to identify patterns, and some manifest while training machine learning models.
The difficulties related to training machine learning models due to high dimensional data are referred to as the ‘Curse of Dimensionality’.
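One commonly cited symptom of this curse is that distances between points become less informative as dimensions are added. The sketch below is only an illustration on random uniform data; the function name `distance_contrast` and the sample sizes are assumptions made for this example, not something from a particular library.

```python
import numpy as np

def distance_contrast(n_dims, n_points=500, seed=0):
    """Ratio of the farthest to the nearest distance from one reference point."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))      # uniform points in the unit cube
    reference = points[0]
    distances = np.linalg.norm(points[1:] - reference, axis=1)
    return distances.max() / distances.min()

contrast_low = distance_contrast(2)       # 2 features
contrast_high = distance_contrast(1000)   # 1000 features

print(f"farthest/nearest ratio with 2 features:    {contrast_low:.2f}")
print(f"farthest/nearest ratio with 1000 features: {contrast_high:.2f}")
```

When the ratio gets close to 1, the nearest and the farthest neighbour are almost equally far away, which is one reason distance-based models tend to struggle on high-dimensional data.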
Luckily for us, there is a concept called dimensionality reduction. The idea is to reduce the number of dimensions to the point where the remaining independent features still add value to the predictive model, but are not so numerous that the model gets lost and makes poor predictions.
Principal Component Analysis
Principal Component Analysis is a way to reduce the number of variables while maintaining the majority of the important information. It transforms a number of variables that may be correlated into a smaller number of uncorrelated variables, known as principal components.
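To make the "correlated in, uncorrelated out" idea concrete, here is a minimal sketch using scikit-learn on synthetic data. The variables `x1`, `x2` and `x3` are made up for this illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=500)   # strongly correlated with x1
x3 = rng.normal(size=500)                     # independent of the other two
X = np.column_stack([x1, x2, x3])

# Reduce the 3 original variables to 2 principal components
components = PCA(n_components=2).fit_transform(X)

corr_before = np.corrcoef(X, rowvar=False)[0, 1]           # x1 vs x2
corr_after = np.corrcoef(components, rowvar=False)[0, 1]   # PC1 vs PC2

print(f"correlation between x1 and x2:   {corr_before:.3f}")
print(f"correlation between PC1 and PC2: {corr_after:.3f}")
```

The two original variables are almost perfectly correlated, while the correlation between the two principal components is zero up to floating-point error.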
The main objective of PCA is to simplify your model features into fewer components to help visualize patterns in your data and to help your model run faster.
Using PCA can reduce the chance of overfitting your model, because highly correlated features are combined into a smaller number of uncorrelated components.
What’s happening under the hood is that the algorithm first computes the covariance matrix of the data and then calculates its eigenvalues and eigenvectors. PCA projects the data onto new axes and keeps the axes along which the data has the highest variance, because the higher the spread of the data along an axis, the more that axis can explain.
Eigenvalues and eigenvectors always come in pairs, so every eigenvector has an eigenvalue and their number is equal to the number of dimensions of the data.
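A small numpy sketch of this pairing, on random data with 4 features (the data itself is an assumption made just for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))              # 200 samples, 4 features

cov = np.cov(X, rowvar=False)              # 4 x 4 covariance matrix
eigen_values, eigen_vectors = np.linalg.eigh(cov)

# Column i of eigen_vectors is paired with eigen_values[i]:
# multiplying by the covariance matrix only rescales the vector.
for i in range(len(eigen_values)):
    v = eigen_vectors[:, i]
    assert np.allclose(cov @ v, eigen_values[i] * v)

print(eigen_values.shape, eigen_vectors.shape)   # (4,) (4, 4)
```

With 4 features we get 4 eigenvalues and 4 eigenvectors. Note that numpy stores the eigenvectors as columns, not rows.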
There is something known as Explained Variance, which tells you how much of the variance in the data is explained by each principal component. Ideally, you want to retain at least 90% of the variance, meaning the components you keep should explain 90% of the variation that’s going on in your data.
Sometimes 1 component could explain 98% of the variance, while at other times 1 component could explain only 50% of the variance in the data, which is exactly where the explained variance plot comes in handy.
In the chart above, we can see that the first component alone explains only 53% of the variance, while 6 components (derived from 30 features) explain 91% of the variance.
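The same kind of reading can be done in code. The sketch below builds a synthetic dataset with 30 correlated features (an assumption for illustration, not the dataset behind the chart) and finds how many components are needed to reach 90% cumulative explained variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 5))                        # 5 hidden factors
mixing = rng.normal(size=(5, 30))
X = latent @ mixing + 0.5 * rng.normal(size=(300, 30))    # 30 correlated features

X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)                                 # keep all 30 components

# Running total of the share of variance explained by the first k components
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 90%
n_components_90 = int(np.searchsorted(cumulative, 0.90) + 1)

print("variance explained by the first component:", round(cumulative[0], 3))
print("components needed for 90% of the variance:", n_components_90)
```

Plotting `cumulative` against the number of components gives the explained variance plot described above.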
When we plot out the first four eigenvectors after the linear transformation, here is what it looks like:
In n-dimensional space, what’s happening is that candidate axes are considered, the data points are projected onto each of them, and the axis along which the projected data has the most variance is selected. The data points are then transformed by multiplying them with the top-n eigenvectors.
We also plotted the first vector before and after the linear transformation. The light blue is the vector before the transformation and the dark blue is the vector after the transformation.
Libraries like numpy have made it very easy for us to transform the data and visualize it. You can use the code below to get the eigenvectors and eigenvalues.
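import numpy as np
import pandas as pd
# Extracting input values: one array per feature column of inputs_df
column_values = []
for i in range(len(inputs_df.columns)):
    column_values.append(inputs_df.iloc[:, i].values)
# Making the covariance matrix (np.cov treats each row as one variable)
covariance_matrix = np.cov(column_values)
# Getting the eigenvectors and the eigenvalues (eigh is intended for symmetric matrices)
eigen_values, eigen_vectors = np.linalg.eigh(covariance_matrix)
# Sorting the eigenvectors by eigenvalue, largest first
order = np.argsort(eigen_values)[::-1]
eigen_vectors = eigen_vectors[:, order]
# Selecting the top n eigenvectors; they are the columns of eigen_vectors, not the rows
pc = eigen_vectors[:, 0:6].T
# Transforming the data points: center each feature, then project onto the eigenvectors
features = df.iloc[:, 0:30]
transformed_df = np.dot(features - features.mean(), pc.T)
new_df = pd.DataFrame(transformed_df,
                      columns=['PC1','PC2','PC3', 'PC4', 'PC5', 'PC6'])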
The above implementation used the numpy library. You can also use the code below, which uses scikit-learn for the same purpose.
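from sklearn import decomposition
# A float between 0 and 1 keeps as many components as are needed to explain that share of the variance (here 90%)
pca = decomposition.PCA(0.90)
# X is the feature matrix (samples in rows, features in columns)
X_transformed = pca.fit_transform(X)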
PCA tries to put the maximum possible information in the first component, then the maximum remaining information in the second, and so on, which means the last eigenvectors (the last components) will have the least variance.
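This ordering is easy to check. In the sketch below (synthetic data with deliberately different spreads per feature, chosen only for illustration), the variance of each transformed column goes down from the first component to the last, and it matches the sorted eigenvalues of the covariance matrix.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 5 features with different spreads
X = rng.normal(size=(300, 5)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5])

pca = PCA().fit(X)
scores = pca.transform(X)

# Variance of the data along each principal component
score_variances = scores.var(axis=0, ddof=1)

# Eigenvalues of the covariance matrix, sorted from largest to smallest
eigen_values = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]

print(np.round(score_variances, 2))
print(np.round(eigen_values, 2))
```

Both printed arrays should agree: the eigenvalue paired with each eigenvector is the amount of variance that component carries.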
The final plot of the first two Principal Components would look something like this:
LINEAR DISCRIMINANT ANALYSIS
Linear Discriminant Analysis, or LDA for short, is a predictive modeling algorithm for multi-class classification. It is also used as a dimensionality reduction technique, providing a projection of a training dataset that best separates the examples by their assigned class.
Linear Discriminant Analysis is used to find a linear combination of features that characterizes or separates two or more classes of objects or events. It explicitly attempts to model the difference between the classes of data.
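A minimal sketch of LDA used both ways with scikit-learn. The Iris dataset that ships with scikit-learn is used here purely as an example; it is not the dataset from this article.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)      # 150 samples, 4 features, 3 classes

# With 3 classes, LDA can produce at most 3 - 1 = 2 discriminant axes
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)        # note: the class labels y are required

train_accuracy = lda.score(X, y)       # the same fitted model also classifies

print(X_lda.shape)
print(f"training accuracy: {train_accuracy:.2f}")
```

Unlike PCA, `fit_transform` here needs the labels `y`, because the projection is chosen to separate the classes.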
Drawbacks of Linear Discriminant Analysis (LDA)
Although LDA is specifically used to solve supervised classification problems for two or more classes, it fails in cases where the class distributions share the same mean. In this case, LDA fails to create a new axis that makes the classes linearly separable.
To overcome such problems, we use non-linear discriminant analysis in machine learning.
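Here is a small sketch of that failure case. Two synthetic classes share the same mean and differ only in spread. Quadratic Discriminant Analysis (QDA) from scikit-learn is used as one example of a non-linear alternative; that choice is an assumption for this illustration, since other non-linear variants exist.

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

rng = np.random.default_rng(0)
# Both classes are centred at the origin; only their spread differs
class_a = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
class_b = rng.normal(loc=0.0, scale=5.0, size=(500, 2))
X = np.vstack([class_a, class_b])
y = np.array([0] * 500 + [1] * 500)

lda_accuracy = LinearDiscriminantAnalysis().fit(X, y).score(X, y)
qda_accuracy = QuadraticDiscriminantAnalysis().fit(X, y).score(X, y)

print(f"LDA training accuracy: {lda_accuracy:.2f}")
print(f"QDA training accuracy: {qda_accuracy:.2f}")
```

LDA should land close to a coin flip here because no straight line separates the two classes, while QDA can draw a curved boundary around the tighter class.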
Two criteria are used by LDA to create a new axis:
- Maximize the distance between the means of the two classes.
- Minimize the variation within each class.
The cost function in LDA is a ratio built from these two criteria. The numerator, the distance between the class means, should be as large as possible, since that implies the means are far apart and the classes can be separated well. The denominator, the variation within each class, should be as small as possible, since tightly grouped classes are easier for the model to classify.
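Written out for two classes, and assuming the cost function meant here is the standard Fisher criterion (an assumption, since the text only describes its numerator and denominator), it takes the following form. Here $w$ is the new axis, $\mu_1$ and $\mu_2$ are the class means, and $\tilde{s}_k^2$ is the scatter of class $k$ after projecting onto $w$:

```latex
J(w) = \frac{\left(w^{\top}\mu_1 - w^{\top}\mu_2\right)^2}{\tilde{s}_1^{2} + \tilde{s}_2^{2}},
\qquad
\tilde{s}_k^{2} = \sum_{x \in \text{class } k} \left(w^{\top}x - w^{\top}\mu_k\right)^2
```

LDA looks for the axis $w$ that makes $J(w)$ as large as possible.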
DIFFERENCE BETWEEN PCA AND LDA
Both techniques focus on reducing dimensionality and use eigenvalues and eigenvectors under the hood. The major difference is that PCA doesn’t take the class labels into account, while the purpose of LDA is to find a decision boundary between the classes (meaning it focuses on making the data linearly separable).
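One way to see this difference is to shuffle the class labels and check whether the projection changes. This is a rough sketch on the Iris dataset bundled with scikit-learn, which is an assumption for illustration; the variable names are made up for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
y_shuffled = rng.permutation(y)        # same labels, assigned to random rows

# PCA accepts y only for API consistency and ignores it
pca_original = PCA(n_components=2).fit_transform(X, y)
pca_shuffled = PCA(n_components=2).fit_transform(X, y_shuffled)

# LDA builds its axes from the labels
lda_original = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
lda_shuffled = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y_shuffled)

pca_unchanged = np.allclose(pca_original, pca_shuffled)
lda_unchanged = np.allclose(lda_original, lda_shuffled)

print("PCA projection unchanged after shuffling labels:", pca_unchanged)
print("LDA projection unchanged after shuffling labels:", lda_unchanged)
```

The PCA projection stays the same no matter what the labels say, while the LDA projection depends on them.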
- The difficulties related to training machine learning models due to high dimensional data are referred to as the ‘Curse of Dimensionality’.
- Principal Component Analysis is a way to reduce the number of variables while maintaining the majority of the important information.
- Linear Discriminant Analysis is used to find a linear combination of features that characterizes or separates two or more classes of objects or events. (linearly separable)
- PCA doesn’t take the class labels into account, and LDA fails to create a new separating axis when the class distributions share the same mean.
- Up Next, I’ll be covering more Machine Learning Algorithms and how they compare and contrast with each other.
- If you liked the tips and they proved to be helpful to you, I’d appreciate it if you can give the article a clap and follow me for more upcoming Data Science, Machine Learning, and Artificial Intelligence articles.
Final Thoughts and Closing Comments
There are some vital points many people fail to understand while they pursue their Data Science or AI journey. If you are one of them and are looking for a way to fill these gaps, check out the certification programs provided by INSAID on their website. If you liked this article, I recommend you go with the Global Certificate in Data Science & AI because this one will cover your foundations, machine learning algorithms, and deep neural networks (basic to advanced).