EDA with PySpark
By Hiren Rupchandani and Abhinav Jangir
- PySpark is an interface for Apache Spark in Python.
- It not only allows you to write Spark applications using Python APIs but also provides the PySpark shell for interactively analyzing your data in a distributed environment.
- PySpark supports most of Spark’s features such as Spark SQL, DataFrame, Streaming, MLlib (Machine Learning), and Spark Core.
- In this case study, we will create a basic Spark cluster, import some data, explore the main dataframe functionalities, and perform EDA on the data using PySpark.
You can find the notebook with the code and plots here.
- So let’s get started!
Installing & Importing Libraries
Before we begin, we need to install two important libraries for this case study:
- PySpark: Using PySpark, you can work with RDDs and dataframes in the Python programming language. It relies on a library called Py4J to bridge Python and the JVM, and it offers the PySpark shell, which links the Python API to Spark Core and initializes the Spark context.
- HandySpark: It is a library that allows you to plot visualizations with a pyspark dataframe.
To install these libraries, simply write the following commands:
!pip install -q pyspark
!pip install -q handyspark
After installing the libraries, we need to import the pyspark library along with some important classes and modules:
- SparkContext or SparkSession is used as the entry point to any Spark functionality. When we run a Spark application, a driver program starts; it holds the main function, and your SparkContext/SparkSession is initiated there. The driver program then runs the operations inside the executors on worker nodes. SparkContext uses Py4J to launch a JVM and create a JavaSparkContext.
- SQL module functions such as col, when, and isnan, along with classes such as Window and Row.
- We are also importing some standard libraries like matplotlib and seaborn that will assist during the data visualization part.
Initializing a Spark Session
You can create a Spark session using the SparkSession.builder.appName() method with the following code:
# Building a Spark app/session
spark = SparkSession.builder.appName("carsSpark").getOrCreate()
# single cluster information
spark
- Here, we have named our app “carsSpark” and used the getOrCreate() function to retrieve the session if it already exists, or create it otherwise.
- You get the following output after writing the above code:
About the Dataset
- We will use the CarDekho Indian Car Price dataset that can be found here. It is a simple dataset that can be used for this exercise.
- The dataset has the following features:
- PySpark has its own dataframe and functionalities.
- Here we are going to read a single CSV into the dataframe using spark.read.csv() and then use that dataframe to perform the analysis:
# Reading the data
df = spark.read.csv('/content/car data.csv', header=True, inferSchema=True)
# Shape of the dataset
print('Shape of the dataset: ', (df.count(), len(df.columns)))
# Displaying top n=10 rows
df.show(10)
- We set the inferSchema option to True, so Spark goes through the CSV file and infers the column types for the PySpark dataframe automatically.
- We get the following output for the above command:
df.describe().show() allows us to see a general description of the dataset, similar to what pandas' describe() provides.
- We get the following output:
- We can further use df.printSchema() to get the schema information of the dataset:
- PySpark dataframes don’t have readily available, easy-to-run functions to check for data inconsistencies and null values.
- We use a helper function, built on PySpark’s inbuilt SQL functions, to check for null values in our dataframe:
- The dataset doesn't need any preprocessing so we can simply proceed with the EDA of this dataset.
Exploratory Data Analysis
- PySpark dataframes do not support visualizations like pandas does with its plot() method.
- A lot of users simply convert their dataframe to pandas, which does not translate well to real-world big data because pandas requires the whole dataset to be in memory for processing and visualization.
- But there is another library that bridges this gap by giving pandas-like functionality to your PySpark dataframe without the memory limitations of pandas: HandySpark.
- We can simply convert our pyspark dataframe to handyspark dataframe using the following code:
hdf = df.toHandy()
- HandySpark provides a hist() function that can be used to plot histograms and bar plots.
- Some of the univariate analyses that use this hist() function are as follows:
Question 1: What is the spread of the
- We see a right-skewed distribution.
- The maximum price reaches about 35 lacs, which causes the skew.
- The majority of the cars have price listings under 15 lacs.
Question 2: What is the distribution of classes for the
- More than 175 sellers are dealers, and the rest are individuals looking to sell their cars.
Question 3: What is the distribution of classes for the
- Most of the listed cars (more than 250) have a Manual transmission.
Question 4: What is the distribution of classes for
- We have three owner types where 0 indicates brand new cars (0 previous owners), 1 indicates that the car is second-hand, and 3 indicates that there were 3 owners before the car was listed.
- There is just one row that has Owner=3.
Let’s see some multivariate analysis. HandySpark also allows you to plot a scatterplot with its scatterplot() function. Let’s see some examples:
Question 5: Determine the relation between
- We don’t observe any linear relation between Kms_Driven and Selling_Price.
Question 6: Determine the relation between
- We observe a slight linear relation between the present price and the selling price.
Question 7: Determine the relation between
Transmission Type features?
- HandySpark offers a unique take on groupby-like commands with its stratify() method, which follows the split-apply-combine approach.
- It will first split your HandyFrame according to the specified (discrete) columns, then it will apply some function to each stratum of data and finally combine the results back together.
- This is better illustrated with an example — let’s try the stratified version of
Name: Selling_Price, dtype: float64
- The mean value for Manual Transmission cars is 3.9 lacs and it is 9.4 lacs for Automatic. Let’s see the distribution of the same:
- Most of the cars are of Manual transmission and are available under 20 lacs.
- Automatic transmission cars are spread roughly evenly across all price ranges.
Question 8: Determine the relation between
- Individual sellers list their cars at lower prices than dealers do.
Question 9: Determine the relation between
- CNG cars are the cheapest, with selling prices below 5 lacs.
- Petrol cars make up most of the listings, with a maximum selling price of around 20 lacs.
- Diesel car prices can go as high as 35 lacs.
Question 10: Plot a heatmap to check for correlations between features?
- Selling_Price is highly correlated with Present_Price and is mildly influenced by the Year of manufacture as well.
- Present_Price is only slightly related to Kms_Driven.
- Kms_Driven and Year are inversely correlated, indicating that older cars have generally been driven more.
- We have now performed EDA on the cars data and extracted some important insights that can be useful for model building.
- You can find the notebook here.
- In the next article, we will use a VectorAssembler for preparing our data for the machine learning model.
- This will be followed by training and evaluating a linear regression model, where we will observe whether the model is able to fit the data properly or not.
Follow us for more upcoming articles related to Data Science, Machine Learning, and Artificial Intelligence.
Also, do give us a Clap👏 if you find this article useful; your encouragement inspires us and helps us create more cool stuff like this.