EDA with PySpark

  • PySpark is an interface for Apache Spark in Python.
  • It not only allows you to write Spark applications using Python APIs but also provides the PySpark shell for interactively analyzing your data in a distributed environment.
  • PySpark supports most of Spark’s features such as Spark SQL, DataFrame, Streaming, MLlib (Machine Learning), and Spark Core.
  • In this case study, we will create a basic Spark cluster, import some data, explore the various dataframe functionalities, and perform EDA on the data using PySpark.
  • So let’s get started!

Installing & Importing Libraries

Before we begin, we need to install two important libraries for this case study:

  1. PySpark: the Python API for Spark, which lets you work with RDDs and dataframes from Python. It relies on a library called Py4J to link the Python API to the Spark core; the PySpark shell uses this link and initializes the Spark context for you.
  2. HandySpark: a library that lets you plot visualizations directly from a PySpark dataframe.
!pip install -q pyspark
!pip install -q handyspark
We then import the following:

  1. SparkContext or SparkSession, which serve as the entry point to any Spark functionality. When we run a Spark application, a driver program starts; it holds the main function, and your SparkContext/SparkSession is initiated there. The driver program then runs the operations inside the executors on worker nodes. SparkContext uses Py4J to launch a JVM and creates a JavaSparkContext.
  2. SQL module functions and classes such as col, when, isnan, Window, and Row.
  3. Standard libraries such as matplotlib and seaborn, which will assist during the data visualization part.

Initializing a Spark Session

You can create a Spark session by chaining SparkSession.builder.appName() with getOrCreate(), using the following code:

# Building a spark app/session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("carsSpark").getOrCreate()
# single cluster information
spark
  • Here, we have named our app “carsSpark” and used the getOrCreate() function to retrieve the session if it already exists, or create it otherwise.
  • You get the following output after writing the above code:

About the Dataset

  • We will use the CarDekho Indian Car Price dataset that can be found here. It is a simple dataset that can be used for this exercise.
  • The dataset has the following features:
  • PySpark has its own dataframe and functionalities.
  • Here we are going to read a single CSV into the dataframe using spark.read.csv() and then use that dataframe to perform the analysis:
# Reading the data
df = spark.read.csv('/content/car data.csv', header=True, inferSchema=True)
# Shape of the dataset
print('Shape of the dataset: ', (df.count(), len(df.columns)))
# Displaying top n=10 rows
df.show(10)
  • We set the inferSchema option to True. Spark then makes an extra pass over the CSV file and infers each column's data type instead of reading every column as a string.
  • We get the following output for the above command:

Data Description

  • df.describe().show() prints summary statistics (count, mean, stddev, min, and max) for the dataset, similar to what the pandas DataFrame.describe() method does.
  • We get the following output:
  • We can further use df.printSchema() to get the schema information of the dataset:

Data Preprocessing

  • PySpark dataframes don’t have a single ready-made method for checking data inconsistencies and null values.
  • Instead, we use a small helper function, built on PySpark’s built-in SQL functions, to count the null values in each column of our dataframe:
Dataset doesn’t have null values
  • The dataset doesn't need any preprocessing so we can simply proceed with the EDA of this dataset.

Exploratory Data Analysis

  • PySpark dataframes have no built-in plotting, unlike pandas dataframes with their plot() method.
  • Many users simply convert their dataframe to pandas, but this does not scale to real-world big data because pandas requires the whole dataset to be in memory for processing and visualization.
  • HandySpark is a library that bridges this gap by giving your PySpark dataframe pandas-like functionality without inheriting that memory limitation.
  • We can convert our PySpark dataframe to a HandySpark dataframe (a HandyFrame) using the following code:
from handyspark import *  # adds the toHandy() method to PySpark dataframes
hdf = df.toHandy()
  • HandySpark provides a hist() method that can be used to plot histograms and bar plots.
  • Some univariate analyses that use hist() are as follows:

Question 1: What is the spread of the Selling_Price feature?

  • We see a right-skewed distribution.
  • The maximum price reaches 35 lacs, which causes the skew.
  • The majority of the cars have price listings under 15 lacs.

Question 2: What is the distribution of classes for the Seller_Type feature?

  • More than 175 sellers are dealers, and the rest are individuals looking to sell their cars.

Question 3: What is the distribution of classes for the Transmission feature?

  • Most of the listed cars (more than 250) have Manual transmission.

Question 4: What is the distribution of classes for the Owner feature?

  • We have three owner values: 0 indicates a brand new car (0 previous owners), 1 indicates a second-hand car (one previous owner), and 3 indicates that there were three owners before the car was listed.
  • There is just one row that has Owner=3.

Question 5: What is the relation between the Selling_Price and Kms_Driven features?

  • We don’t observe any linear relation between Kms_Driven and Selling_Price.

Question 6: What is the relation between the Selling_Price and Present_Price features?

  • We observe a slight linear relation between the present price and the selling price.

Question 7: What is the relation between the Selling_Price and Transmission features?

  • HandySpark offers its own take on groupby-like operations through the stratify() method, which follows the split-apply-combine approach.
  • It first splits your HandyFrame according to the specified (discrete) columns, then applies some function to each stratum of data, and finally combines the results back together.
  • This is better illustrated with an example. Let’s look at the mean Selling_Price stratified by the Transmission feature:
Automatic 9.420000
Manual 3.931992
Name: Selling_Price, dtype: float64
  • The mean selling price is about 3.9 lacs for Manual transmission cars and 9.4 lacs for Automatic ones. Let’s look at the distributions as well:
  • Most of the cars have Manual transmission and are priced under 20 lacs.
  • Automatic transmission cars are spread roughly evenly across the price ranges.

Question 8: What is the relation between the Selling_Price and Seller_Type features?

  • Individual sellers list their cars at lower prices than dealers do.

Question 9: What is the relation between the Selling_Price and Fuel_Type features?

  • CNG cars are the cheapest, with selling prices below 5 lacs.
  • Petrol cars account for the most listings, with a maximum selling price of around 20 lacs.
  • Diesel car prices can go as high as 35 lacs.

Question 10: What are the correlations between the features (heatmap)?

  • Selling_Price is highly correlated with Present_Price and is mildly influenced by the Year of manufacture as well.
  • Present_Price is only slightly related to Kms_Driven.
  • Kms_Driven and Year are inversely correlated, indicating that older cars have generally been driven more.


  • We have now performed EDA on the cars data and extracted some insights that can be useful for model building.
  • You can find the notebook here.
  • In the next article, we will use a VectorAssembler to prepare our data for the machine learning model.
  • This will be followed by training and evaluating a linear regression model, where we will observe whether the model is able to fit the data properly.


