EDA with PySpark

You can find the notebook with the code and plots here.

Installing & Importing Libraries

!pip install -q pyspark
!pip install -q handyspark
from pyspark.sql import SparkSession
from handyspark import *  # adds toHandy() to Spark DataFrames

Initializing a Spark Session

# Build (or reuse) a Spark application/session
spark = SparkSession.builder.appName("carsSpark").getOrCreate()
# Display the session details (a single local cluster)
spark

About the Dataset

# Read the data, using the first row as the header and inferring column types
df = spark.read.csv('/content/car data.csv', header=True, inferSchema=True)
# Shape of the dataset as (rows, columns)
print('Shape of the dataset: ', (df.count(), len(df.columns)))
# Display the top n=10 rows
df.show(n=10)

Data Preprocessing

The dataset doesn't have any null values.

Exploratory Data Analysis

hdf = df.toHandy()
hdf.show()
hdf.stratify(['Transmission']).cols['Selling_Price'].mean()

OUTPUT:
Transmission
Automatic 9.420000
Manual 3.931992
Name: Selling_Price, dtype: float64

Conclusion
