The Tidbits of Apache Spark

By Ashish Lepcha and Hiren Rupchandani

Since it began as a research project at the AMPLab at U.C. Berkeley in 2009, Apache Spark has become one of the key big data distributed processing frameworks in the world. It can be deployed in numerous ways, provides APIs for Java, Scala, Python, and R, and also supports SQL, streaming data, machine learning, and graph processing.

  • Apache Spark is a data processing framework that can quickly run processing jobs on very large datasets and can distribute those jobs across multiple computers, either on its own or in combination with other distributed computing tools.

Why Spark for Big Data?

Hadoop MapReduce is a programming model for processing big datasets with a parallel, distributed algorithm. Developers can write massively parallelized operators without having to worry about work distribution and fault tolerance.

However, an issue with MapReduce is the sequential multi-step process it takes to execute a job. At every step, MapReduce reads data from the cluster, performs its operations, and writes the results back to HDFS. Because each step requires a disk read and a disk write, MapReduce jobs are slower, as the latency of disk I/O adds up.

  • Spark was created to address the limitations of MapReduce by doing processing in memory, reducing the number of steps in a job, and reusing data across multiple parallel operations.
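
To make the contrast concrete, here is a minimal PySpark sketch (assuming a local Spark installation and a hypothetical input file events.txt): the filtered RDD is cached in memory once and then reused by two separate actions, instead of being re-read from disk at every step as in a MapReduce-style pipeline.

    # Minimal sketch: cache an intermediate RDD in memory and reuse it
    # across multiple actions. "events.txt" is a hypothetical input file.
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "in-memory-reuse-demo")

    events = sc.textFile("events.txt")                            # lazy read of the input
    errors = events.filter(lambda line: "ERROR" in line).cache()  # keep the result in memory

    print(errors.count())   # first action: computes the RDD and caches it
    print(errors.take(5))   # second action: served from memory, no re-read from disk

    sc.stop()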

Features of Apache Spark

  1. Fast processing — The most important feature of Apache Spark, and the one that has made the big data world choose it over other technologies, is its speed. Big data is characterized by the 6 Vs and needs to be processed at high speed. Spark's Resilient Distributed Datasets save time on read and write operations, allowing it to run roughly ten to one hundred times faster than Hadoop MapReduce.

Apache Spark Architecture

  1. Apache Spark Core — Spark Core is the underlying general execution engine of the Spark platform on which all other functionality is built. It provides in-memory computing and the ability to reference datasets in external storage systems.
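
Spark Core is exposed to an application through the SparkContext, which recent versions wrap in a SparkSession. A minimal sketch of creating this entry point, assuming only that PySpark is installed and using local mode for illustration:

    # Minimal sketch of the Spark entry point. The SparkSession wraps
    # Spark Core's SparkContext, which coordinates the computation.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("spark-core-demo")
             .master("local[*]")      # run on all local cores, for testing
             .getOrCreate())

    sc = spark.sparkContext           # the underlying Spark Core handle
    print(sc.version)

    spark.stop()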

Apache Spark Cluster Manager

  • A cluster is a set of tightly or loosely coupled computers connected through a local area network (LAN).
  • Apache Spark is a cluster-computing framework on which applications run as independent sets of processes.

Features of Cluster Manager:

  • Each Apache Spark application gets its own executor processes, which run its tasks and stay alive for the duration of the application's execution cycle.
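
As an illustration, an application can request its own executors through standard Spark configuration properties. This is only a sketch: the values below are illustrative, not recommendations, and the exact behavior of these properties depends on the cluster manager in use.

    # Hedged sketch: ask the cluster manager for dedicated executor processes.
    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    conf = (SparkConf()
            .setAppName("executor-config-demo")
            .set("spark.executor.instances", "2")   # number of executor processes
            .set("spark.executor.cores", "2")       # cores per executor
            .set("spark.executor.memory", "2g"))    # memory per executor

    spark = SparkSession.builder.config(conf=conf).getOrCreate()
    # ... the executors stay alive for the lifetime of this application ...
    spark.stop()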

Apache Spark mainly supports three types of cluster managers:

  1. Standalone
  2. Apache Mesos
  3. Hadoop YARN

Apache Spark Standalone Cluster

  • The Apache Spark Standalone configuration provides a simple cluster setup with a master node and worker nodes, and it can be used for testing purposes. The master is started with:
./sbin/start-master.sh
  • Once the Apache Spark master node has started, it prints a URL of the form spark://HOST:PORT that can be used to connect Spark worker nodes to it:
./sbin/start-slave.sh <master-spark-URL>
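
Once the master and workers are running, an application can be pointed at the standalone master. A minimal sketch, using a hypothetical host name and the default standalone port 7077:

    # Minimal sketch: connect a PySpark application to a running standalone
    # master instead of local mode. "my-master-host" is a hypothetical host.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("standalone-demo")
             .master("spark://my-master-host:7077")   # URL printed by start-master.sh
             .getOrCreate())

    print(spark.sparkContext.master)
    spark.stop()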

Apache Mesos

  • Apache Spark can also run on a cluster of nodes managed by Apache Mesos.

Apache Hadoop YARN

  • Apache Spark can be deployed on the Hadoop YARN resource manager as well.
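
A hedged sketch of targeting YARN from PySpark; it assumes HADOOP_CONF_DIR or YARN_CONF_DIR is set so that Spark can locate the YARN ResourceManager (in practice, YARN applications are usually launched via spark-submit):

    # Hedged sketch: assumes HADOOP_CONF_DIR / YARN_CONF_DIR point at a valid
    # Hadoop configuration; otherwise this will fail to connect.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("yarn-demo")
             .master("yarn")          # let YARN allocate the executors
             .getOrCreate())

    spark.stop()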

Resilient Distributed Dataset (RDD)

  • The RDD (Resilient Distributed Dataset) is the fundamental data structure of Apache Spark: an immutable collection of objects whose partitions are computed on different nodes of the cluster.
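
A minimal sketch of creating an RDD from an in-memory collection; parallelize splits the data into partitions that can be computed on different nodes (here, local cores):

    # Minimal sketch: build an RDD from a Python collection. The data is
    # split into partitions that can be processed on different nodes.
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "rdd-demo")

    numbers = sc.parallelize(range(10), numSlices=4)   # 4 partitions
    print(numbers.getNumPartitions())                  # -> 4
    print(numbers.collect())                           # RDDs are immutable; ops return new RDDs

    sc.stop()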

Features of an RDD in Spark

  • Resilience: RDDs track data lineage information so that lost data can be recovered automatically on failure. This property is also called fault tolerance.
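
The lineage that makes this recovery possible can be inspected directly. A small sketch (the format of the toDebugString output is informal and may vary between Spark versions):

    # Small sketch: inspect the lineage Spark records for an RDD. This is the
    # information used to recompute lost partitions after a failure.
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "lineage-demo")

    rdd = (sc.parallelize(range(100))
             .map(lambda x: x * 2)
             .filter(lambda x: x % 3 == 0))

    print(rdd.toDebugString().decode("utf-8"))   # shows the chain of parent RDDs

    sc.stop()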

Operations of RDD

Two kinds of operations can be applied to an RDD: transformations and actions.

Transformations

  • Transformations are operations that you perform on an RDD to get a result that is also an RDD.
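
A minimal sketch of transformations; note that nothing is computed yet, because transformations are lazy and only describe the new RDDs:

    # Minimal sketch of transformations: each call returns a new RDD and
    # no job runs yet, because transformations are lazy.
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "transformations-demo")

    words = sc.parallelize(["spark", "hadoop", "spark", "rdd"])
    pairs = words.map(lambda w: (w, 1))             # transformation: RDD of (word, 1)
    counts = pairs.reduceByKey(lambda a, b: a + b)  # transformation: still nothing computed

    sc.stop()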

Actions

  • Actions return results to the driver program or write them to storage, and they kick off a computation.
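
A minimal sketch of actions on the same word-count style RDD; the output path in the commented line is hypothetical:

    # Minimal sketch of actions: count() and collect() ship results back to
    # the driver and are what actually trigger the computation.
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "actions-demo")

    words = sc.parallelize(["spark", "hadoop", "spark", "rdd"])
    counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

    print(counts.count())     # action: number of distinct words
    print(counts.collect())   # action: [(word, count), ...] on the driver
    # counts.saveAsTextFile("word_counts_out")   # action: write to storage (hypothetical path)

    sc.stop()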

When to use RDD?

  1. When we want low-level transformations and actions, and control over the dataset.

Limitations of RDD:

i. No inbuilt optimization engine: When working with structured data, RDDs cannot take advantage of Spark’s advanced optimizers, such as the Catalyst optimizer and the Tungsten execution engine. Developers need to optimize each RDD based on its attributes.

ii. Handling structured data: Unlike DataFrames and Datasets, RDDs don’t infer the schema of the ingested data and require the user to specify it (see the sketch after this list).

iii. Performance limitation: Being in-memory JVM objects, RDDs involve the overhead of garbage collection and Java serialization, which become expensive as data grows.

iv. Storage limitation: RDD performance degrades when there is not enough memory to store them. Partitions that do not fit in RAM can be stored on disk, in which case performance falls back to roughly that of traditional disk-based data-parallel systems.
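
To illustrate limitation (ii), the sketch below contrasts the same records held as an RDD of plain tuples with a DataFrame, where Spark attaches a schema and can apply its Catalyst/Tungsten optimizations. The column names and sample data are made up for illustration:

    # Hedged sketch: the same records as an RDD (no schema) versus a
    # DataFrame, which carries column names and types and goes through
    # Spark's optimizer.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("rdd-vs-dataframe")
             .master("local[*]")
             .getOrCreate())

    rows = [("alice", 34), ("bob", 29)]

    rdd = spark.sparkContext.parallelize(rows)          # opaque Python tuples, no schema
    df = spark.createDataFrame(rows, ["name", "age"])   # schema: name string, age bigint

    df.printSchema()
    df.filter(df.age > 30).show()                       # query plan goes through the optimizer

    spark.stop()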

Conclusion

This sums up a very high-level overview of Spark and why it is often favored over popular frameworks like Apache Hadoop MapReduce.

What’s Next?

That’s it for the theoretical part. In the next article, we will focus on the hands-on implementation of Spark with the help of its Python API, PySpark.

Follow us for more upcoming future articles related to Data Science, Machine Learning, and Artificial Intelligence.

Also, do give us a Clap👏 if you find this article useful, as your encouragement catalyzes inspiration and helps us create more cool stuff like this.
