The Tidbits of Apache Spark

10 min read · Mar 13, 2022

By Ashish Lepcha and Hiren Rupchandani

Since it began as a research project at the AMPLab at U.C. Berkeley in 2009, Apache Spark has become one of the key distributed processing frameworks for big data worldwide. It can be deployed in numerous ways, supports the Java, Scala, Python, and R programming languages, and also supports SQL, streaming data, machine learning, and graph processing.

Apache Spark is a data processing framework that can quickly run processing tasks on very large datasets and can distribute those tasks across multiple computers, either on its own or in combination with other distributed computing tools.
These two traits are key to the worlds of big data and machine learning, which require enormous amounts of computing power to work through large data stores.
Spark also takes some of the programming burden of these tasks off the shoulders of developers, with an easy-to-use API that hides much of the low-level work of distributed computing and data processing.

Why Spark for Big Data?

Hadoop MapReduce is a programming model for processing big datasets with a parallel, distributed algorithm. Developers can write massively parallelized operators without having to worry about work distribution and fault tolerance.

However, an issue with MapReduce is the sequential multi-step process it takes to execute a job. At every step, MapReduce reads data from the cluster, executes operations, and writes the results back to HDFS. MapReduce jobs are slower because every step requires a disk read and a disk write, so the job pays the latency of disk I/O again and again.

Spark was created to address the limitations of MapReduce by processing data in memory, reducing the number of steps in a job, and reusing data across multiple parallel operations.
With Spark, only one step is needed: data is read into memory, the operations are performed, and the results are written back, which results in much faster execution.
Spark also reuses data by using an in-memory cache to greatly speed up machine learning algorithms that repeatedly call a function on the same dataset.
Data reuse is accomplished through the creation of DataFrames, an abstraction over the Resilient Distributed Dataset (RDD), which is a collection of objects that is cached in memory and reused in multiple Spark operations.
This dramatically lowers latency, making Spark several times faster than MapReduce, especially for machine learning and interactive analytics.
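
The difference in disk traffic can be made concrete with a deliberately simplified counting model. The sketch below is plain Python, not Spark or Hadoop code; the function name disk_round_trips is hypothetical, and the model assumes one disk read and one disk write per MapReduce step while ignoring shuffles, spills, and cache eviction.

```python
def disk_round_trips(num_steps: int, in_memory: bool) -> int:
    """Count disk reads plus disk writes for a job made of `num_steps` steps.

    Simplifying assumption: a MapReduce-style job reads its input from disk
    and writes its output back to disk at every step, while an in-memory job
    reads the input once and writes the final result once.
    """
    if num_steps < 1:
        raise ValueError("a job needs at least one step")
    if in_memory:
        # One initial read, one final write; intermediate results stay in RAM.
        return 2
    # One read and one write for each step.
    return 2 * num_steps


# Compare the two styles for jobs of increasing length.
for steps in (1, 5, 10):
    mapreduce_style = disk_round_trips(steps, in_memory=False)
    spark_style = disk_round_trips(steps, in_memory=True)
    print(f"{steps} steps: {mapreduce_style} disk operations vs {spark_style}")
```

Under these assumptions the gap grows linearly with the number of steps, which is why iterative workloads such as machine learning tend to benefit the most.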

Features of Apache Spark

Fast processing: The most important feature of Apache Spark, and the one that has made the big data world choose it over other technologies, is its speed. Big data, characterized by the 6 Vs, needs to be processed at high speed. Spark's Resilient Distributed Datasets save time on read and write operations, allowing it to run roughly ten to one hundred times faster than Hadoop MapReduce on some workloads.
Flexibility: Apache Spark supports multiple languages and allows developers to write applications in Java, Scala, R, or Python.
In-memory computing: Spark stores data in the RAM of the servers, which allows quick access and in turn speeds up analytics.
Real-time processing: Spark can process real-time streaming data. Unlike MapReduce, which processes only stored data, Spark can process data as it arrives and can therefore produce near-instant results.
Better analytics: In contrast to MapReduce, which offers only Map and Reduce functions, Spark offers much more, including SQL queries, machine learning algorithms, and complex analytics. With all these capabilities, analytics can be performed more effectively with Spark.

Apache Spark Architecture

Apache Spark Core: Spark Core is the underlying general execution engine for the Spark platform that all other functionality is built upon. It provides in-memory computing and the ability to reference datasets in external storage systems.
Spark SQL: Spark SQL is Apache Spark’s module for working with structured data. The interfaces offered by Spark SQL give Spark more information about the structure of both the data and the computation being performed.
Spark Streaming: This component allows Spark to process real-time streaming data. Data can be ingested from many sources, such as Kafka, Flume, and HDFS (Hadoop Distributed File System). The data can then be processed using complex algorithms and pushed out to file systems, databases, and live dashboards.
MLlib (Machine Learning Library): Apache Spark is equipped with a rich library known as MLlib. This library contains a wide array of machine learning algorithms, including classification, regression, clustering, and collaborative filtering. It also includes tools for constructing, evaluating, and tuning ML Pipelines. All of this functionality is designed to scale out across a cluster.
GraphX: Spark also comes with a library called GraphX for manipulating graphs and performing graph computations. GraphX unifies the ETL (Extract, Transform, and Load) process, exploratory analysis, and iterative graph computation within a single system.

Apache Spark Cluster Manager

A cluster is a set of tightly or loosely coupled computers connected through a LAN (local area network).
The computers inside the cluster are commonly referred to as nodes. Each node in the cluster can have its own hardware and operating system, or the nodes can share the same configuration.
Resource (node) management and task execution on the nodes are handled by software known as the cluster manager.

Apache Spark is a cluster-computing framework on which applications run as independent sets of processes.
A Spark cluster consists of master nodes and worker nodes, and the role of the cluster manager is to manage resources across the nodes for better performance.
A user creates a Spark context, which connects to whichever type of cluster manager is configured, such as YARN or Mesos.

Features of Cluster Manager:

Each Apache Spark application gets its own executor processes, which run its tasks and stay up for the duration of the application.
Apache Spark can easily run on other cluster managers, such as YARN and Mesos, that support other applications as well.
The driver program must listen for and accept incoming connections from its executors throughout its lifetime.
The driver program schedules tasks on the cluster, so it should run close to the worker nodes, preferably on the same local area network.

Apache Spark mainly supports three types of cluster managers:

Standalone
Apache Mesos
Hadoop YARN

Apache Spark Standalone Cluster

Apache Spark's standalone mode provides a simple cluster configuration with a master node and worker nodes, and can be used for testing purposes.
We can start the master node and the worker nodes manually.
We can start the Spark master node using the following command:

./sbin/start-master.sh

Once the master node has started, it prints its own URL in the form spark://HOST:PORT, which we can use to connect worker nodes to it.
We can also open http://localhost:8080 to reach the Spark master web interface.
Apart from this, we can start a Spark worker node and connect it to the master node using the following command, where master-spark-URL is the URL printed by the master (in newer Spark releases the script is named start-worker.sh):

./sbin/start-slave.sh master-spark-URL

Apache Mesos

Apache Spark works well on a cluster of nodes managed by Apache Mesos.
An Apache Mesos cluster is made up of Mesos master nodes and Mesos agent nodes.
The Mesos master manages the agent daemons running on the nodes, and Mesos frameworks run tasks on those agents.
A framework that runs on Apache Mesos has two components, a scheduler and an executor. The scheduler registers with the master node and is responsible for managing resources, and the executor process is responsible for running the framework's tasks.
The following are the advantages of deploying Apache Spark on Apache Mesos:
- Dynamic partitioning between Apache Spark and other frameworks.
- Scalable partitioning between multiple instances of Spark.

Apache Hadoop YARN

Apache Spark can be deployed on the Hadoop YARN resource manager as well.
When an application runs on YARN, each Spark executor runs as a YARN container. MapReduce, by contrast, schedules a container and starts a JVM for each task.
Spark achieves faster performance by hosting multiple tasks in the same container.
Spark on YARN has the following two deployment modes.
- Cluster Deployment Mode: In cluster deployment mode, the driver program runs inside the Application Master, which is responsible both for driving the application and for requesting resources from YARN.
- Client Deployment Mode: In client deployment mode, the driver program runs on the host where the job is submitted. In this case, the Application Master is present only to request executor containers from YARN. Once those containers start, the client communicates with them to schedule work.

Resilient Distributed Dataset (RDD)

RDD (Resilient Distributed Dataset) is the fundamental data structure of Apache Spark: an immutable collection of objects that is computed on different nodes of the cluster.
In Spark, almost everything you do revolves around RDDs. The dataset in a Spark RDD is divided into logical partitions.
If the data is logically partitioned within RDD, it is possible to send different pieces of data across different nodes of the cluster for distributed computing.
RDD helps Spark to achieve efficient data processing.
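
As a rough illustration of logical partitioning, the sketch below splits key-value records into partitions by hashing each key. It is plain Python rather than Spark code, the function name hash_partition and the sample records are hypothetical, and integer keys are used so that the assignment is the same on every run.

```python
from collections import defaultdict


def hash_partition(records, num_partitions):
    """Assign each (key, value) record to a partition by hashing its key."""
    partitions = defaultdict(list)
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))
    # Return the partitions as a list so that empty ones are kept too.
    return [partitions[index] for index in range(num_partitions)]


# Ten hypothetical records keyed by an integer user id.
records = [(user_id, f"event-{user_id}") for user_id in range(10)]
parts = hash_partition(records, num_partitions=3)

for index, part in enumerate(parts):
    # In a real cluster, each partition could be processed on a different node.
    print(index, [key for key, _ in part])
```

Records that share a key always land in the same partition, which is what lets per-key operations run on one node without consulting the others.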

Features of an RDD in Spark

Resilience: RDDs track data lineage information so that lost data can be recovered automatically on failure. This is also called fault tolerance.
Distributed: Data present in an RDD resides on multiple nodes. It is distributed across different nodes of a cluster.
Lazy evaluation: Data does not get loaded into an RDD at the moment we define it. Transformations are computed only when we call an action, such as count or collect, or save the output to a file system.
Immutability: Data stored in an RDD is read-only, so we cannot edit the data present in the RDD. However, we can create new RDDs by performing transformations on existing RDDs.
In-memory computation: An RDD stores any intermediate data that is generated in memory (RAM) rather than on disk, so that it provides faster access.
Partitioning: Any existing RDD can be divided into logical partitions, which are themselves immutable. We can change the partitioning by applying transformations, which produce a new RDD.
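
Lineage, lazy evaluation, and immutability can be sketched together in a few lines. The class below is a toy, single-machine stand-in written in plain Python; ToyRDD and its methods are hypothetical names for illustration and are not Spark's implementation or API.

```python
class ToyRDD:
    """A toy stand-in for an RDD: it records transformations instead of running them."""

    def __init__(self, source, lineage=()):
        self._source = source            # the original input, never modified
        self._lineage = tuple(lineage)   # recorded transformations, in order

    def map(self, fn):
        # Transformation: return a NEW ToyRDD with a longer lineage; compute nothing.
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        # Transformation: same idea, recorded as a filter step.
        return ToyRDD(self._source, self._lineage + (("filter", predicate),))

    def collect(self):
        # Action: replay the recorded lineage over the source data.
        data = list(self._source)
        for kind, fn in self._lineage:
            if kind == "map":
                data = [fn(item) for item in data]
            else:
                data = [item for item in data if fn(item)]
        return data


calls = []  # records every time square() actually runs


def square(x):
    calls.append(x)
    return x * x


numbers = ToyRDD([1, 2, 3, 4])
squares = numbers.map(square)                         # nothing runs yet
even_squares = squares.filter(lambda x: x % 2 == 0)   # still nothing runs
calls_before_action = len(calls)

result = even_squares.collect()  # the action triggers the computation
```

Because every ToyRDD keeps its source and its lineage, a lost result can be rebuilt by calling collect() again, which is a simplified picture of how lineage gives fault tolerance.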

Operations of RDD

Two kinds of operations can be applied to an RDD: transformations and actions.

Transformations

Transformations are the operations that you perform on an RDD to get a result that is also an RDD.
Examples include functions such as filter(), union(), map(), flatMap(), distinct(), reduceByKey(), mapPartitions(), and sortBy(), each of which creates a new resultant RDD.
Transformations are evaluated lazily: the resulting RDD is not computed until an action needs it.

Actions

Actions kick off a computation and either return the results to the driver program or write them to storage.
Some examples are count(), first(), collect(), take(), countByKey(), collectAsMap(), and reduce().
Transformations always return an RDD, whereas actions return some other data type.
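
To make the split between transformations and actions concrete, here is a word count written with plain-Python stand-ins. This is not the Spark API: generators play the role of lazy transformations, the helper reduce_by_key is a hypothetical name that mimics what reduceByKey() does, and building the final dictionary plays the role of an action such as collectAsMap().

```python
from itertools import chain
from operator import add


def reduce_by_key(pairs, fn):
    """Lazy stand-in for reduceByKey(): merge the values that share a key."""
    merged = {}
    for key, value in pairs:
        merged[key] = fn(merged[key], value) if key in merged else value
    yield from merged.items()


lines = ["spark is fast", "spark is lazy"]

# flatMap-like step: split every line and flatten into one stream of words.
words = chain.from_iterable(line.split() for line in lines)
# map-like step: pair each word with the count 1.
pairs = ((word, 1) for word in words)
# reduceByKey-like step: still lazy, because nothing has been iterated yet.
summed = reduce_by_key(pairs, add)

# Action-like step: materializing the result finally runs the whole pipeline.
counts = dict(summed)
print(counts)
```

The first three steps only describe the computation, and the work happens when the dictionary is built, which mirrors how Spark defers transformations until an action is called.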

When to use RDD?

When we want low-level transformations and actions, and fine-grained control over our dataset.
When our data is unstructured, such as media streams or streams of text.
If we want to manipulate our data with functional programming constructs instead of domain-specific expressions.
If we don’t care about imposing a schema, such as a columnar format, while processing, or about accessing data attributes by name or column.
If we can forgo some of the optimization and performance benefits available with DataFrames and Datasets for structured and semi-structured data.

Limitations of RDD:

i. No inbuilt optimization engine: When working with structured data, RDDs cannot take advantage of Spark’s advanced optimizers, including the Catalyst optimizer and the Tungsten execution engine. Developers need to optimize each RDD based on its attributes.

ii. Handling structured data: Unlike DataFrames and Datasets, RDDs don’t infer the schema of the ingested data and require the user to specify it.

iii. Performance limitation: Being in-memory JVM objects, RDDs involve the overhead of garbage collection and Java serialization, which become expensive as the data grows.

iv. Storage limitation: RDD performance degrades when there is not enough memory to store them. Partitions of an RDD that do not fit in RAM can be stored on disk instead, in which case performance becomes similar to that of current data-parallel systems.

Conclusion

This sums up a very high-level overview of Spark and why it can be favored over other popular frameworks such as Apache Hadoop.

What’s Next?

That’s it for the theoretical part. In the next article, we will focus on a hands-on implementation of Spark with the help of PySpark, Spark's Python API.
