Kumar Chinnakali
4 min read · Dec 9, 2017


The 4 Key Concepts in the Anatomy of an Apache Spark Job!

For the Big Data & Cloud community, Apache Spark is a great fit for workloads such as batch, streaming, real-time, and ad-hoc processing. However, to fine-tune and optimize our Spark applications we need a firm grip on how Spark jobs execute, and especially on four key concepts: the DAG, jobs, stages, and tasks. Hence, I would like to share my learnings with our community.

In general, a Spark application does not do any work until the driver program calls an action; this is Spark's lazy evaluation paradigm. A Spark job is launched for each action, and each job consists of the stages needed to materialize the final RDD for that action. Each stage, in turn, consists of a collection of tasks.
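To make the lazy evaluation concrete, here is a minimal spark-shell sketch (the shell predefines sc; the data and partition counts are illustrative):

```scala
// Transformations are lazy: declaring them does not run anything on the cluster.
val numbers = sc.parallelize(1 to 1000000, 8)   // 8 partitions
val squares = numbers.map(n => n.toLong * n)    // no job launched yet
val evens   = squares.filter(_ % 2 == 0)        // still no job

// The action forces evaluation of the whole lineage above as a single job.
val total = evens.count()
println(s"count = $total")
```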


The DAG:

A DAG is a Directed Acyclic Graph built from the layers of RDD dependencies for each job in a Spark application. The component responsible for this in Spark is the DAG scheduler: it builds the graph of stages for any job that is submitted and determines where each task should run. From that point the task scheduler takes over, honoring the dependencies between partitions.
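A quick way to see the graph Spark will build is RDD.toDebugString, which prints the lineage. A minimal spark-shell sketch (the file path is illustrative):

```scala
// Build a small lineage with both narrow and wide dependencies.
val words  = sc.textFile("README.md").flatMap(_.split("\\s+"))   // illustrative path
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

// toDebugString prints the RDD dependency graph; the indentation levels
// mark the shuffle boundaries that the DAG scheduler turns into stages.
println(counts.toDebugString)
```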

Jobs:

A job is the top-level unit of execution in a Spark application, and it corresponds to an action called by the driver program. Another way to define an action is as something that brings data out of the RDD world of Spark, either back to the driver or out to a stable storage system. The edges of the Spark execution graph are based on the dependencies between partitions created by RDD transformations, and once an action is called the application can no longer add to that graph. The application launches a job that includes exactly those transformations needed to evaluate the final RDD on which the action was called.
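For example, in the spark-shell sketch below (data and output path are illustrative), each action launches its own job, visible as a separate entry in the Spark UI:

```scala
val logs   = sc.parallelize(Seq("INFO ok", "ERROR boom", "INFO ok"))
val errors = logs.filter(_.startsWith("ERROR"))   // transformation only, no job yet

errors.count()                              // action -> job 1 (result back to the driver)
errors.saveAsTextFile("/tmp/errors-out")    // action -> job 2 (data out to stable storage)
```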

Stages:

Transformations are not executed until an action is called, and a single job may include one or many of them. The wide transformations define the breakdown of a job into stages, with each stage boundary corresponding to a shuffle dependency. A new stage begins whenever communication between workers is required; these dependencies are called ShuffleDependencies. The stages associated with one job generally have to be executed in sequence rather than in parallel, although stages can run in parallel when they compute different RDDs that are later combined by a downstream transformation such as a join.
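A minimal spark-shell sketch of a shuffle boundary (data and partition counts are illustrative):

```scala
// Narrow transformation: map stays in the same stage as parallelize.
val pairs = sc.parallelize(1 to 100, 4).map(i => (i % 10, i))

// Wide transformation: reduceByKey introduces a ShuffleDependency,
// so the job that evaluates it is split into two stages.
val sums = pairs.reduceByKey(_ + _)

sums.collect()   // one job, two stages: the map side and the reduce side of the shuffle
```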

Tasks:

Tasks are created by stages and are the smallest unit of execution in a Spark application. Each task represents one local computation: all of the tasks in a stage run the same code, but in parallel on different partitions of the data. The number of tasks per stage corresponds to the number of partitions in the output RDD of that stage.
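A spark-shell sketch of that task-to-partition relationship (partition counts are illustrative):

```scala
val data = sc.parallelize(1 to 1000, 8)   // 8 partitions
println(data.getNumPartitions)            // 8, so a stage over `data` runs 8 tasks

val fewer = data.repartition(2)           // shuffle into 2 partitions
fewer.count()                             // the final stage of this job runs 2 tasks
```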


Hence the total number of executor cores, which caps how many tasks can run concurrently, equals the number of cores per executor multiplied by the number of executors. Distributing tasks across those cores is the job of the TaskScheduler, and its behavior differs depending on whether the fair scheduler or the FIFO scheduler is in use; FIFO is the Spark default.
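As one way to express that arithmetic in code, the sketch below sets the relevant properties on a SparkConf; the values are illustrative, not recommendations (the same knobs can also be passed to spark-submit via --num-executors and --executor-cores):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("scheduling-sketch")
  .set("spark.executor.instances", "5")   // number of executors (illustrative)
  .set("spark.executor.cores", "4")       // cores per executor (illustrative)
  // total concurrent task slots = 5 executors * 4 cores = 20
  .set("spark.scheduler.mode", "FIFO")    // default; set to "FAIR" for fair scheduling
```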

Put simply, one stage can be computed without moving data across partitions, and within a stage the tasks are the units of work done for each partition of the data.

To conclude, Spark offers an innovative, efficient model of parallel computing that centers on lazily evaluated, immutable, distributed datasets known as RDDs. The same task can be accomplished in many different ways using the Spark API, so a strong understanding of how our code is executed will surely help us optimize our Spark application's performance.

Ref. High Performance Spark by Holden Karau and Rachel Warren.

Let us have coffee@dataottam.com.

Please subscribe to the dataottam blog to keep yourself up-to-the-minute on the data life cycle, from inception to intelligence.

Bye bye, see you in the next post. Happy Data!

