How Apache Spark Works Under the Hood (Explained Simply)

By Pratik Mane, March 24, 2026

Writing and using Spark today has become incredibly accessible. Having said that, there’s a massive difference between writing Spark code that runs and code that scales. It’s tempting to treat Spark like a standard Python script when you're a beginner, but understanding how Spark works under the hood will help you write much more performant code and manage your cloud bills better.

If you are:

You’ve come to the right place.

The Core Components of a Spark Cluster

Before we start looking at how Spark works, it’s important to understand some of the hardware-level concepts.

Diagram showing the Spark Cluster hierarchy including Driver Node, Worker Nodes, Executors, and Cores

Note: Cluster, Node, CPU, and Core are all generic hardware terms and exist physically on the hardware. An Executor, however, is a Spark-specific software concept, meaning the number of executors can be configured right in your Spark code, as shown below:

spark_session.py
from pyspark.sql import SparkSession
 
# Initialize the SparkSession with specific executor configurations
spark = SparkSession.builder \
    .appName("datawarehouse_blog_processing") \
    .config("spark.executor.instances", "4") \
    .config("spark.executor.cores", "4") \
    .config("spark.executor.memory", "8g") \
    .config("spark.driver.memory", "4g") \
    .getOrCreate()

Having a cluster lets Spark split a dataset across machines and process the pieces in parallel, which is where its speed comes from.
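As a rough sketch of what the configuration shown earlier requests, the arithmetic below estimates how many tasks can run at once and how much executor memory is asked for in total. The helper name `cluster_capacity` is hypothetical, and the calculation assumes Spark's default of one CPU per task (`spark.task.cpus` = 1) and ignores memory overhead.

```python
def cluster_capacity(executor_instances: int, executor_cores: int,
                     executor_memory_gb: int, task_cpus: int = 1) -> dict:
    """Estimate parallel task slots and total executor memory (illustrative only)."""
    slots_per_executor = executor_cores // task_cpus
    return {
        "task_slots": executor_instances * slots_per_executor,
        "executor_memory_gb": executor_instances * executor_memory_gb,
    }


# Values taken from the SparkSession configuration in spark_session.py
capacity = cluster_capacity(executor_instances=4, executor_cores=4, executor_memory_gb=8)
print(capacity)
```

With 4 executors of 4 cores each, up to 16 tasks can run at the same time, so a job with many more partitions than that is processed in several rounds.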

The Classroom Analogy: Distributed Processing

Diagram showing the classroom analogy for spark working

Let's try to understand the distributed processing nature of Spark through a simple classroom analogy.

This is the essence of how Spark works. (Analogies: Instructor = Driver, Students = Executors, Pouch = Partition, Counting = Task).
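The analogy can be sketched in plain Python. This is not Spark code: threads stand in for students (executors), whereas real executors are separate processes that usually run on different machines. The function names and the items being counted are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_pouches(items, num_pouches):
    """Instructor (driver): deal the items into roughly equal pouches (partitions)."""
    return [items[i::num_pouches] for i in range(num_pouches)]


def count_pouch(pouch):
    """Student (executor): count one pouch. This is a single task."""
    return len(pouch)


def classroom_count(items, num_students=4):
    pouches = split_into_pouches(items, num_students)
    # Each student counts a pouch at the same time as the others
    with ThreadPoolExecutor(max_workers=num_students) as students:
        partial_counts = list(students.map(count_pouch, pouches))
    # The instructor only adds up the small partial results
    return sum(partial_counts)


total = classroom_count(list(range(1000)), num_students=4)
print(total)  # 1000
```

The instructor never touches every item, only the handful of partial counts, which is why the driver can stay small while the executors do the heavy lifting.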

Essential Spark Concepts You Need to Know

Key Components of the Spark Processing Engine

The Spark Driver

The driver is the master process that coordinates the workers and oversees the tasks. It splits a Spark application into jobs and schedules them to run on executors in the cluster. The driver program runs the application's main function and creates the Spark context, the gateway that connects the application to the Spark cluster and monitors the jobs running on it. Everything is executed through this context.

Each Spark session wraps a Spark context. Because Spark can connect to different types of cluster managers, the context acquires executors on worker nodes through the cluster manager to run tasks and store data. When an application runs on the cluster, each job is divided into stages, and those stages are broken down into tasks that are scheduled on executors.
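The job, stage, and task hierarchy can be sketched with a toy planner. This is a simplified model, not Spark's scheduler: it assumes a new stage starts at every operation that needs a shuffle (data moving between executors), and that every stage has the same number of partitions. The `Stage` class and both helper functions are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    stage_id: int
    operations: list = field(default_factory=list)


def split_into_stages(operations):
    """Group operations into stages, starting a new stage at each shuffle.

    `operations` is a list of (name, needs_shuffle) tuples.
    """
    stages = [Stage(stage_id=0)]
    for name, needs_shuffle in operations:
        if needs_shuffle and stages[-1].operations:
            stages.append(Stage(stage_id=len(stages)))
        stages[-1].operations.append(name)
    return stages


def tasks_for(stages, num_partitions):
    """One task per partition per stage (assumes the partition count never changes)."""
    return [(s.stage_id, p) for s in stages for p in range(num_partitions)]


job = [("read", False), ("filter", False), ("groupBy", True), ("orderBy", True)]
stages = split_into_stages(job)
for s in stages:
    print(s.stage_id, s.operations)
print(len(tasks_for(stages, num_partitions=8)))  # 3 stages x 8 partitions = 24 tasks
```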

The Spark Executors

The executor is responsible for executing tasks and storing data in cache. Executors register with the driver program when they start. Each executor has a number of slots, typically one per core, for running tasks concurrently. An executor runs a task once it has loaded the data. When dynamic allocation is enabled, executors are added and removed during the run as the workload changes, and idle executors are released. The driver program monitors the executors while they work.
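Adding and removing executors at runtime is controlled by Spark's dynamic allocation settings. The sketch below lists the relevant property names with illustrative values (not recommendations) and renders them as `spark-submit` arguments. The helper `to_spark_submit_flags` is a hypothetical convenience function, not part of Spark.

```python
# Real Spark property names; the values are illustrative, not recommendations
dynamic_allocation_conf = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "8",
    # How long an executor may sit idle before it is released
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
    # Allows dynamic allocation without an external shuffle service (Spark 3.0+)
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
}


def to_spark_submit_flags(conf: dict) -> list:
    """Render a config dict as spark-submit `--conf key=value` arguments."""
    flags = []
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


print(" ".join(to_spark_submit_flags(dynamic_allocation_conf)))
```

The same keys can also be passed through `.config(...)` calls on the `SparkSession` builder, in the style of the configuration shown earlier.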

Worker Nodes

The worker nodes host the executors, which process tasks and return the results to the Spark context. The driver issues tasks, and the executors on the worker nodes execute them. Workers speed up processing by running as many tasks as possible in parallel, since each job is divided into smaller tasks spread across multiple machines. In Spark, a partition is the unit of data that a single task processes, and each task runs on one executor core.
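To illustrate how partitions map onto cores, the sketch below groups partitions into rounds ("waves") of tasks. It is a simplification that assumes every task takes the same time. The real scheduler hands a new task to a core as soon as that core is free, so actual runs do not proceed in strict rounds. The function name is hypothetical.

```python
def schedule_in_waves(num_partitions: int, total_cores: int) -> list:
    """Assign one task per partition to cores, returning the partitions run in each wave."""
    partitions = list(range(num_partitions))
    return [partitions[i:i + total_cores] for i in range(0, num_partitions, total_cores)]


waves = schedule_in_waves(num_partitions=10, total_cores=4)
for n, wave in enumerate(waves, start=1):
    print(f"wave {n}: partitions {wave}")
```

With 10 partitions and 4 cores, the last wave uses only 2 cores. This is one reason partition counts are often chosen as a multiple of the available cores.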

The Cluster Manager

The cluster manager acts as an external service that coordinates resources across the physical machines. It is responsible for acquiring the worker nodes and allocating the required memory and CPU cores for the executors. When the driver program connects to the cluster, it communicates with the cluster manager to request these resources before a job can begin.

The cluster manager oversees the available capacity but does not execute the user's tasks itself. Once it allocates the executors on the worker nodes, the cluster manager steps back, and the driver directly monitors the job. Spark is designed to be pluggable, meaning it can be connected to different types of cluster managers (YARN, Kubernetes, Spark Standalone, or Mesos, which has been deprecated since Spark 3.2). Ultimately, the cluster manager simply ensures the Spark application has the physical hardware it needs to run its tasks concurrently.
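In practice, the choice of cluster manager is expressed through the master URL given to Spark. The sketch below maps the common URL forms to the manager they select. The function name is hypothetical, and the host names in the usage lines are placeholders, not real endpoints.

```python
def cluster_manager_for(master_url: str) -> str:
    """Return which cluster manager a Spark master URL refers to."""
    if master_url.startswith("local"):
        return "local mode (no cluster manager)"
    if master_url == "yarn":
        return "YARN"
    if master_url.startswith("k8s://"):
        return "Kubernetes"
    if master_url.startswith("mesos://"):
        return "Mesos"
    if master_url.startswith("spark://"):
        return "Spark Standalone"
    raise ValueError(f"Unrecognized master URL: {master_url}")


# Placeholder hosts, for illustration only
for url in ["local[*]", "yarn", "k8s://https://example-host:6443", "spark://example-host:7077"]:
    print(url, "->", cluster_manager_for(url))
```

The application code stays the same in every case. Only the master URL and the resource settings change.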

The Execution Flow: How a Job Actually Runs

Diagram showing Key Components of the Spark Processing Engine

Interview Questions you can answer after this

Because understanding the internal architecture separates junior developers from senior engineers, these concepts are heavily tested in data engineering interviews. By mastering this material, you can confidently answer questions like:

TL;DR / Summary

At its core, Apache Spark is a master-slave architecture designed to divide and conquer massive datasets.

The entire process in a nutshell:

  1. You write your code and trigger an Action.
  2. The Driver (the brain) translates your code into a logical execution plan.
  3. The Cluster Manager finds the necessary hardware (Worker Nodes).
  4. The Driver breaks the work into tasks, one per chunk of data (Partition), and assigns them to Executors living on the Worker Nodes.
  5. The Executors process the data in parallel across multiple Cores and return the final result back to the Driver.
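The steps above can be mimicked with a toy class that records transformations and does no work until an action is called. This is a plain-Python illustration, not the Spark API: `LazyDataset` is a hypothetical name, and the partitions are processed one after another here, whereas real executors process them in parallel.

```python
class LazyDataset:
    """Toy stand-in for a distributed dataset: records a plan, runs it only on an action."""

    def __init__(self, partitions, plan=()):
        self.partitions = partitions  # list of lists, one inner list per partition
        self.plan = plan              # recorded transformations, not yet executed

    def map(self, fn):
        return LazyDataset(self.partitions, self.plan + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self.partitions, self.plan + (("filter", fn),))

    def _run_task(self, partition):
        # One task: apply the whole plan to a single partition
        rows = partition
        for kind, fn in self.plan:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        return rows

    def collect(self):
        # Action: only now does any work happen; results return to the "driver"
        results = []
        for partition in self.partitions:
            results.extend(self._run_task(partition))
        return results


data = LazyDataset([[1, 2, 3], [4, 5, 6]])
pipeline = data.map(lambda x: x * 10).filter(lambda x: x > 20)  # nothing runs yet
print(pipeline.collect())  # [30, 40, 50, 60]
```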

Understanding this flow is the first step to mastering distributed data systems.

Pratik Mane

About the Author

Pratik Mane is a Data Architect and Engineer specializing in Azure Data Platform and Databricks. He helps enterprises build scalable, high-performance ETL pipelines.