Week 1 · 70-minute reading · 5 widgets · 15-question quiz

Cluster Mode Overview

How Spark works under the hood: drivers, executors, cluster managers, and how your code becomes work on the cluster.

1. Why Spark Exists

The problem

Imagine you have a file that is 2 terabytes (2,000 GB). You want to count how many rows in it contain the word "error". Your laptop has 16 GB of RAM. There is no way to load the whole file into memory. Even if you read it line by line from disk, on a single machine it would take many hours, possibly days.

Now imagine you have a thousand machines instead of one. If each machine processes 2 GB of the file in parallel, the work that would have taken a day can finish in minutes. This is the idea behind distributed computing: take one big problem, split it into many small problems, run them at the same time on many machines, then combine the results.

The hard part is not the idea. The hard part is doing it correctly. When you run code on a thousand machines, you have to worry about all of this:

How do you split the data fairly?
What happens when one machine crashes mid-way?
What if one machine is slow and holds up everyone else?
How do you collect the results back without one machine drowning in too much data?
How do you let people write normal code instead of distributed-systems code?

What Spark is

Apache Spark is a framework that handles all of the above for you. You write code that looks almost like normal Python (or Scala, Java, R). Spark figures out how to split the work across many machines, runs it, recovers from failures, and gives you the results. It does this through a driver that plans the work and a fleet of executors that do the work.

That is the entire premise. Everything else you will learn is a detail of how this is done well.

A useful mental model

Think of a Spark cluster as one giant computer. Instead of having 8 CPU cores and 16 GB of RAM like your laptop, this giant computer has, say, 800 CPU cores and 3,200 GB of RAM, scattered across many physical machines. Spark's job is to make that pile of separate machines act like one machine when you write code. This mental model is not perfectly accurate, but it is useful. The cracks in the model are exactly where the interesting performance problems live.

2. The Cluster from a Bird's Eye View

Every Spark application has three main players. They live on different machines, they talk over the network, and they have very specific jobs. If you understand who they are and what they do, you understand 80% of how Spark works.

Here is how a Spark application starts up:

You run your code. A driver process starts.
The driver asks the cluster manager for some resources (CPUs and memory).
The cluster manager finds free machines and launches executor processes on them.
The driver now talks directly to its executors. It sends them tasks. They run the tasks and send results back.

Notice that once the executors are running, the cluster manager mostly steps out of the picture. It does not orchestrate the work. It only gave you the resources. The actual coordination — sending tasks, collecting results, retrying on failure — happens directly between the driver and the executors.

3. The Driver Program (The Boss)

The driver is the brain of a Spark application. There is always exactly one driver per application. If the driver dies, the application dies. Everything starts here and ends here.

What lives inside the driver

The driver is a JVM process (yes, even when you write PySpark — there is a Java Virtual Machine running underneath, and your Python code talks to it). Inside the driver, you will find:

Your code. The whole PySpark script you wrote — every filter, every groupBy, every variable you defined. It lives in the driver's memory.
The SparkContext (and SparkSession). This is the object that represents your connection to the cluster.
The DAG Scheduler. Turns your query into a logical plan, then a physical plan, then a graph of stages.
The Task Scheduler. Takes the stages from the DAG scheduler, breaks them into tasks, and sends them to the executors.
Status of every executor. The driver keeps track of which executors are alive, what tasks they are running, and which ones have failed.

A common pitfall

Because the driver runs your code, anything you do that is not a Spark operation runs only on the driver. This can surprise people. For example:

# This runs ON THE DRIVER, not on the cluster
rows = df.collect()   # bring all rows to the driver
for row in rows:      # loop on the driver
    do_something(row) # this is plain Python, not Spark

If the DataFrame has 100 million rows, collect() will try to load all 100 million rows into the driver's memory. The driver will run out of memory and your application will crash.

4. The Cluster Manager (The HR Department)

The cluster manager has one job: hand out resources. It is like the HR department of a company. The HR department does not write code or run meetings — it just decides who gets a desk, who gets a laptop, and who gets hired. Once you have your resources, you do your own work.

The three cluster managers you should know

Standalone — Built into Spark itself. Easy to set up. Good for small clusters that run only Spark. What people use for learning.
YARN — Hadoop's resource manager. Common at large enterprises with on-premise data. Runs many kinds of applications, enforces resource quotas, manages large clusters.
Kubernetes — A generic container orchestration system. Increasingly the standard for new Spark deployments, especially in the cloud. Executors and driver run as containers.

The reason Spark has this clean split — driver does the planning, cluster manager hands out resources, executors do the work — is that it lets you change one piece without changing the others. You can take a PySpark script that runs on a laptop with the Standalone manager and run the exact same script on a 1,000-node YARN cluster, and you do not change a single line of code.

Interactive

Cluster resource simulator

Adjust executors, cores per executor, and partitions to see how parallelism actually emerges. Watch what happens when partitions don't match cluster size.

Executors4

132

Cores per executor4

116

Partitions in stage64

1400

Avg task duration1000ms

505000

Total slots

4 × 4

Tasks in parallel

Waves of tasks

Est. stage time

4.0s

Even distribution — 100% slot utilization across 4 waves.

Visualization · slots per wave

Wave 1

Wave 2

Wave 3

Wave 4

task running slot idle

5. Executors (The Workers)

Executors are where the actual data processing happens. They are JVM processes that run on worker nodes in the cluster. A typical Spark application will have many executors — anywhere from 2 to 2,000 of them, depending on how big your cluster is.

Each executor has two jobs and only two jobs:

Run tasks that the driver sends it.
Store data in memory (or on disk) for caching and for shuffle operations.

That is it. An executor does not know about other executors. It does not plan work. It does not decide what to do next. It just takes orders from the driver and executes them.

Cores and slots inside an executor

Each executor has a fixed number of CPU cores assigned to it. Each core can run one task at a time. So an executor with 4 cores can run 4 tasks in parallel. If you have 10 executors with 4 cores each, your application can run up to 40 tasks at the same time. This is your application's level of parallelism.

max parallel tasks = (number of executors) × (cores per executor)

In configuration parameters you will see spark.executor.instances (the number of executors) and spark.executor.cores (cores per executor). These are two of the most important tuning knobs in Spark.

Worker node vs. executor

These two terms get mixed up a lot. Let us be precise. A worker node is a physical (or virtual) machine in the cluster. It has a hostname, an IP address, some number of CPU cores, and some amount of RAM. An executor is a process — a running JVM — that lives on a worker node. One worker node can have multiple executors, or just one, depending on configuration. Think of a worker node as a building, and executors as rooms inside that building.

6. The Work Hierarchy — Application, Job, Stage, Task

Now that you know who the players are, it is time to understand the structure of the work itself. Spark uses four levels of granularity, from the biggest unit to the smallest: Application, Job, Stage, Task.

Application

An application is your whole PySpark program. It has one driver, one or more executors, and one SparkSession. When your script ends (or you call spark.stop()), the application is over and the executors are released. Each application gets its own private executors. Two applications do not share executors, even on the same cluster.

Job

A job is the work triggered by a single action. Spark is lazy: transformations likefilter,select,groupBydo not actually run anything. They just build up a plan. The plan is only executed when you call an action.

Common actions:

.count() — count the rows
.collect() — bring all rows to the driver
.show() — print the first 20 rows
.take(n) — bring n rows to the driver
.write.parquet(path) — write data to storage
.toPandas() — convert to a pandas DataFrame on the driver

Stage

A job is split into one or more stages. A stage is a group of operations that can be done without any data movement between executors. The boundary between stages is always a shuffle — that point in the plan where data has to be reshuffled across the network to continue.

df.filter(...)              # narrow — same stage
  .select(...)              # narrow — same stage
  .groupBy("country")       # WIDE — stage boundary!
  .count()                  # part of next stage
  .write.parquet(...)

Task

A stage is split into tasks. There is one task per partition of data. If the data in a stage has 200 partitions, that stage will have 200 tasks. Each task runs on one executor, processes one partition of data, and produces output for that partition. Tasks are the smallest unit of work in Spark — they are what gets serialized and sent over the network, what fails or succeeds, what gets retried.

1 application = N jobs
1 job         = M stages
1 stage       = K tasks
1 task        = work on 1 partition

Interactive

Application → Job → Stage → Task

Click each level to see what it means, how it's bounded, and where it lives in the Spark UI.

Stage

A group of operations that can run end-to-end without a shuffle. The boundary between stages is always a shuffle — every wide transformation creates a new stage.

Example

df.filter(...)        # Stage 0
  .select(...)        # still Stage 0
  .groupBy('country')   # SHUFFLE → Stage 1
  .count()

By the numbers

Stage 0 has K tasks (one per input partition) · Stage 1 has spark.sql.shuffle.partitions tasks (default 200)

7. Why Stages Get Split (Narrow vs Wide Dependencies)

A stage boundary is always a shuffle. But what is a shuffle, and why does it create a boundary? The answer comes down to how each operation depends on its input data.

Narrow dependency

An operation has a narrow dependency on its input if each output partition only needs data from one input partition. The output is computed by looking at one partition at a time, locally on the executor that already has that partition.

Examples: filter, map / select / withColumn, union, drop. Narrow operations are cheap — no data moves between executors. Multiple narrow operations get fused together into a single stage.

Wide dependency

An operation has a wide dependency if an output partition might need data from many — possibly all — input partitions. To compute the output, data has to be reshuffled across the network.

Examples: groupBy, join, distinct, orderBy, repartition, window functions over a partition. Wide operations are expensive. They cause a shuffle: writing intermediate data to disk, transferring it over the network, and reading it back on another executor.

Interactive

Build a pipeline · watch stages form at shuffle boundaries

Click operations from the palette to add them to your pipeline. Wide operations create a stage boundary; narrow operations stay in the same stage.

Operation palette

Your pipeline · 2 stages, 1 shuffle

Stage 0

filter()select()

Shuffle → data crosses the network

Stage 1

groupBy()withColumn()

narrow (no shuffle)wide (creates a stage boundary)

8. Transformations vs Actions (Lazy Evaluation)

Spark operations come in two flavors: transformations and actions. They behave very differently.

Transformations are lazy

A transformation describes how to derive a new DataFrame from an existing one. When you call a transformation, nothing actually happens. No data is read. No work is done on the executors. Spark only updates its internal plan.

df = spark.read.csv("s3://big/file.csv", header=True)  # nothing happens
df2 = df.filter(df.country == "LK")                    # nothing happens
df3 = df2.groupBy("city").count()                      # nothing happens
df4 = df3.orderBy(df3["count"].desc())                 # nothing happens

# At this point, NOT A SINGLE BYTE HAS BEEN READ
df4.show()   # NOW it runs

Until that .show() at the end, your script is just building up a recipe. The recipe is not cooked yet.

Actions trigger execution

An action is an operation that needs an actual result. When you call an action, Spark looks at the recipe you have built up, makes a plan, and executes it. Every action you call triggers a new job.

Why lazy evaluation is a big deal

Because Spark sees your entire plan before it runs anything, it can make global optimizations a step-by-step engine could not. For example:

Filter after a join? Spark might rewrite the plan to filter before the join, so less data flows through the join.
Select 3 out of 100 columns at the end? Spark pushes that selection all the way down to the file reader, so only those 3 columns are read from disk.
Two transformations that could be fused into one loop? Spark fuses them.

All of this is done by a component called the Catalyst optimizer. The plan you write is not the plan that gets executed.

Interactive

Lazy evaluation timeline · step through the code

Walk through each line. Watch the plan grow without doing any work until the action at the end.

Step 1 of 5

Code so far

df = spark.read.csv('sales.csv', header=True)
df2 = df.filter(df.country == 'LK')
df3 = df2.groupBy('city').sum('amount')
df4 = df3.orderBy('sum(amount)', ascending=False)
df4.show()

What Spark just did

Transformation — recorded only

Records 'read sales.csv' in the plan. No bytes touched.

Transformations recorded

Actions fired

Bytes processed

9. How Code Becomes Work on the Cluster

Let us put everything together. You write PySpark code. What exactly happens between you pressing Enter and the results appearing?

You write PySpark code. Spark updates its internal plan (the logical plan). No execution yet.
You call an action. The logical plan is analyzed, optimized by Catalyst, and turned into a physical plan.
Spark builds a DAG — a directed acyclic graph of operations.
Spark splits the DAG into stages. It cuts at every shuffle boundary.
Spark splits each stage into tasks. One task per partition. Number of partitions depends on the input (first stage) or on spark.sql.shuffle.partitions (default 200) after a shuffle.
The driver sends tasks to executors. Each task is serialized and shipped over the network.
Executors run the tasks and either write output to disk (for a shuffle) or send results to the driver (for actions like collect, show).
The driver receives the results. It prints them, returns them as a Python object, or coordinates writing them to storage.

10. Deploy Modes — Client vs Cluster

When you submit a Spark application, you have to decide where the driver should run. There are two choices: client mode and cluster mode. The choice does not affect what your code does, but it affects how the application is structured and how reliable it is in production.

Client mode

In client mode, the driver runs on the machine where you launched the application.

You see driver logs and output in your terminal as the job runs.
You can interact with the driver (notebooks, REPL, debugging).
If your laptop disconnects, sleeps, or crashes, the driver dies — and the application dies with it.
Network traffic between driver and executors crosses whatever network is between your laptop and the cluster — potentially slow.

Cluster mode

In cluster mode, when you run spark-submit, the cluster manager picks a machine inside the cluster and launches the driver there. Your laptop submits the application and then exits. The application continues running on the cluster.

Driver logs are written on the cluster — you retrieve them via YARN UI / Kubernetes pod logs.
You cannot interactively use the driver — there is no REPL.
If your laptop dies, the application keeps running. This is the big win.
Driver-to-executor traffic stays inside the cluster — fast.

Setting the mode with spark-submit

# Client mode (default in most setups)
spark-submit --deploy-mode client --master yarn my_app.py

# Cluster mode
spark-submit --deploy-mode cluster --master yarn my_app.py

The --master flag tells Spark which cluster manager to use: local[*], spark://host:port, yarn, or k8s://....

Interactive

Client mode vs cluster mode · pick a scenario

The choice doesn't change what your code does — it changes where the driver lives. Pick a scenario to see how each mode handles it.

Scenario

You close your laptop mid-job. What happens?

Client mode

Driver runs on the machine that launched spark-submit.

Driver was running on your laptop — it dies with the lid. The whole application dies. All progress lost.

Cluster mode

Better fit

Driver runs on a node inside the cluster.

Driver runs on a cluster node. Your laptop was just the submitter and exited after submit. Job keeps running, you can re-check later.

Rule of thumb: client mode for development / notebooks; cluster mode for production batch jobs and scheduled pipelines.

11. Lifecycle of a Spark Application

From the moment you press Enter on spark-submit to the moment your application exits, there are six moments worth knowing.

spark-submit runs. A driver process is created. In client mode this happens on your machine. In cluster mode, the cluster manager starts the driver inside the cluster.
SparkContext (and SparkSession) is created. The driver opens a connection to the cluster manager.
Executors are allocated. The cluster manager launches executor processes on worker nodes. They register with the driver.
Tasks run on executors. Each action triggers a job; the driver plans it, splits it into stages and tasks, and sends them to executors. This is where most of the runtime is spent.
Results flow back. Rows come back (for collect, show, take) or files are written to storage.
SparkContext stops. Your script ends or calls spark.stop(). The cluster manager shuts down the executors. The driver exits. The application is over.

12. Common Misunderstandings

Claim

“Spark is fast because it runs in memory”

Truth

Half true. Spark also spills to disk during shuffles. The bigger reason Spark is fast is whole-stage code generation (Catalyst compiles many operations into one tight loop) and intelligent planning. 'In-memory' is a marketing slogan more than a complete explanation.

Claim

“More executors are always faster”

Truth

No. Adding executors helps until you run out of partitions to give them, or until the shuffle overhead grows faster than the benefit. With 4 partitions of data, 100 executors is useless — 96 of them sit idle. There is a sweet spot for every job.

Claim

“Spark and Hadoop are the same thing”

Truth

No. Hadoop is an ecosystem (HDFS + YARN + MapReduce). Spark is just a computation framework. Spark can run on YARN and read from HDFS, but it does not depend on either. You can run Spark on Kubernetes against files on S3 with no Hadoop in sight.

Claim

“DataFrames are slower than RDDs because they're a higher-level API”

Truth

Backwards. DataFrames are usually faster than RDDs, often dramatically so. They go through Catalyst (which optimizes) and Tungsten (which generates efficient bytecode). RDDs are opaque to Spark — black-box functions — so Spark cannot optimize them.

Claim

“A larger driver makes things faster”

Truth

Usually no. The driver does not do data processing — it plans and coordinates. A bigger driver helps only when you have very wide queries, very large plans, or you call collect() on too much data. For most apps, default driver memory is fine; give more memory to executors.

Self-check quiz

Score 80% to unlock the next week

15 questions, multiple choice. You can retake as many times as you like. Wrong answers come back with a link to the exact section of the reading that answers them.

You'll need to enter your email at the top of the page before submitting.