Mastering Apache Spark.
A documentation-driven study path through how Apache Spark really works. Four weeks of long-form, source-grounded readings paired with interactive visualizations you can manipulate to see Spark's internals in action.
One email to unlock everything. It is used to remember your progress across devices.
What this study path is, and isn't
This is not a fast-paced bootcamp, and it is not a video lecture series. It is a serious, documentation-driven written study path — closer to working through a well-written book than watching tutorials. Each week pairs roughly 70–90 minutes of careful reading with four or five interactive visualizations that let you manipulate Spark internals directly: change executor counts and watch parallelism shift, drag transformations into a pipeline and see stages form at shuffle boundaries, kill an executor and watch the lineage walk back to rebuild a lost partition.
Every concept is grounded in the official Apache Spark documentation (spark.apache.org/docs/latest). When the docs say something, that's what you'll read. When the docs are quiet on a topic that matters, you'll see the trade-offs spelled out explicitly. No fluff, no hype, no marketing — just how Spark works.
At the end of each week is a 15-question self-check quiz. Score 80% to unlock the next week. Wrong answers link back to the exact section of the reading that explains the concept — so "getting it wrong" becomes the most efficient way to find the gap in your understanding.
The four weeks
- Week 1 · 70m read · 5 widgets · 15-question quiz
Cluster Mode Overview
How Spark works under the hood: drivers, executors, cluster managers, and how your code becomes work on the cluster.
- The three main players: driver, cluster manager, executors
- Application → Job → Stage → Task hierarchy
- Narrow vs wide dependencies
- Transformations vs actions (lazy evaluation)
- … and 2 more sections
- Week 2 · 80m read · 5 widgets · 15-question quiz
RDDs — The Foundation
What an RDD really is, the five properties, transformations, actions, lineage, and why fault tolerance is essentially free.
- What RDD stands for and why immutability matters
- The five internal properties of every RDD
- Creating RDDs: parallelize, textFile, transformations
- The reduceByKey vs groupByKey performance trap
- … and 2 more sections
- Week 3 · 90m read · 4 widgets · 15-question quiz
Shuffles, Partitioning & Persistence
The performance trio: what really happens during a shuffle, how partitioning controls parallelism, and when caching actually helps.
- Anatomy of a shuffle: map side, disk, network, reduce side
- Why shuffles are expensive (network, disk, serialization, GC)
- Partitioning depth and the Goldilocks problem
- Hash partitioner, range partitioner, custom partitioners
- … and 4 more sections
- Week 4 · 80m read · 4 widgets · 15-question quiz
Shared Variables — Broadcast & Accumulators
The two ways Spark lets the driver and executors share state — and why every other approach silently breaks.
- The closure problem made concrete
- Broadcast variables: what they are and how they work
- Broadcast joins — the killer use case
- Accumulators and the at-least-once trap
- … and 2 more sections
Why interactive visualizations
Most Spark tutorials hand you code. That's useful when you already know what you should expect to see. It's much less useful when you're trying to build the intuition for why a shuffle is expensive, or why one task in a stage holds up everyone else, or what the difference between repartition and coalesce actually looks like inside the cluster.
The widgets in this course are designed to give you that intuition before you ever open a Spark UI. Each one strips away the runtime noise and lets you drive a single concept directly. Move the executors slider, and the parallelism number updates immediately. Raise the partition count, and watch the shuffle file count grow. Mark an operation as wide, and see the stage boundary appear. The point isn't to simulate Spark; it's to make the mental model so concrete that the real Spark UI becomes legible at a glance.
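To make the "mark an operation as wide, see the stage boundary appear" idea concrete, here is a minimal toy sketch in plain Python. It is not Spark code and not how the course widgets are implemented. The `WIDE_OPS` set and the `split_into_stages` helper are hypothetical names invented for illustration, and the model is simplified: it treats each wide operation as the first step of a new stage, whereas real Spark places the shuffle in the middle of such an operation and can skip the shuffle entirely (for example, a join on co-partitioned data).

```python
# Toy model, NOT Spark code: split a linear pipeline into stages,
# starting a new stage at every wide (shuffle-requiring) operation.

# Assumption for illustration: these operations are always treated as wide.
WIDE_OPS = {"groupByKey", "reduceByKey", "sortByKey", "repartition", "distinct"}


def split_into_stages(pipeline):
    """Group consecutive operations into stages.

    Narrow operations are pipelined into the current stage;
    a wide operation closes the current stage and opens a new one.
    """
    stages = [[]]
    for op in pipeline:
        if op in WIDE_OPS and stages[-1]:
            stages.append([])
        stages[-1].append(op)
    return stages


pipeline = ["textFile", "flatMap", "map", "reduceByKey", "filter", "sortByKey", "collect"]
stages = split_into_stages(pipeline)

for index, ops in enumerate(stages):
    print(f"stage {index}: {' -> '.join(ops)}")
```

Under these assumptions the seven-step pipeline falls into three stages, with the boundaries landing at `reduceByKey` and `sortByKey`. Swapping a wide operation for a narrow one (say, `sortByKey` for `map`) removes the corresponding boundary.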
Who this is for
- Data engineers who use Spark daily and want to fill in the gaps that cargo-cult Stack Overflow answers have left in their mental model.
- Software engineers crossing into data who already know how distributed systems work in general but want to understand Spark's specific choices.
- Senior engineers tuning slow jobs who can read a Spark UI but can't always tell which knob to turn first.
- People preparing for data engineering interviews where the conversation drifts to "okay, explain how reduceByKey actually works under the hood."
How the gating works
You unlock the course with an email address. That's the entire payment. The email is used to remember your progress across devices — so you can read on a laptop and quiz on a phone — and to occasionally tell you when a new study path goes live. No spam, unsubscribe anytime, and we don't share the address.
Each week is gated by the previous week's quiz. Score 80% (12 out of 15) and the next week unlocks automatically. Score lower and you get to retake — but the wrong answers come back with section links that point at the exact passages of reading that answer each question. That feedback loop is the whole reason for the gate.