Week 1 · Day 3 · 90-minute reading · 4 widgets · 15-question quiz

Shuffles, Partitioning & Persistence

The performance trio: what really happens during a shuffle, how partitioning controls parallelism, and when caching actually helps.

Locked

Pass Day 2 to unlock this.

Each day of the study path opens after you score 80% or higher on the previous day's quiz. It's not gatekeeping — later days build directly on the ones before, and the quiz is the cheapest way to find out whether the foundation is in place.

What you'll cover on Day 3

Once live, Day 3 runs roughly 90 minutes of reading paired with 4 interactive visualizations, followed by a 15-question self-check quiz. The reading is grounded in the official Apache Spark documentation — every claim cites the docs.

Anatomy of a shuffle: map side, disk, network, reduce side
Why shuffles are expensive (network, disk, serialization, GC)
Partitioning depth and the Goldilocks problem
Hash partitioner, range partitioner, custom partitioners
repartition vs coalesce
Data skew — the silent performance killer
Storage levels and choosing the right one
Checkpointing — the bigger hammer

Why this day matters

By the end of Day 3 you'll be able to explain shuffles, partitioning, and persistence confidently: not just describe them, but reason about edge cases, predict performance, and read the Spark UI for the concepts they touch. That's the bar this study path aims for: not memorization, but the kind of working understanding that lets you debug real jobs.