allaboutspark
Week 1 · Day 3 · 90-minute reading · 4 widgets · 15-question quiz

Shuffles, Partitioning & Persistence

The performance trio: what really happens during a shuffle, how partitioning controls parallelism, and when caching actually helps.

Locked

Pass Day 2 to unlock this.

Each day of the study path opens after you score 80% or higher on the previous day's quiz. It's not gatekeeping — later days build directly on the ones before, and the quiz is the cheapest way to find out whether the foundation is in place.


What you'll cover on Day 3

Once live, Day 3 runs roughly 90 minutes of reading paired with 4 interactive visualizations, followed by a 15-question self-check quiz. The reading is grounded in the official Apache Spark documentation — every claim cites the docs.

  • Anatomy of a shuffle: map side, disk, network, reduce side
  • Why shuffles are expensive (network, disk, serialization, GC)
  • Partitioning depth and the Goldilocks problem
  • Hash partitioner, range partitioner, custom partitioners
  • repartition vs coalesce
  • Data skew — the silent performance killer
  • Storage levels and choosing the right one
  • Checkpointing — the bigger hammer
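As a small preview of two items on this list (hash partitioning and data skew), here is a minimal sketch in plain Python. It does not use Spark. The helper names `hash_partition` and `partition_sizes` are hypothetical, and it assumes integer keys assigned by a simple modulo, which is a simplification of how a hash partitioner picks a partition.

```python
from collections import Counter

def hash_partition(key: int, num_partitions: int) -> int:
    """Pick a partition for an integer key with a plain modulo.

    Illustrative simplification: a real hash partitioner hashes the key
    first; here the integer key stands in for its own hash.
    """
    return key % num_partitions

def partition_sizes(keys, num_partitions):
    """Count how many records land in each partition."""
    counts = Counter(hash_partition(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]

# 1,000 distinct keys spread evenly across 4 partitions.
uniform_keys = list(range(1000))

# 900 records share one hot key (7); the remaining 100 keys are distinct.
skewed_keys = [7] * 900 + list(range(100))

uniform = partition_sizes(uniform_keys, 4)
skewed = partition_sizes(skewed_keys, 4)

print("uniform:", uniform)
print("skewed: ", skewed)
```

With the skewed input, every record for the hot key lands in the same partition, so one task would do most of the work while the others finish early. Adding partitions does not help here, because a single key always maps to a single partition.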

Why this day matters

By the end of Day 3 you'll be able to explain shuffles, partitioning, and persistence with confidence: not just describe them, but reason about edge cases, predict performance, and read the parts of the Spark UI that show them. That's the bar this study path aims for: not memorization, but the kind of working understanding that lets you debug real jobs.