Well, very nice explanation by the author again! I'm pasting the screenshots here so we can refer to them when needed.

image.png

image.png

Step Snap 1: [Understanding Spark Reshuffling]

Imagine a busy logistics center: goods arrive from different warehouses (the original data distribution), are sorted by destination (the reshuffling process), and are finally delivered to their corresponding locations (data redistribution). This is the core idea of Spark Reshuffling.

🌟 Why Reshuffling?

Think of a large library organizing books:

# Example of when reshuffling occurs
df = spark.createDataFrame(
    [("Physics", 1), ("Chemistry", 2), ("Physics", 3)],
    ["subject", "score"],  # column names so groupBy("subject") resolves
)
df.groupBy("subject").sum("score")  # Wide transformation -- triggers reshuffling

🔄 How Reshuffling Works

  1. Shuffle Write Phase (The Sorting Stage)
# Configure shuffle write behavior
spark.conf.set("spark.shuffle.file.buffer", "32k")      # per-task shuffle write buffer
spark.conf.set("spark.shuffle.spill.compress", "true")  # string value; bare `true` is a NameError in Python
  2. Shuffle Read Phase (The Collection Stage)
# Example of a shuffle operation with custom partitioning
from pyspark.sql.functions import col

(df.repartition(col("category"))   # wrap the chain in parentheses for multi-line syntax
   .write.mode("overwrite")
   .parquet("output_path"))

🚨 Common Challenges

  1. Data Skew (The Rush Hour Problem)
# Handle data skew with salting
from pyspark.sql.functions import col, concat, lit, rand, floor

df = (df.withColumn("salt", floor(rand() * 10).cast("int"))     # integer salt in 0..9
        .withColumn("key_salted", concat(col("key"), lit("_"), col("salt"))))