Joins in Spark

The author's slides cover the reshuffling and broadcast logic in detail. Use them as a reference when checking the ideas below.

Step Snap 1: [Mastering Spark Join Strategies]

Imagine organizing a massive party where guests (data) from different venues (tables) need to meet up (join). How do we make this happen efficiently? Let's explore Spark's join strategies!

🌟 Join Strategy #1: The Big Party Meetup (Joining Two Large Tables)

Think of two large groups from different cities meeting up:

Everyone needs to travel to a common location (Shuffle)
Traffic (data transfer) can get heavy
Need proper planning (partitioning) to avoid chaos

# Example of a standard shuffle join
large_table1.join(large_table2, "common_key")
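
To make the "travel to a common location" step concrete, here is a minimal pure-Python sketch of what a shuffle does conceptually. It is not Spark's implementation: the helper names `shuffle` and `shuffle_join` are hypothetical, rows are plain `(key, value)` tuples, and the per-partition matching uses a dictionary for brevity (Spark would typically sort and merge at that step).

```python
from collections import defaultdict

def shuffle(rows, num_partitions):
    """Route each (key, value) row to a partition chosen by hashing its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in rows:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

def shuffle_join(left, right, num_partitions=4):
    """Inner join two lists of (key, value) rows, one partition pair at a time."""
    left_parts = shuffle(left, num_partitions)
    right_parts = shuffle(right, num_partitions)
    result = []
    for left_part, right_part in zip(left_parts, right_parts):
        # Rows with equal keys always land in the same partition index,
        # so each pair of partitions can be joined independently.
        lookup = defaultdict(list)
        for key, value in right_part:
            lookup[key].append(value)
        for key, value in left_part:
            for other in lookup.get(key, []):
                result.append((key, value, other))
    return result

# Hypothetical sample data: (user_id, item) and (user_id, name)
orders = [(1, "book"), (2, "pen"), (1, "lamp"), (3, "desk")]
users = [(1, "Ana"), (2, "Bo"), (4, "Cy")]
joined = shuffle_join(orders, users)
print(sorted(joined))
```

The heavy "traffic" in the analogy is the `shuffle` call: every row of both tables may have to move to a different partition before any matching can start.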

🌟 Join Strategy #2: The Sorted Line Dance (Sort Merge Join)

Like two lines of dancers arranged by height:

Both lines are sorted by the join key (Spark shuffles and sorts each side itself when the data is not already arranged that way)
Partners can easily find each other by walking forward (sequential matching)
Very efficient once sorted, and cheapest when the data is already partitioned and sorted on the join key

# Prefer sort merge join over shuffled hash join (this setting is already true by default)
spark.conf.set("spark.sql.join.preferSortMergeJoin", "true")
# Join with sorted data
sorted_table1.join(sorted_table2, "join_key")
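
The "walking forward" matching can be sketched in plain Python as a two-pointer merge. This is an illustrative sketch only, not Spark's code: `sort_merge_join` is a hypothetical helper, rows are `(key, value)` tuples, and the sort happens inside the function where Spark would sort each shuffled partition.

```python
def sort_merge_join(left, right):
    """Inner join two lists of (key, value) rows by sorting, then merging."""
    left = sorted(left, key=lambda row: row[0])
    right = sorted(right, key=lambda row: row[0])
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1  # left side is behind, step it forward
        elif left_key > right_key:
            j += 1  # right side is behind, step it forward
        else:
            # Find the run of right rows that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            # Pair every left row holding this key with that run.
            while i < len(left) and left[i][0] == left_key:
                for _, right_value in right[j:j_end]:
                    result.append((left_key, left[i][1], right_value))
                i += 1
            j = j_end
    return result

# Hypothetical sample data with a duplicated key on both sides
left_rows = [(3, "c"), (1, "a"), (2, "b"), (2, "bb")]
right_rows = [(2, "x"), (4, "z"), (2, "y"), (1, "w")]
merged = sort_merge_join(left_rows, right_rows)
print(merged)
```

Neither pointer ever moves backward past a finished key, which is why the merge phase is a single sequential pass over both sorted inputs.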

🌟 Join Strategy #3: The VIP List Strategy (Broadcast Join)

Like having a small VIP list at multiple party entrances:

One small list (table) copied to all security checkpoints (nodes)
Everyone from the main party (large table) can be checked quickly
No need for people to move between entrances (reduced shuffle)

from pyspark.sql.functions import broadcast
# Broadcast the smaller table
large_table.join(broadcast(small_table), "join_key")
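
As a rough illustration of why the VIP list avoids moving the large table, the sketch below builds one lookup from the small table and lets each partition of the large table probe it in place. It is a simplified model under stated assumptions: `broadcast_join` is a hypothetical helper, partitions are plain Python lists, and the single shared dictionary stands in for the copy Spark would send to each executor.

```python
def broadcast_join(large_partitions, small_table):
    """Inner join each partition of a large table against a small lookup."""
    # Build the lookup once from the small table (the "VIP list").
    lookup = {}
    for key, value in small_table:
        lookup.setdefault(key, []).append(value)

    result_partitions = []
    for partition in large_partitions:
        # Each partition probes the lookup where it already lives;
        # rows of the large table never change partition.
        joined = [
            (key, value, match)
            for key, value in partition
            for match in lookup.get(key, [])
        ]
        result_partitions.append(joined)
    return result_partitions

# Hypothetical sample data: two partitions of (country_id, amount) rows
sales_partitions = [[(1, 9.5), (2, 3.0)], [(3, 7.25), (1, 1.0)]]
countries = [(1, "PT"), (2, "JP")]
out = broadcast_join(sales_partitions, countries)
print(out)
```

The output keeps the same number of partitions as the large input, which mirrors the "no need for people to move between entrances" point. In Spark itself, the planner can also choose a broadcast join automatically when the smaller side's estimated size is below `spark.sql.autoBroadcastJoinThreshold`.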