The code below loads Parquet data, converts the selected columns to an RDD, and counts the records in each partition, revealing a significant imbalance in partition sizes:
df_green = spark.read.parquet('/home/paradx/data/pq/green/*/*')

# 'columns' is a list of column names defined earlier (not shown here)
duration_rdd = df_green \
    .select(columns) \
    .rdd

def apply_model_in_batch(partition):
    # Count the number of rows in this partition
    cnt = 0
    for row in partition:
        cnt = cnt + 1
    return [cnt]

# Returns one count per partition
duration_rdd.mapPartitions(apply_model_in_batch).collect()
Current partition sizes:
[335827, 311259, 311019, 183043, 110517, 110331, 109373, 107836, 107864, 104436, 103052, 97921, 91386, 95395, 89646, 35612]
The data clearly shows partition skew: the largest partition (335,827 records) holds roughly 10x more records than the smallest (35,612), which leads to inefficient processing because some executors work much harder than others while the rest sit idle.
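A quick sanity check of that ratio, using plain Python on the sizes listed above:

partition_sizes = [335827, 311259, 311019, 183043, 110517, 110331, 109373, 107836,
                   107864, 104436, 103052, 97921, 91386, 95395, 89646, 35612]
print(max(partition_sizes) / min(partition_sizes))  # ~9.4, so the largest is close to 10x the smallest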
While repartition() can redistribute data evenly, it has a significant cost: it triggers a full shuffle, so every record may move across the network between executors. For comparison, the sketch below shows that shuffle-heavy approach, followed by lower-cost alternatives.
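A minimal sketch of the full-shuffle approach (the target of 16 partitions simply matches the current partition count and is only illustrative):

# Full shuffle: rows are redistributed round-robin, so partitions end up roughly equal,
# but every row may have to move across the network
repartitioned_df = df_green.repartition(16)
repartitioned_df.rdd.mapPartitions(apply_model_in_batch).collect()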
1. Coalesce Instead of Repartition
coalesce() reduces the partition count by merging existing partitions rather than shuffling all the data:
# Instead of a full shuffle with: df_green.repartition(8)
balanced_df = df_green.coalesce(8)
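To verify the effect (a quick check using the df_green and balanced_df defined above), compare partition counts before and after. Keep in mind that coalesce() only merges existing partitions, so it cannot split the oversized ones, and the merged partitions keep whatever skew their inputs had:

print(df_green.rdd.getNumPartitions())     # 16 partitions before coalescing
print(balanced_df.rdd.getNumPartitions())  # 8 partitions after coalescing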
Benefits: