Apache Spark follows a lazy evaluation model: Transformations define how data should be processed but are not executed right away. Execution happens only when an Action is triggered.
Transformations create a new RDD/DataFrame from an existing one; rather than running immediately, each one simply adds a step to Spark's logical plan.
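For context, the `df` used in the snippets below can be any DataFrame with `name` and `age` columns. A minimal, hypothetical setup (the session name and sample rows are made up purely for illustration):

```python
from pyspark.sql import SparkSession

# Hypothetical setup: a local session and a tiny in-memory DataFrame
# with "name" and "age" columns, so the snippets below are runnable.
spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()
df = spark.createDataFrame(
    [("Alice", 25), ("Bob", 17), ("Carol", 30)],
    ["name", "age"],
)
```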
```python
df_selected = df.select("name", "age")                   # Select specific columns
df_filtered = df_selected.filter(df_selected.age > 18)   # Filter rows
```
🔹 No execution happens yet! Spark just builds a logical execution plan.
📌 Key Transformations in Spark:
| Transformation | Description |
|---|---|
| `select()` | Selects specific columns |
| `filter()` | Filters rows based on a condition |
| `groupBy()` | Groups data for aggregation |
| `map()` | Applies a function to each element (RDD API) |
| `join()` | Joins two DataFrames |
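To make the laziness concrete, here is a sketch that chains several of the transformations above, using the hypothetical `df` from earlier plus an assumed second DataFrame `cities` for the join. Nothing is computed; `explain()` only prints the plan Spark has built so far:

```python
# Assumed second DataFrame, only for the join example
cities = spark.createDataFrame(
    [("Alice", "Paris"), ("Carol", "Berlin")],
    ["name", "city"],
)

adults = df.filter(df.age > 18)            # transformation: nothing runs
per_city = adults.join(cities, on="name")  # transformation: nothing runs
counts = per_city.groupBy("city").count()  # transformation: nothing runs

# map() lives on the RDD API rather than the DataFrame API
upper = df.rdd.map(lambda r: (r.name.upper(), r.age))

counts.explain()  # prints the logical/physical plan; no data is processed
```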
Actions trigger execution of the DAG (Directed Acyclic Graph) and return results.
```python
df_filtered.show()  # Executes the entire pipeline and displays results
```
🔹 Now, Spark actually runs the transformations and computes the result.
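One consequence of this model is that each action re-runs the whole plan from the beginning. When a DataFrame feeds several actions, it is common to cache it first; a minimal sketch:

```python
df_filtered.cache()         # mark for reuse; materialized by the next action
print(df_filtered.count())  # triggers execution and populates the cache
df_filtered.show(5)         # served from the cached data, not recomputed
```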
📌 Common Spark Actions:
| Action | Description |
|---|---|
| `show()` | Displays DataFrame content |
| `collect()` | Brings all data to the driver as a list |
| `count()` | Counts the number of rows |
| `take(n)` | Fetches the first `n` rows |
| `write.format("csv").save("path")` | Saves the DataFrame to storage |
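Putting the table together, a quick sketch of these actions applied to the `df_filtered` DataFrame from above (the output path is a placeholder; any writable location works):

```python
df_filtered.show()               # print a tabular preview
rows = df_filtered.collect()     # list of Row objects on the driver
n = df_filtered.count()          # total number of rows
first_two = df_filtered.take(2)  # first 2 rows as a list

# Placeholder path; local disk, HDFS, S3, etc. all work here
df_filtered.write.format("csv").mode("overwrite").save("/tmp/adults_csv")
```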
✅ Transformations are lazy; they don't execute immediately.