Step Snap 1: [Transformations vs Actions]

What Are Transformations and Actions in Spark?

Apache Spark follows a lazy evaluation model, where Transformations define how data should be processed but do not execute immediately. Instead, execution happens only when an Action is triggered.

Spark Transformations (Lazy Execution)

Transformations create a new RDD/DataFrame from an existing one but do not execute immediately.

Example of Lazy Transformations:

df_selected = df.select("name", "age")  # Select specific columns
df_filtered = df_selected.filter(df_selected.age > 18)  # Filter rows

🔹 No execution happens yet! Spark just builds a logical execution plan.

📌 Key Transformations in Spark:

| Transformation | Description |
| --- | --- |
| `select()` | Selects specific columns |
| `filter()` | Filters rows based on a condition |
| `groupBy()` | Groups data for aggregation |
| `map()` | Applies a function to each element (RDD API) |
| `join()` | Joins two DataFrames |

Spark Actions (Eager Execution)

Actions trigger execution of the DAG (Directed Acyclic Graph) and return results.

Example of an Action Triggering Execution:

df_filtered.show()  # Executes the entire pipeline and displays results

🔹 Now, Spark actually runs the transformations and computes the result.

📌 Common Spark Actions:

| Action | Description |
| --- | --- |
| `show()` | Displays DataFrame content |
| `collect()` | Brings data to the driver as a list of Rows |
| `count()` | Counts the number of rows |
| `take(n)` | Fetches the first n rows |
| `write.format("csv").save("path")` | Saves the DataFrame to storage |

Key Takeaways

✅ Transformations are lazy – they don’t execute immediately.