Apache Spark follows a lazy evaluation model: Transformations define how data should be processed but are not executed right away. Execution happens only when an Action is triggered.
Transformations create a new RDD/DataFrame from an existing one; rather than running immediately, each one simply adds a step to Spark's logical plan.
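For context, the `df` used in the snippets below can be any DataFrame with `name` and `age` columns. A minimal, hypothetical setup (the session name and sample rows are made up purely for illustration):

```python
from pyspark.sql import SparkSession

# Hypothetical setup: a local session and a tiny in-memory DataFrame
# with "name" and "age" columns, so the snippets below are runnable.
spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()
df = spark.createDataFrame(
    [("Alice", 25), ("Bob", 17), ("Carol", 30)],
    ["name", "age"],
)
```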
```python
df_selected = df.select("name", "age")                   # Select specific columns
df_filtered = df_selected.filter(df_selected.age > 18)   # Filter rows
```
🔹 No execution happens yet! Spark just builds a logical execution plan.
📌 Key Transformations in Spark:
| Transformation | Description |
|---|---|
| `select()` | Selects specific columns |
| `filter()` | Filters rows based on a condition |
| `groupBy()` | Groups data for aggregation |
| `map()` | Applies a function to each element (RDD API) |
| `join()` | Joins two DataFrames |
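To make the laziness concrete, here is a sketch that chains several of the transformations above, using the hypothetical `df` from earlier plus an assumed second DataFrame `cities` for the join. Nothing is computed; `explain()` only prints the plan Spark has built so far:

```python
# Assumed second DataFrame, only for the join example
cities = spark.createDataFrame(
    [("Alice", "Paris"), ("Carol", "Berlin")],
    ["name", "city"],
)

adults = df.filter(df.age > 18)            # transformation: nothing runs
per_city = adults.join(cities, on="name")  # transformation: nothing runs
counts = per_city.groupBy("city").count()  # transformation: nothing runs

# map() lives on the RDD API rather than the DataFrame API
upper = df.rdd.map(lambda r: (r.name.upper(), r.age))

counts.explain()  # prints the logical/physical plan; no data is processed
```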
Actions trigger execution of the DAG (Directed Acyclic Graph) and return results.
```python
df_filtered.show()  # Executes the entire pipeline and displays results
```
🔹 Now, Spark actually runs the transformations and computes the result.
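One consequence of this model is that each action re-runs the whole plan from the beginning. When a DataFrame feeds several actions, it is common to cache it first; a minimal sketch:

```python
df_filtered.cache()         # mark for reuse; materialized by the next action
print(df_filtered.count())  # triggers execution and populates the cache
df_filtered.show(5)         # served from the cached data, not recomputed
```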
📌 Common Spark Actions:
| Action | Description |
|---|---|
| `show()` | Displays DataFrame content |
| `collect()` | Brings all data to the driver as a list |
| `count()` | Counts the number of rows |
| `take(n)` | Fetches the first `n` rows |
| `write.format("csv").save("path")` | Saves the DataFrame to storage |
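Putting the table together, a quick sketch of these actions applied to the `df_filtered` DataFrame from above (the output path is a placeholder; any writable location works):

```python
df_filtered.show()               # print a tabular preview
rows = df_filtered.collect()     # list of Row objects on the driver
n = df_filtered.count()          # total number of rows
first_two = df_filtered.take(2)  # first 2 rows as a list

# Placeholder path; local disk, HDFS, S3, etc. all work here
df_filtered.write.format("csv").mode("overwrite").save("/tmp/adults_csv")
```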
✅ Transformations are lazy; they don't execute immediately.