Apache Spark follows a lazy evaluation model, where Transformations define how data should be processed but do not execute immediately. Instead, execution happens only when an Action is triggered.
Transformations create a new RDD/DataFrame from an existing one; they are recorded in the execution plan but not computed yet.
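To make the snippets below runnable, here is a minimal setup sketch; the session name and the sample data are assumptions for illustration, not part of the original pipeline:

```python
from pyspark.sql import SparkSession

# Assumed local session and toy data so the examples can run end to end
spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()
df = spark.createDataFrame(
    [("Alice", 25), ("Bob", 17), ("Carol", 32)],  # hypothetical rows
    ["name", "age"],
)
```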
```python
df_selected = df.select("name", "age")                  # Select specific columns
df_filtered = df_selected.filter(df_selected.age > 18)  # Filter rows
```
🔹 No execution happens yet! Spark just builds a logical execution plan.
📌 Key Transformations in Spark:
| Transformation | Description | 
|---|---|
| select() | Selects specific columns | 
| filter() | Filters rows based on a condition | 
| groupBy() | Groups data for aggregation | 
| map() | Applies a function to each element (RDD API) | 
| join() | Joins two DataFrames | 
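These transformations compose into a single logical plan. As a sketch (the second DataFrame and its schema are assumptions added for the join), you can chain them and inspect the plan with explain() without triggering a job:

```python
from pyspark.sql import functions as F

# Hypothetical lookup DataFrame to demonstrate join()
df_cities = spark.createDataFrame(
    [("Alice", "Paris"), ("Carol", "Oslo")], ["name", "city"]
)

adults_by_city = (
    df.filter(df.age > 18)                 # still lazy
      .join(df_cities, on="name")          # still lazy
      .groupBy("city")                     # still lazy
      .agg(F.avg("age").alias("avg_age"))  # still lazy
)
adults_by_city.explain()  # Prints the query plan; no Spark job has run yet
```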
Actions trigger execution of the DAG (Directed Acyclic Graph) and return results.
```python
df_filtered.show()  # Executes the entire pipeline and displays results
```
🔹 Now, Spark actually runs the transformations and computes the result.
📌 Common Spark Actions:
| Action | Description | 
|---|---|
| show() | Displays DataFrame content | 
| collect() | Brings all rows to the driver as a list (risky on large data) | 
| count() | Counts the number of rows | 
| take(n) | Returns the first n rows | 
| write.format("csv").save("path") | Saves DataFrame to storage | 
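As a short sketch of these actions against the df_filtered DataFrame from above (the output path is a placeholder assumption):

```python
print(df_filtered.count())    # Triggers a job and returns the row count
print(df_filtered.take(2))    # Returns the first 2 rows to the driver
rows = df_filtered.collect()  # Brings ALL rows to the driver; avoid on large data

# Writing also triggers execution; the path here is an assumed placeholder
df_filtered.write.format("csv").mode("overwrite").save("/tmp/adults_csv")
```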
✅ Transformations are lazy: they don't execute immediately.